This is part of a series on the blog where we explore RISC-V by breaking down real programs and explaining how they work. You can view all posts in this series on the RISC-V Bytes page.
I once took a class on compilers where my professor told us that a CPU is like a human brain: it can store important data and access it quickly, but there is a limit to the amount of data that can be stored. When that limit is reached, it must store data elsewhere. For instance, when doing math, most humans find it useful to write different steps of the operations down on a piece of paper because the larger the computation, the harder it is to keep track of all of its components. Likewise, a CPU can store the most critical data in easy to access locations, but must eventually move information farther down the memory hierarchy when the computation becomes sufficiently complex.
Revisiting Our Last Post
In our most recent post, we primarily looked at the easiest-to-access memory locations: registers. We specifically looked at how registers are used to communicate between procedures via calling conventions. However, we also saw that callee-saved registers that needed to be re-used within a procedure, such as the frame pointer (s0), had to have their contents stored on the stack, then loaded back into the appropriate register before returning. Storing these registers on the stack is an example of moving data down the memory hierarchy.
Let’s look back at the source for that program:
sum.c
#include <stdio.h>

int main()
{
    int num1 = 1;
    int num2 = 2;
    int sum = num1 + num2;
    printf("The sum is: %d", sum);
    return 0;
}
This program is needlessly complex: the result of our addition will always be constant. However, because we compiled without any optimization, these wasteful operations were preserved in the generated assembly:
(gdb) disass main
Dump of assembler code for function main:
0x0000000000010158 <+0>: addi sp,sp,-32
0x000000000001015a <+2>: sd ra,24(sp)
0x000000000001015c <+4>: sd s0,16(sp)
0x000000000001015e <+6>: addi s0,sp,32
0x0000000000010160 <+8>: li a5,1
0x0000000000010162 <+10>: sw a5,-20(s0)
0x0000000000010166 <+14>: li a5,2
0x0000000000010168 <+16>: sw a5,-24(s0)
0x000000000001016c <+20>: lw a4,-20(s0)
0x0000000000010170 <+24>: lw a5,-24(s0)
0x0000000000010174 <+28>: addw a5,a5,a4
0x0000000000010176 <+30>: sw a5,-28(s0)
0x000000000001017a <+34>: lw a5,-28(s0)
0x000000000001017e <+38>: mv a1,a5
0x0000000000010180 <+40>: lui a5,0x1c
0x0000000000010182 <+42>: addi a0,a5,176 # 0x1c0b0
0x0000000000010186 <+46>: jal ra,0x10332 <printf>
0x000000000001018a <+50>: li a5,0
0x000000000001018c <+52>: mv a0,a5
0x000000000001018e <+54>: ld ra,24(sp)
0x0000000000010190 <+56>: ld s0,16(sp)
0x0000000000010192 <+58>: addi sp,sp,32
0x0000000000010194 <+60>: ret
End of assembler dump.
See the first post in this series for how to set up cross-platform compilation and debugging for RISC-V.
In fact, the generated assembly is even more wasteful than the source suggests. Ignoring the function prologue and epilogue, the procedure body not only performs the addition at runtime (<+28>), but also fails to make use of the available registers, so every value ends up stored on the stack. A particularly egregious example is when we initialize num1 (<+8>) and num2 (<+14>), using a5 in both cases, forcing each value to be stored on the stack (<+10>, <+16>).
If we employed full optimization by passing -O3 to our compiler, we would get a much more sensible output where we skip the addition altogether, instead loading 3, which will always be the result of the operation, as an immediate value (<+4>).
riscv64-unknown-elf-gcc -O3 sum.c
(gdb) disass main
Dump of assembler code for function main:
0x00000000000100b0 <+0>: lui a0,0x1c
0x00000000000100b2 <+2>: addi sp,sp,-16
0x00000000000100b4 <+4>: li a1,3
0x00000000000100b6 <+6>: addi a0,a0,144 # 0x1c090
0x00000000000100ba <+10>: sd ra,8(sp)
0x00000000000100bc <+12>: jal ra,0x1030c <printf>
0x00000000000100c0 <+16>: ld ra,8(sp)
0x00000000000100c2 <+18>: li a0,0
0x00000000000100c4 <+20>: addi sp,sp,16
0x00000000000100c6 <+22>: ret
End of assembler dump.
What we are illustrating here is efficient use of registers, avoiding moving down the memory hierarchy unless we absolutely have to, such as when storing the return address of our caller (<+10>).
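To see the same contrast from the C side, here is a minimal sketch. The function names are hypothetical, and the note about generated code is an assumption about how GCC typically treats volatile objects, not output captured for this post: marking the locals volatile obliges the compiler to perform every read and write in memory, so a store/load pattern like the one in the unoptimized dump should survive even at -O3, while the plain version is free to be folded to the constant 3.

```c
/* Hypothetical helper: volatile makes every access to num1 and num2 an
 * observable memory access, so the compiler may not keep the values only
 * in registers or fold the addition away. */
int sum_in_memory(void)
{
    volatile int num1 = 1;
    volatile int num2 = 2;
    return num1 + num2;
}

/* Hypothetical helper: without volatile, an optimizing compiler is free
 * to replace the whole body with "return 3". */
int sum_folded(void)
{
    int num1 = 1;
    int num2 = 2;
    return num1 + num2;
}
```

Both functions return the same value; the difference would only show up when disassembling them.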
Sharing “Large” Data
Today we want to look at what happens when we are passing data between procedures and we have too much data to store in our argument registers. Let’s take another look at our general purpose registers in RISC-V:
Name | ABI Mnemonic | Calling Convention | Preserved across calls? |
---|---|---|---|
x0 | zero | Zero | n/a |
x1 | ra | Return address | No |
x2 | sp | Stack pointer | Yes |
x3 | gp | Global pointer | n/a |
x4 | tp | Thread pointer | n/a |
x5-x7 | t0-t2 | Temporary registers | No |
x8-x9 | s0-s1 | Saved registers | Yes |
x10-x17 | a0-a7 | Argument registers | No |
x18-x27 | s2-s11 | Saved registers | Yes |
x28-x31 | t3-t6 | Temporary registers | No |
The “argument registers” are where we store data that we want to share with a procedure we are calling. When passing minimal data between procedures, this isn’t a problem:
minimal.c
#include <stdio.h>

int sum(int one, int two) {
    return one + two;
}

int main() {
    printf("The sum is: %d\n", sum(1, 2));
    return 0;
}
riscv64-unknown-elf-gcc -O3 -fno-inline minimal.c
(gdb) disass main
Dump of assembler code for function main:
0x00000000000100b0 <+0>: addi sp,sp,-16
0x00000000000100b2 <+2>: li a1,2
0x00000000000100b4 <+4>: li a0,1
0x00000000000100b6 <+6>: sd ra,8(sp)
0x00000000000100b8 <+8>: jal ra,0x10178 <sum>
0x00000000000100bc <+12>: mv a1,a0
0x00000000000100be <+14>: lui a0,0x1c
0x00000000000100c0 <+16>: addi a0,a0,160 # 0x1c0a0
0x00000000000100c4 <+20>: jal ra,0x10318 <printf>
0x00000000000100c8 <+24>: ld ra,8(sp)
0x00000000000100ca <+26>: li a0,0
0x00000000000100cc <+28>: addi sp,sp,16
0x00000000000100ce <+30>: ret
End of assembler dump.
(gdb) disass sum
Dump of assembler code for function sum:
0x0000000000010178 <+0>: addw a0,a0,a1
0x000000000001017a <+2>: ret
End of assembler dump.
We pass -fno-inline during compilation because we want to preserve the call to sum and the passing of data between the procedures. Without it, at any optimization level >= 1, GCC will inline the sum function.
We load our arguments into our argument registers (main:<+2>, main:<+4>), then perform our addition in sum using those registers. We even re-use a0 to pass our return value back to main (sum:<+0>), which we are permitted to do because argument registers are not callee-saved (the RISC-V calling conventions also specify that a0 and a1 are to be used for return values).
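As an illustration of a1 also carrying a return value, here is a small sketch. The struct and function are hypothetical and do not appear in minimal.c. Under the standard RV64 calling convention, an aggregate of two 64-bit integers is small enough that it would be expected to come back in a0 and a1 rather than through memory; that is an expectation based on the convention, not a dump captured for this post.

```c
/* Hypothetical type for illustration: two integer fields. On RV64 (LP64)
 * long is 64 bits, so the struct is 16 bytes, i.e. two registers wide. */
struct pair {
    long sum;
    long product;
};

/* Hypothetical function: returns both results by value. With the standard
 * calling convention the two fields would be expected in a0 and a1. */
struct pair sum_and_product(long one, long two) {
    struct pair result = { one + two, one * two };
    return result;
}
```

Compiling this with the same -O3 -fno-inline flags and disassembling sum_and_product would be one way to check which registers are actually used.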
So what happens when we can’t fit all of our arguments into the argument registers? Similar to how we preserved register contents within a procedure by storing them on the stack, we can also pass data between procedures on the stack. Let’s expand our minimal example with more data:
passonstack.c
#include <stdio.h>

int sum(int one, int two, int three, int four, int five, int six, int seven, int eight, int nine) {
    return one + two + three + four + five + six + seven + eight + nine;
}

int main() {
    printf("The sum is: %d\n", sum(1, 2, 3, 4, 5, 6, 7, 8, 9));
    return 0;
}
riscv64-unknown-elf-gcc -O3 -fno-inline passonstack.c
(gdb) disass main
Dump of assembler code for function main:
0x00000000000100b0 <+0>: addi sp,sp,-32
0x00000000000100b2 <+2>: li a1,9
0x00000000000100b4 <+4>: sd a1,0(sp)
0x00000000000100b6 <+6>: li a7,8
0x00000000000100b8 <+8>: li a6,7
0x00000000000100ba <+10>: li a5,6
0x00000000000100bc <+12>: li a4,5
0x00000000000100be <+14>: li a3,4
0x00000000000100c0 <+16>: li a2,3
0x00000000000100c2 <+18>: li a1,2
0x00000000000100c4 <+20>: li a0,1
0x00000000000100c6 <+22>: sd ra,24(sp)
0x00000000000100c8 <+24>: jal ra,0x10188 <sum>
0x00000000000100cc <+28>: mv a1,a0
0x00000000000100ce <+30>: lui a0,0x1c
0x00000000000100d0 <+32>: addi a0,a0,192 # 0x1c0c0
0x00000000000100d4 <+36>: jal ra,0x1033c <printf>
0x00000000000100d8 <+40>: ld ra,24(sp)
0x00000000000100da <+42>: li a0,0
0x00000000000100dc <+44>: addi sp,sp,32
0x00000000000100de <+46>: ret
End of assembler dump.
(gdb) disass sum
Dump of assembler code for function sum:
0x0000000000010188 <+0>: addw a1,a1,a0
0x000000000001018a <+2>: addw a1,a1,a2
0x000000000001018c <+4>: addw a1,a1,a3
0x000000000001018e <+6>: addw a1,a1,a4
0x0000000000010190 <+8>: addw a1,a1,a5
0x0000000000010192 <+10>: lw a0,0(sp)
0x0000000000010194 <+12>: addw a1,a1,a6
0x0000000000010198 <+16>: addw a1,a1,a7
0x000000000001019c <+20>: addw a0,a0,a1
0x000000000001019e <+22>: ret
End of assembler dump.
The concept of storing data on the stack when we run out of registers is commonly referred to as “register spilling”. Compilers typically try to minimize spilling, since each spill adds memory accesses.
We are once again using our argument registers to pass our arguments to sum, but because we are passing nine integers and only have eight argument registers, we must store one of our arguments on the stack. How do we know where to place our “spilled” argument on the stack? The RISC-V calling conventions specify:
The stack grows downwards (towards lower addresses) and the stack pointer shall be aligned to a 128-bit boundary upon procedure entry. The first argument passed on the stack is located at offset zero of the stack pointer on function entry; following arguments are stored at correspondingly higher addresses.
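The rule quoted above can be sketched as a small model. Everything here is hypothetical naming for illustration, and it assumes RV64 with plain integer arguments that each fit in one register, so every stack-passed argument occupies one 8-byte slot:

```c
#include <stdbool.h>
#include <stddef.h>

#define NUM_ARG_REGS 8      /* a0-a7 */
#define STACK_SLOT_BYTES 8  /* assumption: RV64, one 8-byte slot per argument */

/* Hypothetical description of where one integer argument lives on entry. */
struct arg_location {
    bool in_register;  /* true: passed in a<reg>; false: passed on the stack */
    size_t reg;        /* argument register number, valid if in_register */
    size_t sp_offset;  /* byte offset from sp at entry, valid otherwise */
};

/* index is zero-based: 0 is the first argument. This only models integer
 * arguments no wider than a register. */
struct arg_location locate_int_arg(size_t index)
{
    struct arg_location loc = { false, 0, 0 };
    if (index < NUM_ARG_REGS) {
        /* The first eight arguments travel in a0-a7. */
        loc.in_register = true;
        loc.reg = index;
    } else {
        /* Later arguments start at offset zero from sp and move upwards. */
        loc.sp_offset = (index - NUM_ARG_REGS) * STACK_SLOT_BYTES;
    }
    return loc;
}
```

In this model, locate_int_arg(8), the ninth argument, lands at offset 0 from the stack pointer, which matches the lw a0,0(sp) in the disassembly of sum.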
We could test this out by passing a tenth argument and seeing that it is stored at an offset of 8 bytes from the stack pointer:
(gdb) disass sum
Dump of assembler code for function sum:
0x000000000001018c <+0>: addw a1,a1,a0
0x000000000001018e <+2>: addw a1,a1,a2
0x0000000000010190 <+4>: addw a1,a1,a3
0x0000000000010192 <+6>: addw a1,a1,a4
0x0000000000010194 <+8>: addw a1,a1,a5
0x0000000000010196 <+10>: addw a1,a1,a6
0x000000000001019a <+14>: addw a1,a1,a7
0x000000000001019e <+18>: lw a7,0(sp)
0x00000000000101a0 <+20>: lw a0,8(sp)
0x00000000000101a2 <+22>: addw a1,a1,a7
0x00000000000101a6 <+26>: addw a0,a0,a1
0x00000000000101a8 <+28>: ret
End of assembler dump.
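The source for this ten-argument variant is not shown here. A plausible reconstruction, assuming it simply extends sum from passonstack.c with one more parameter (the parameter name ten is my own choice), would be:

```c
/* Assumed ten-argument variant of sum from passonstack.c. The first eight
 * arguments fit in a0-a7; nine and ten would be passed on the stack. */
int sum(int one, int two, int three, int four, int five, int six, int seven, int eight, int nine, int ten) {
    return one + two + three + four + five + six + seven + eight + nine + ten;
}
```

Calling it as sum(1, 2, 3, 4, 5, 6, 7, 8, 9, 10) from main and compiling with the same -O3 -fno-inline flags should produce a dump along the lines of the one above, with the program reporting a sum of 55.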
These are clearly contrived examples (exemplified by the fact that we have to force the compiler not to eliminate our call to the sum function entirely), but they serve to get us thinking about how the data we share between procedures affects our memory access patterns.
Concluding Thoughts
Understanding the memory hierarchy of a computer, and which operations cause it to move to a lower (and slower) level in that hierarchy, allows us to be more effective programmers. While disassembling and examining every function in a program is not a feasible option, building up an intuition for how a certain operation may impact the performance of an application can lead to better designed systems.
As always, these posts are meant to serve as a useful resource for folks who are interested in learning more about RISC-V and low-level software in general. If I can do a better job of reaching that goal, or you have any questions or comments, please feel free to send me a message @hasheddan on Twitter!