This is part of a series on the blog where we explore RISC-V by breaking down real programs and explaining how they work. You can view all posts in this series on the RISC-V Bytes page.

I once took a class on compilers where my professor told us that a CPU is like a human brain: it can store important data and access it quickly, but there is a limit to the amount of data that can be stored. When that limit is reached, it must store data elsewhere. For instance, when doing math, most humans find it useful to write the intermediate steps down on a piece of paper because the larger the computation, the harder it is to keep track of all of its components. Likewise, a CPU can store the most critical data in easy-to-access locations, but must eventually move information farther down the memory hierarchy when the computation becomes sufficiently complex.

Revisiting Our Last Post

In our most recent post, we primarily looked at the easiest-to-access memory locations: registers. We specifically looked at how registers are used to communicate between procedures via calling conventions. However, we also saw that callee-saved registers that needed to be re-used within a procedure, such as the frame pointer (s0), had to have their contents stored on the stack, then loaded back into the appropriate register before returning. Storing these registers on the stack is an example of moving the data down the memory hierarchy.

Let’s look back at the source for that program:

sum.c


#include <stdio.h>

int main()
{
    int num1 = 1;
    int num2 = 2;
    int sum = num1 + num2;

    printf("The sum is: %d", sum);
    return 0;
}

This program is needlessly complex: the result of our addition will always be constant. However, because we compiled without any optimization, these wasteful operations were preserved in the generated assembly:


(gdb) disass main
Dump of assembler code for function main:
   0x0000000000010158 <+0>:     addi       sp,sp,-32
   0x000000000001015a <+2>:     sd         ra,24(sp)
   0x000000000001015c <+4>:     sd         s0,16(sp)
   0x000000000001015e <+6>:     addi       s0,sp,32
   0x0000000000010160 <+8>:     li         a5,1
   0x0000000000010162 <+10>:    sw         a5,-20(s0)
   0x0000000000010166 <+14>:    li         a5,2
   0x0000000000010168 <+16>:    sw         a5,-24(s0)
   0x000000000001016c <+20>:    lw         a4,-20(s0)
   0x0000000000010170 <+24>:    lw         a5,-24(s0)
   0x0000000000010174 <+28>:    addw       a5,a5,a4
   0x0000000000010176 <+30>:    sw         a5,-28(s0)
   0x000000000001017a <+34>:    lw         a5,-28(s0)
   0x000000000001017e <+38>:    mv         a1,a5
   0x0000000000010180 <+40>:    lui        a5,0x1c
   0x0000000000010182 <+42>:    addi       a0,a5,176 # 0x1c0b0
   0x0000000000010186 <+46>:    jal        ra,0x10332 <printf>
   0x000000000001018a <+50>:    li         a5,0
   0x000000000001018c <+52>:    mv         a0,a5
   0x000000000001018e <+54>:    ld         ra,24(sp)
   0x0000000000010190 <+56>:    ld         s0,16(sp)
   0x0000000000010192 <+58>:    addi       sp,sp,32
   0x0000000000010194 <+60>:    ret
End of assembler dump.

View on Compiler Explorer

See the first post in this series for how to set up cross-platform compilation and debugging for RISC-V.

In fact, the generated assembly is even more wasteful. Ignoring the function prologue and epilogue, the procedure body not only performs all of our computations (<+28>), but also does not make use of all available registers, forcing us to store all data on the stack. A particularly egregious example is when we initialize num1 (<+8>) and num2 (<+14>), using a5 in both cases, forcing each value to be stored on the stack (<+10>, <+16>).

If we employed full optimization by passing -O3 to our compiler, we would get a much more sensible output where we skip addition altogether, instead loading 3 as an immediate value, which will always be the result of the operation (<+4>).

riscv64-unknown-elf-gcc -O3 sum.c


(gdb) disass main
Dump of assembler code for function main:
   0x00000000000100b0 <+0>:     lui     a0,0x1c
   0x00000000000100b2 <+2>:     addi    sp,sp,-16
   0x00000000000100b4 <+4>:     li      a1,3
   0x00000000000100b6 <+6>:     addi    a0,a0,144 # 0x1c090
   0x00000000000100ba <+10>:    sd      ra,8(sp)
   0x00000000000100bc <+12>:    jal     ra,0x1030c <printf>
   0x00000000000100c0 <+16>:    ld      ra,8(sp)
   0x00000000000100c2 <+18>:    li      a0,0
   0x00000000000100c4 <+20>:    addi    sp,sp,16
   0x00000000000100c6 <+22>:    ret
End of assembler dump.

View on Compiler Explorer

What we are illustrating here is efficient use of registers, avoiding moving down the memory hierarchy unless we absolutely have to, such as when storing the return address of our caller (<+10>).
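At the source level, the -O3 output behaves as if the addition had been done ahead of time. The following is a minimal C sketch of that equivalence, not something taken from the compiler itself; the function names sum_as_written, sum_as_folded, and print_sum are hypothetical and exist only to put the two forms side by side.

```c
#include <stdio.h>

/* The computation as written in sum.c: two locals and an addition. */
int sum_as_written(void)
{
    int num1 = 1;
    int num2 = 2;
    return num1 + num2;
}

/* What the optimizer effectively reduces it to: a constant that can be
 * loaded as an immediate (the `li a1,3` in the -O3 disassembly). */
int sum_as_folded(void)
{
    return 3;
}

/* Hypothetical helper: prints the same message as sum.c for either form. */
void print_sum(int sum)
{
    printf("The sum is: %d", sum);
}
```

Both functions return the same value, which is why the compiler is free to replace one with the other and skip the stack traffic entirely.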

Sharing “Large” Data

Today we want to look at what happens when we are passing data between procedures and we have too much data to store in our argument registers. Let’s take another look at our general purpose registers in RISC-V:

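| Name    | ABI Mnemonic | Calling Convention  | Preserved across calls? |
| ------- | ------------ | ------------------- | ----------------------- |
| x0      | zero         | Zero                | n/a                     |
| x1      | ra           | Return address      | No                      |
| x2      | sp           | Stack pointer       | Yes                     |
| x3      | gp           | Global pointer      | n/a                     |
| x4      | tp           | Thread pointer      | n/a                     |
| x5-x7   | t0-t2        | Temporary registers | No                      |
| x8-x9   | s0-s1        | Saved registers     | Yes                     |
| x10-x17 | a0-a7        | Argument registers  | No                      |
| x18-x27 | s2-s11       | Saved registers     | Yes                     |
| x28-x31 | t3-t6        | Temporary registers | No                      |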

The “argument registers” are where we store data that we want to share with a procedure we are calling. When passing minimal data between procedures, this isn’t a problem:

minimal.c


#include <stdio.h>

int sum(int one, int two) {
    return one + two;
}

int main() {
    printf("The sum is: %d\n", sum(1, 2));
    return 0;
}

riscv64-unknown-elf-gcc -O3 -fno-inline minimal.c


(gdb) disass main
Dump of assembler code for function main:
   0x00000000000100b0 <+0>:     addi    sp,sp,-16
   0x00000000000100b2 <+2>:     li      a1,2
   0x00000000000100b4 <+4>:     li      a0,1
   0x00000000000100b6 <+6>:     sd      ra,8(sp)
   0x00000000000100b8 <+8>:     jal     ra,0x10178 <sum>
   0x00000000000100bc <+12>:    mv      a1,a0
   0x00000000000100be <+14>:    lui     a0,0x1c
   0x00000000000100c0 <+16>:    addi    a0,a0,160 # 0x1c0a0
   0x00000000000100c4 <+20>:    jal     ra,0x10318 <printf>
   0x00000000000100c8 <+24>:    ld      ra,8(sp)
   0x00000000000100ca <+26>:    li      a0,0
   0x00000000000100cc <+28>:    addi    sp,sp,16
   0x00000000000100ce <+30>:    ret
End of assembler dump.
(gdb) disass sum
Dump of assembler code for function sum:
   0x0000000000010178 <+0>:     addw    a0,a0,a1
   0x000000000001017a <+2>:     ret
End of assembler dump.

View on Compiler Explorer

We pass -fno-inline during compilation because we want to preserve the call to sum and the passing of data between the procedures. Without it, at any optimization level >= 1, GCC will inline the sum function.
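If disabling inlining for the whole file feels too broad, GCC also accepts a per-function noinline attribute. The sketch below is an alternative to the flag, not what was used to produce the disassembly in this post, and the generated code is not guaranteed to match it exactly.

```c
/* GCC (and Clang) extension: ask the compiler not to inline this one
 * function, instead of passing -fno-inline for the whole file. */
__attribute__((noinline)) int sum(int one, int two)
{
    return one + two;
}
```

With this version, calling sum(1, 2) from main should still produce a real call in the output even at -O3, although other interprocedural optimizations may still apply.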

We load our arguments into our argument registers (main:<+2>, main:<+4>), then perform our addition in sum using those registers. We even re-use a0 to pass our return value back to main (sum:<+0>), which we are permitted to do because argument registers are not callee-saved (RISC-V calling conventions also specify that a0 and a1 are to be used for return values).
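To see why a1 is mentioned alongside a0, consider a function that returns two values at once. The sketch below is an illustration under assumptions: the struct pair type and the divide function are hypothetical, and it relies on the RV64 convention that a small aggregate of up to two register-widths is returned in a0 and a1. Compiling it and inspecting the disassembly is the way to confirm that for your toolchain.

```c
/* Hypothetical two-field result: 16 bytes on RV64 (two 64-bit longs). */
struct pair {
    long quotient;
    long remainder;
};

/* Returns both results by value. Assumes divisor != 0. */
struct pair divide(long dividend, long divisor)
{
    struct pair result;
    result.quotient = dividend / divisor;
    result.remainder = dividend % divisor;
    return result;
}
```

A single int, as in sum, only needs a0; the second return register comes into play for results like this one that do not fit in one register.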

So what happens when we can’t fit all of our arguments into the argument registers? Similar to how we preserved register contents within a procedure by storing them on the stack, we can also pass data between procedures on the stack. Let’s expand our minimal example with more data:

passonstack.c


#include <stdio.h>

int sum(int one, int two, int three, int four, int five, int six, int seven, int eight, int nine) {
    return one + two + three + four + five + six + seven + eight + nine;
}

int main() {
    printf("The sum is: %d\n", sum(1, 2, 3, 4, 5, 6, 7, 8, 9));
    return 0;
}

riscv64-unknown-elf-gcc -O3 -fno-inline passonstack.c


(gdb) disass main
Dump of assembler code for function main:
   0x00000000000100b0 <+0>:     addi    sp,sp,-32
   0x00000000000100b2 <+2>:     li      a1,9
   0x00000000000100b4 <+4>:     sd      a1,0(sp)
   0x00000000000100b6 <+6>:     li      a7,8
   0x00000000000100b8 <+8>:     li      a6,7
   0x00000000000100ba <+10>:    li      a5,6
   0x00000000000100bc <+12>:    li      a4,5
   0x00000000000100be <+14>:    li      a3,4
   0x00000000000100c0 <+16>:    li      a2,3
   0x00000000000100c2 <+18>:    li      a1,2
   0x00000000000100c4 <+20>:    li      a0,1
   0x00000000000100c6 <+22>:    sd      ra,24(sp)
   0x00000000000100c8 <+24>:    jal     ra,0x10188 <sum>
   0x00000000000100cc <+28>:    mv      a1,a0
   0x00000000000100ce <+30>:    lui     a0,0x1c
   0x00000000000100d0 <+32>:    addi    a0,a0,192 # 0x1c0c0
   0x00000000000100d4 <+36>:    jal     ra,0x1033c <printf>
   0x00000000000100d8 <+40>:    ld      ra,24(sp)
   0x00000000000100da <+42>:    li      a0,0
   0x00000000000100dc <+44>:    addi    sp,sp,32
   0x00000000000100de <+46>:    ret
End of assembler dump.
(gdb) disass sum
Dump of assembler code for function sum:
   0x0000000000010188 <+0>:     addw    a1,a1,a0
   0x000000000001018a <+2>:     addw    a1,a1,a2
   0x000000000001018c <+4>:     addw    a1,a1,a3
   0x000000000001018e <+6>:     addw    a1,a1,a4
   0x0000000000010190 <+8>:     addw    a1,a1,a5
   0x0000000000010192 <+10>:    lw      a0,0(sp)
   0x0000000000010194 <+12>:    addw    a1,a1,a6
   0x0000000000010198 <+16>:    addw    a1,a1,a7
   0x000000000001019c <+20>:    addw    a0,a0,a1
   0x000000000001019e <+22>:    ret
End of assembler dump.

View on Compiler Explorer

The concept of storing data on the stack when we run out of registers is commonly referred to as “register spilling”. Compilers typically try to spill as few registers as possible.

We once again are utilizing our argument registers to pass our arguments to sum, but because we are passing nine integers and only have eight argument registers, we must store one of our arguments on the stack. How do we know where to place our “spilled” argument on the stack? The RISC-V calling conventions specify:

The stack grows downwards (towards lower addresses) and the stack pointer shall be aligned to a 128-bit boundary upon procedure entry. The first argument passed on the stack is located at offset zero of the stack pointer on function entry; following arguments are stored at correspondingly higher addresses.
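The rule quoted above can be sketched as a small lookup. This is a simplified model, not part of any toolchain: it assumes RV64, integer arguments no wider than a register, eight argument registers (a0-a7), and one 8-byte stack slot per spilled argument. The names arg_location and locate_int_arg are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define NUM_ARG_REGS 8   /* a0 through a7 */
#define STACK_SLOT   8   /* bytes per stack slot on RV64 (assumption) */

/* Hypothetical description of where one integer argument lives on entry. */
struct arg_location {
    bool   in_register;   /* true: passed in a<reg>; false: on the stack */
    int    reg;           /* argument register number, valid if in_register */
    size_t stack_offset;  /* byte offset from sp, valid if !in_register */
};

/* index is zero-based: index 0 is the first argument. */
struct arg_location locate_int_arg(size_t index)
{
    struct arg_location loc = { false, -1, 0 };
    if (index < NUM_ARG_REGS) {
        loc.in_register = true;
        loc.reg = (int)index;
    } else {
        /* First stack argument sits at offset zero; later ones go higher. */
        loc.stack_offset = (index - NUM_ARG_REGS) * STACK_SLOT;
    }
    return loc;
}
```

For the nine-argument sum, locate_int_arg(8) reports stack offset 0, which matches the lw a0,0(sp) in the disassembly of sum.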

We could test this out by passing a tenth argument and seeing that it is stored at an offset of 8 bytes from the stack pointer:


(gdb) disass sum
Dump of assembler code for function sum:
   0x000000000001018c <+0>:     addw    a1,a1,a0
   0x000000000001018e <+2>:     addw    a1,a1,a2
   0x0000000000010190 <+4>:     addw    a1,a1,a3
   0x0000000000010192 <+6>:     addw    a1,a1,a4
   0x0000000000010194 <+8>:     addw    a1,a1,a5
   0x0000000000010196 <+10>:    addw    a1,a1,a6
   0x000000000001019a <+14>:    addw    a1,a1,a7
   0x000000000001019e <+18>:    lw      a7,0(sp)
   0x00000000000101a0 <+20>:    lw      a0,8(sp)
   0x00000000000101a2 <+22>:    addw    a1,a1,a7
   0x00000000000101a6 <+26>:    addw    a0,a0,a1
   0x00000000000101a8 <+28>:    ret
End of assembler dump.

View on Compiler Explorer
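The source for this ten-argument variant is not listed in the post. A sketch of what it presumably looks like, assuming the only change to passonstack.c is an extra parameter (the name ten is an assumption), is:

```c
/* passonstack.c extended with a hypothetical tenth parameter. On RV64 the
 * first eight arguments travel in a0-a7; `nine` and `ten` are read from
 * the stack at 0(sp) and 8(sp) respectively. */
int sum(int one, int two, int three, int four, int five, int six, int seven,
        int eight, int nine, int ten)
{
    return one + two + three + four + five + six + seven + eight + nine + ten;
}
```

Calling it as sum(1, 2, 3, 4, 5, 6, 7, 8, 9, 10) from main, with the same printf and compiler flags as before, should reproduce the two stack loads shown in the disassembly.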

These are clearly contrived examples (exemplified by the fact that we have to force the compiler not to eliminate our call to the sum function entirely), but they serve to get us thinking about how the data we share between procedures affects our memory access patterns.

Concluding Thoughts

Understanding the memory hierarchy of a computer, and which operations force a move to a lower (and slower) level of it, allows us to be more effective programmers. While disassembling and examining every function in a program is not feasible, building up an intuition for how a certain operation may impact the performance of an application can lead to better designed systems.

As always, these posts are meant to serve as a useful resource for folks who are interested in learning more about RISC-V and low-level software in general. If I can do a better job of reaching that goal, or you have any questions or comments, please feel free to send me a message @hasheddan on Twitter!