In my recent post on
the PSA Crypto API, I
demonstrated the use of the API on two different MCUs: the
nRF52840 and the
ESP32-S3. In the case of
the former, the ECDSA signature operation was eventually executed in a closed
source library that manages communication between the Arm Cortex-M4 processor
and the Arm TrustZone CryptoCell
310
security subsystem. Readers who ventured down the rabbit hole of links in the
post may have noticed that there are
variants
of the nrf_cc310_mbedcrypto libraries for hard-float and soft-float. If
you have ever hit a linker error like the following, you know
exactly why.
ld.bfd: error: X uses VFP register arguments, Y does not
ld.bfd: failed to merge target specific data of file
Arm
defines
three floating point Application Binary Interface (ABI) options, which are
controlled by the -mfloat-abi compiler flag.
soft: Soft ABI without FPU hardware: All floating-point operations are handled by the runtime library functions. Values are passed through the integer register bank.
softfp: Soft ABI with FPU hardware: This allows the compiled code to directly access the FPU. But, if a calculation needs to use a runtime library function, a soft-float calling convention is used. Values are passed through the integer register bank.
hard: Hard ABI: This allows the compiled code to directly access the FPU and use FPU-specific calling conventions when calling runtime library functions.
Arm, like most Instruction Set Architectures (ISAs), passes arguments to
subroutines
in general purpose registers (GPRs), specifically r0-r3. When the number or
size of arguments exceeds the available GPRs, the remaining arguments are
“spilled”
to the stack, where they can be accessed by the callee. However, when a
processor includes a Floating Point Unit (FPU) (more specifically for Armv7-M
processors, the CP10 and CP11
coprocessors),
and thus the floating point
extension,
there is an additional register bank with 32 floating point
registers
(s0-s31).
Side note: you may see the term Vector Floating Point (VFP) when referring to floating point on Armv7-M processors, such as the Cortex-M4. The reference manual explains why this is the case: “In the ARMv7-A and ARMv7-R architecture profiles, floating point instructions are called VFP instructions and have mnemonics starting with V. Because ARM assembler is highly consistent across architecture versions and profiles, ARMv7-M retains these mnemonics, but normally describes the instructions as floating point instructions, or FP instructions.”
When using the hard ABI, the s0-s15 registers can be used for passing
arguments to
subroutines.
The use of hard also indicates that floating point instructions (load and
store,
register
transfer,
data
processing)
may be used within routines.
When using softfp, floating point instructions are allowed within routines,
but arguments cannot be passed in floating point registers. soft uses the same
calling convention as softfp, and is thus compatible, but does not allow for
the use of floating point instructions. When floating point operations are
performed without support for floating point instructions, they must be
emulated in
software.
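To make the register assignment concrete, here is a minimal sketch in C that models where each argument lands under the three ABIs. It is a deliberately simplified model, not the full procedure call standard: it only handles 32-bit integers and single-precision floats, it ignores doubles, structs, and variadic functions, and every name in it (assign_arg_locations, enum float_abi, enum arg_kind) is made up for illustration.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical types for this sketch only. */
enum arg_kind { ARG_INT32, ARG_FLOAT };
enum float_abi { ABI_SOFT, ABI_SOFTFP, ABI_HARD };

/*
 * Write the location of each argument ("r0", "s1", "stack", ...) into out[i].
 * Simplification: only 32-bit integers and single-precision floats.
 */
void assign_arg_locations(enum float_abi abi, const enum arg_kind *args,
                          size_t n, char out[][8])
{
    unsigned next_gpr = 0; /* next free register in r0-r3 */
    unsigned next_fpr = 0; /* next free register in s0-s15 */

    for (size_t i = 0; i < n; i++) {
        if (args[i] == ARG_FLOAT && abi == ABI_HARD) {
            /* hard: floats travel in the floating point register bank */
            if (next_fpr < 16) {
                snprintf(out[i], 8, "s%u", next_fpr++);
            } else {
                snprintf(out[i], 8, "stack");
            }
        } else {
            /* soft and softfp: floats share r0-r3 with integer arguments */
            if (next_gpr < 4) {
                snprintf(out[i], 8, "r%u", next_gpr++);
            } else {
                snprintf(out[i], 8, "stack");
            }
        }
    }
}
```

For a signature like (float, int, float), the model yields s0, r0, s1 for hard, and r0, r1, r2 for both soft and softfp, which is exactly why objects built with different calling conventions cannot safely call each other.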
When you see the error described at the beginning of this post, you are mixing
soft/softfp with hard, which the linker will refuse. It is able to
determine the ABI of an object file being linked by looking at the Arm
attributes section, which differs for each variant. For example, on the
nRF52840, the attributes appear as follows (extracted via readelf).
hard
Attribute Section: aeabi
File Attributes
Tag_CPU_name: "7E-M"
Tag_CPU_arch: v7E-M
Tag_CPU_arch_profile: Microcontroller
Tag_THUMB_ISA_use: Thumb-2
Tag_FP_arch: VFPv4-D16
Tag_ABI_PCS_wchar_t: 4
Tag_ABI_FP_denormal: Needed
Tag_ABI_FP_exceptions: Needed
Tag_ABI_FP_number_model: IEEE 754
Tag_ABI_align_needed: 8-byte
Tag_ABI_align_preserved: 8-byte, except leaf SP
Tag_ABI_enum_size: small
Tag_ABI_HardFP_use: SP only
Tag_ABI_VFP_args: VFP registers
Tag_ABI_optimization_goals: Aggressive Speed
Tag_CPU_unaligned_access: v6
softfp
Attribute Section: aeabi
File Attributes
Tag_CPU_name: "7E-M"
Tag_CPU_arch: v7E-M
Tag_CPU_arch_profile: Microcontroller
Tag_THUMB_ISA_use: Thumb-2
Tag_FP_arch: VFPv4-D16
Tag_ABI_PCS_wchar_t: 4
Tag_ABI_FP_rounding: Needed
Tag_ABI_FP_denormal: Needed
Tag_ABI_FP_exceptions: Needed
Tag_ABI_FP_user_exceptions: Needed
Tag_ABI_FP_number_model: IEEE 754
Tag_ABI_align_needed: 8-byte
Tag_ABI_enum_size: small
Tag_ABI_HardFP_use: SP only
Tag_ABI_optimization_goals: Aggressive Size
Tag_CPU_unaligned_access: v6
Tag_ABI_FP_16bit_format: IEEE 754
soft
Attribute Section: aeabi
File Attributes
Tag_CPU_name: "7E-M"
Tag_CPU_arch: v7E-M
Tag_CPU_arch_profile: Microcontroller
Tag_THUMB_ISA_use: Thumb-2
Tag_ABI_PCS_wchar_t: 4
Tag_ABI_FP_denormal: Needed
Tag_ABI_FP_exceptions: Needed
Tag_ABI_FP_number_model: IEEE 754
Tag_ABI_align_needed: 8-byte
Tag_ABI_align_preserved: 8-byte, except leaf SP
Tag_ABI_enum_size: small
Tag_ABI_optimization_goals: Aggressive Speed
Tag_CPU_unaligned_access: v6
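The attributes above are what the linker sees. The compiler exposes the same choice to source code through predefined macros, which can be handy for a sanity check inside a library. The following is a small sketch, assuming GCC or Clang targeting 32-bit Arm, where (to my understanding) __ARM_PCS_VFP is defined for the hard calling convention and __ARM_FP is defined only when floating point instructions may be generated. The function name float_abi_name is hypothetical.

```c
/*
 * Report the floating point ABI this translation unit was compiled for.
 * Assumption: GCC/Clang predefined macros behave as described in the lead-in.
 */
const char *float_abi_name(void)
{
#if !defined(__arm__)
    /* e.g. when building this sketch on a desktop host */
    return "not a 32-bit Arm target";
#elif defined(__ARM_PCS_VFP)
    /* floating point arguments are passed in VFP registers */
    return "hard";
#elif defined(__ARM_FP)
    /* FP instructions allowed, but arguments are passed in GPRs */
    return "softfp";
#else
    /* no FP instructions, arguments are passed in GPRs */
    return "soft";
#endif
}
```

Printing the result early during boot, or using the same conditions in a static assertion, can catch a mismatched library before the linker error does.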
Floating Point ABIs in Practice
The output of compiling with each ABI can be observed with a simple function
like the following addf with optimizations turned off to ensure that it is not
inlined or stripped entirely.
float __attribute__((optimize("O0"))) addf(float a, float b)
{
    return a + b;
}
If invoking the compiler directly, you can pass the -mfloat-abi option, but if
you are leveraging a build system you may need to identify configuration
specific to the platform. For example, when using
Zephyr, the FPU is enabled by
setting CONFIG_FPU=y. The option depends
on
CONFIG_CPU_HAS_FPU, but defaults to false even when an FPU is present.
config FPU
    bool "Floating point unit (FPU)"
    depends on CPU_HAS_FPU
    help
      This option enables the hardware Floating Point Unit (FPU), in order to
      support using the floating point registers and instructions.

      When this option is enabled, by default, threads may use the floating
      point registers only in an exclusive manner, and this usually means that
      only one thread may perform floating point operations.

      If it is necessary for multiple threads to perform concurrent floating
      point operations, the "FPU register sharing" option must be enabled to
      preserve the floating point registers across context switches.

      Note that this option cannot be selected for the platforms that do not
      include a hardware floating point unit; the floating point support for
      those platforms is dependent on the availability of the toolchain-
      provided software floating point library.
Because the nRF52840 has an FPU, CONFIG_CPU_HAS_FPU is
selected.
config SOC_NRF52840
    select CPU_CORTEX_M_HAS_DWT
    select CPU_HAS_FPU
With the FPU being off by default, west build will result in the use of the
soft ABI.
west build -p -b nrf52840dk/nrf52840 .
As expected, arguments are passed in GPRs, and the floating point addition
operation makes a call to the runtime library function __addsf3, which
implements the operation in software.
0001a860 <addf>:
1a860: b580 push {r7, lr}
1a862: b082 sub sp, #8
1a864: af00 add r7, sp, #0
1a866: 6078 str r0, [r7, #4]
1a868: 6039 str r1, [r7, #0]
1a86a: 6839 ldr r1, [r7, #0]
1a86c: 6878 ldr r0, [r7, #4]
1a86e: f7e5 fc4d bl 10c <__addsf3>
1a872: 4603 mov r3, r0
1a874: 4618 mov r0, r3
1a876: 3708 adds r7, #8
1a878: 46bd mov sp, r7
1a87a: bd80 pop {r7, pc}
When setting CONFIG_FPU=y and building again, the hard float ABI will be
used by default, and the floating point arguments will instead be passed to
addf in floating point registers, then vadd.f32 will be used to perform the
addition.
0001b5ae <addf>:
1b5ae: b480 push {r7}
1b5b0: b083 sub sp, #12
1b5b2: af00 add r7, sp, #0
1b5b4: ed87 0a01 vstr s0, [r7, #4]
1b5b8: edc7 0a00 vstr s1, [r7]
1b5bc: ed97 7a01 vldr s14, [r7, #4]
1b5c0: edd7 7a00 vldr s15, [r7]
1b5c4: ee77 7a27 vadd.f32 s15, s14, s15
1b5c8: eeb0 0a67 vmov.f32 s0, s15
1b5cc: 370c adds r7, #12
1b5ce: 46bd mov sp, r7
1b5d0: f85d 7b04 ldr.w r7, [sp], #4
1b5d4: 4770 bx lr
Setting
CONFIG_FP_SOFTABI=y
will instead result in the use of softfp.
choice
    prompt "Floating point ABI"
    default FP_HARDABI
    depends on FPU

config FP_HARDABI
    bool "Floating point Hard ABI"
    help
      This option selects the Floating point ABI in which hardware floating
      point instructions are generated and uses FPU-specific calling
      conventions.

config FP_SOFTABI
    bool "Floating point Soft ABI"
    help
      This option selects the Floating point ABI in which hardware floating
      point instructions are generated but soft-float calling conventions.

endchoice
The presence of the flag can be observed by providing the -v flag (i.e. west -v build).
-mcpu=cortex-m4 -mthumb -mabi=aapcs -mfpu=fpv4-sp-d16 -mfloat-abi=softfp -mfp16-format=ieee
Finally, the observed output includes the soft calling convention (arguments passed in GPRs), but still utilizes the floating point registers and instructions within the routine.
0001b59a <addf>:
1b59a: b480 push {r7}
1b59c: b083 sub sp, #12
1b59e: af00 add r7, sp, #0
1b5a0: 6078 str r0, [r7, #4]
1b5a2: 6039 str r1, [r7, #0]
1b5a4: ed97 7a01 vldr s14, [r7, #4]
1b5a8: edd7 7a00 vldr s15, [r7]
1b5ac: ee77 7a27 vadd.f32 s15, s14, s15
1b5b0: ee17 3a90 vmov r3, s15
1b5b4: 4618 mov r0, r3
1b5b6: 370c adds r7, #12
1b5b8: 46bd mov sp, r7
1b5ba: f85d 7b04 ldr.w r7, [sp], #4
1b5be: 4770 bx lr
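The softfp listing can be read as "integer calling convention on the outside, floating point hardware on the inside." A rough C model of that shape is sketched below. It assumes the host represents float as IEEE 754 binary32, and the names (addf_softfp_model, bits_to_float, float_to_bits) are invented for this illustration; it is not what the compiler emits, only a way to mirror the data flow.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret the raw 32 bits held in a GPR as a float (no conversion). */
static float bits_to_float(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Reinterpret a float as the raw 32 bits that would sit in a GPR. */
static uint32_t float_to_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* Model of addf under softfp: arguments arrive in r0/r1, result leaves in r0. */
uint32_t addf_softfp_model(uint32_t r0, uint32_t r1)
{
    float s14 = bits_to_float(r0); /* str r0 to the stack, then vldr s14 */
    float s15 = bits_to_float(r1); /* str r1 to the stack, then vldr s15 */

    s15 = s14 + s15;               /* vadd.f32 s15, s14, s15 */

    return float_to_bits(s15);     /* vmov r3, s15; mov r0, r3 */
}
```

Calling addf_softfp_model(0x3F800000, 0x40000000), the bit patterns of 1.0f and 2.0f, returns 0x40400000, the bit pattern of 3.0f.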
Bonus Round: Dynamically Enabling the FPU
If you read the description of CONFIG_FPU above, you’ll notice that it does
more than allow selection of the softfp and hard ABIs: it also enables
the FPU at reset. z_arm_floating_point_init() is called
from
z_prep_c() whenever a processor has an FPU (CONFIG_CPU_HAS_FPU), whether it
is configured to be enabled or not.
FUNC_NORETURN void z_prep_c(void)
{
    soc_prep_hook();
    relocate_vector_table();
#if defined(CONFIG_CPU_HAS_FPU)
    z_arm_floating_point_init();
#endif
    arch_bss_zero();
    arch_data_copy();
#if defined(CONFIG_ARM_CUSTOM_INTERRUPT_CONTROLLER)
    /* Invoke SoC-specific interrupt controller initialization */
    z_soc_irq_init();
#else
    z_arm_interrupt_init();
#endif /* CONFIG_ARM_CUSTOM_INTERRUPT_CONTROLLER */
#if CONFIG_ARCH_CACHE
    arch_cache_init();
#endif
#ifdef CONFIG_NULL_POINTER_EXCEPTION_DETECTION_DWT
    z_arm_debug_enable_null_pointer_detection();
#endif
    z_cstart();
    CODE_UNREACHABLE;
}
Depending on the value of CONFIG_FPU, as well as other configuration,
z_arm_floating_point_init() will set up the FPU accordingly. This is
accomplished by first clearing the CP10 and CP11 fields of the Coprocessor
Access Control Register
(CPACR),
then, if CONFIG_FPU=y, setting the CP10 (CPACR_CP10_PRIV_ACCESS) and CP11
(CPACR_CP11_PRIV_ACCESS) fields to enable the floating point coprocessor
(the FULL_ACCESS variants are used instead when CONFIG_USERSPACE is enabled). Next,
the Floating-Point Context Control Register
(FPCCR)
flags for context state stacking
(ASPEN)
and lazy context save
(LSPEN)
are configured, then the Floating-point Status and Control Register
(FPSCR)
is cleared.
#if defined(CONFIG_CPU_HAS_FPU)
static inline void z_arm_floating_point_init(void)
{
    /*
     * Upon reset, the Co-Processor Access Control Register is, normally,
     * 0x00000000. However, it might be left un-cleared by firmware running
     * before Zephyr boot.
     */
    SCB->CPACR &= (~(CPACR_CP10_Msk | CPACR_CP11_Msk));
#if defined(CONFIG_FPU)
    /*
     * Enable CP10 and CP11 Co-Processors to enable access to floating
     * point registers.
     */
#if defined(CONFIG_USERSPACE)
    /* Full access */
    SCB->CPACR |= CPACR_CP10_FULL_ACCESS | CPACR_CP11_FULL_ACCESS;
#else
    /* Privileged access only */
    SCB->CPACR |= CPACR_CP10_PRIV_ACCESS | CPACR_CP11_PRIV_ACCESS;
#endif /* CONFIG_USERSPACE */
    /*
     * Upon reset, the FPU Context Control Register is 0xC0000000
     * (both Automatic and Lazy state preservation is enabled).
     */
#if defined(CONFIG_MULTITHREADING) && !defined(CONFIG_FPU_SHARING)
    /* Unshared FP registers (multithreading) mode. We disable the
     * automatic stacking of FP registers (automatic setting of
     * FPCA bit in the CONTROL register), upon exception entries,
     * as the FP registers are to be used by a single context (and
     * the use of FP registers in ISRs is not supported). This
     * configuration improves interrupt latency and decreases the
     * stack memory requirement for the (single) thread that makes
     * use of the FP co-processor.
     */
    FPU->FPCCR &= (~(FPU_FPCCR_ASPEN_Msk | FPU_FPCCR_LSPEN_Msk));
#else
    /*
     * FP register sharing (multithreading) mode or single-threading mode.
     *
     * Enable both automatic and lazy state preservation of the FP context.
     * The FPCA bit of the CONTROL register will be automatically set, if
     * the thread uses the floating point registers. Because of lazy state
     * preservation the volatile FP registers will not be stacked upon
     * exception entry, however, the required area in the stack frame will
     * be reserved for them. This configuration improves interrupt latency.
     * The registers will eventually be stacked when the thread is swapped
     * out during context-switch or if an ISR attempts to execute floating
     * point instructions.
     */
    FPU->FPCCR = FPU_FPCCR_ASPEN_Msk | FPU_FPCCR_LSPEN_Msk;
#endif /* CONFIG_FPU_SHARING */
    /* Make the side-effects of modifying the FPCCR be realized
     * immediately.
     */
    barrier_dsync_fence_full();
    barrier_isync_fence_full();
    /* Initialize the Floating Point Status and Control Register. */
#if defined(CONFIG_ARMV8_1_M_MAINLINE)
    /*
     * For ARMv8.1-M with FPU, the FPSCR[18:16] LTPSIZE field must be set
     * to 0b100 for "Tail predication not applied" as it's reset value
     */
    __set_FPSCR(4 << FPU_FPDSCR_LTPSIZE_Pos);
#else
    __set_FPSCR(0);
#endif
    /*
     * Note:
     * The use of the FP register bank is enabled, however the FP context
     * will be activated (FPCA bit on the CONTROL register) in the presence
     * of floating point instructions.
     */
#endif /* CONFIG_FPU */
    /*
     * Upon reset, the CONTROL.FPCA bit is, normally, cleared. However,
     * it might be left un-cleared by firmware running before Zephyr boot.
     * We must clear this bit to prevent errors in exception unstacking.
     *
     * Note:
     * In Sharing FP Registers mode CONTROL.FPCA is cleared before switching
     * to main, so it may be skipped here (saving few boot cycles).
     *
     * If CONFIG_INIT_ARCH_HW_AT_BOOT is set, CONTROL is cleared at reset.
     */
#if (!defined(CONFIG_FPU) || !defined(CONFIG_FPU_SHARING)) && \
    (!defined(CONFIG_INIT_ARCH_HW_AT_BOOT))
    __set_CONTROL(__get_CONTROL() & (~(CONTROL_FPCA_Msk)));
#endif
}
#endif /* CONFIG_CPU_HAS_FPU */
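The CPACR manipulation is easier to follow with the bit positions spelled out. On Armv7-M, CP10 occupies bits [21:20] and CP11 occupies bits [23:22], with 0b00 meaning access denied, 0b01 privileged access only, and 0b11 full access. The sketch below applies the same clear-then-set sequence to a plain integer rather than the real register, so it can run anywhere. The macro and function names are hypothetical stand-ins, not the CMSIS or Zephyr definitions.

```c
#include <stdint.h>

/* Hypothetical names for this sketch; positions follow the Armv7-M CPACR. */
#define SKETCH_CP10_POS       20u
#define SKETCH_CP11_POS       22u
#define SKETCH_CP_FIELD_MASK  0x3u

#define SKETCH_ACCESS_DENIED  0x0u /* any access generates a NOCP fault */
#define SKETCH_ACCESS_PRIV    0x1u /* privileged access only */
#define SKETCH_ACCESS_FULL    0x3u /* privileged and unprivileged access */

/* Return the CPACR value after clearing CP10/CP11 and applying `access`. */
uint32_t cpacr_with_fpu_access(uint32_t cpacr, uint32_t access)
{
    /* Clear both 2-bit fields first, as Zephyr does with the Msk macros. */
    cpacr &= ~((SKETCH_CP_FIELD_MASK << SKETCH_CP10_POS) |
               (SKETCH_CP_FIELD_MASK << SKETCH_CP11_POS));

    /* Both fields are given the same access level. */
    cpacr |= (access << SKETCH_CP10_POS) | (access << SKETCH_CP11_POS);

    return cpacr;
}
```

Starting from a reset value of 0, privileged access yields 0x00500000 and full access yields 0x00F00000. Clearing first matters when earlier firmware left the fields set: going from 0x00F00000 to privileged-only also ends up at 0x00500000.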
Because z_arm_floating_point_init() ensures that the FPU is enabled whenever
CONFIG_FP_HARDABI=y or CONFIG_FP_SOFTABI=y, executing floating point
instructions will not trigger an exception. However, if the FPU were not
enabled prior to the execution of a floating point instruction, a NOCP (No
Coprocessor) Usage
Fault
would be generated. This can be demonstrated by not setting CONFIG_FPU,
then adding the following lines to your CMakeLists.txt.
list(APPEND TOOLCHAIN_C_FLAGS -mfloat-abi=softfp)
list(APPEND TOOLCHAIN_CXX_FLAGS -mfloat-abi=softfp)
Stepping through the program with GDB, the fault can be observed.
(gdb) b addf
Breakpoint 1 at 0x1a662: file main.c.
(gdb) c
Continuing.
Breakpoint 1, addf (a=1.87308763e-40, b=1.09258461e-19) at main.c
141 {
(gdb) s
142 return a + b;
(gdb) x/i $pc
=> 0x1a66c <addf+10>: vldr s14, [r7, #4]
(gdb) s
z_arm_usage_fault () at zephyr/arch/arm/core/cortex_m/fault_s.S:80
80 mrs r0, MSP
(gdb) x/1xh 0xe000ed2a
0xe000ed2a: 0x0008
As expected, the Usage Fault Status Register
(0xe000ed2a)
has bit 3 asserted (0b1000 = 0x0008), which corresponds to the NOCP usage fault.
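Decoding the halfword by hand is fine for one bit, but a tiny helper makes the GDB value easier to interpret. The sketch below lists the Armv7-M UFSR bits as I understand them from the architecture reference manual; the macro and function names are made up for this example, and newer architectures define additional bits that are not covered here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Armv7-M Usage Fault Status Register bits (names are local to this sketch). */
#define SKETCH_UFSR_UNDEFINSTR (1u << 0) /* undefined instruction */
#define SKETCH_UFSR_INVSTATE   (1u << 1) /* invalid execution state */
#define SKETCH_UFSR_INVPC      (1u << 2) /* invalid PC load on exception return */
#define SKETCH_UFSR_NOCP       (1u << 3) /* coprocessor disabled or absent */
#define SKETCH_UFSR_UNALIGNED  (1u << 8) /* trapped unaligned access */
#define SKETCH_UFSR_DIVBYZERO  (1u << 9) /* trapped divide by zero */

/* True when the fault was caused by accessing a disabled coprocessor. */
bool ufsr_is_nocp(uint16_t ufsr)
{
    return (ufsr & SKETCH_UFSR_NOCP) != 0;
}

/* Name of the lowest-numbered fault bit that is set, or "none". */
const char *ufsr_first_cause(uint16_t ufsr)
{
    static const struct {
        uint16_t mask;
        const char *name;
    } causes[] = {
        { SKETCH_UFSR_UNDEFINSTR, "UNDEFINSTR" },
        { SKETCH_UFSR_INVSTATE,   "INVSTATE" },
        { SKETCH_UFSR_INVPC,      "INVPC" },
        { SKETCH_UFSR_NOCP,       "NOCP" },
        { SKETCH_UFSR_UNALIGNED,  "UNALIGNED" },
        { SKETCH_UFSR_DIVBYZERO,  "DIVBYZERO" },
    };

    for (unsigned i = 0; i < sizeof causes / sizeof causes[0]; i++) {
        if (ufsr & causes[i].mask) {
            return causes[i].name;
        }
    }
    return "none";
}
```

Feeding it the value read in GDB, ufsr_first_cause(0x0008) returns "NOCP".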
However, there is no specific reason why the FPU must be enabled at reset, or
even remain enabled continuously. The same sequence of operations could be added to the
addf function to enable the FPU “just in time.”
float __attribute__((optimize("O0"))) addf(float a, float b)
{
    SCB->CPACR &= (~(CPACR_CP10_Msk | CPACR_CP11_Msk));
    SCB->CPACR |= CPACR_CP10_PRIV_ACCESS | CPACR_CP11_PRIV_ACCESS;
    FPU->FPCCR = FPU_FPCCR_ASPEN_Msk | FPU_FPCCR_LSPEN_Msk;
    barrier_dsync_fence_full();
    barrier_isync_fence_full();
    __set_FPSCR(0);
    return a + b;
}
Recompiling and flashing the nRF52840 results in successful execution of the floating point operations in the function. Similar behavior could be achieved by adjusting the usage fault handler to enable the FPU if the NOCP bit is set, then returning execution to the instruction that generated the fault.
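The fault-handler variant can be sketched as a small decision function. To keep it runnable off-target, the version below operates on a struct that stands in for the memory-mapped registers, and all of its names (struct sim_scb, sim_usage_fault_try_enable_fpu, the SIM_ macros) are hypothetical. It is not Zephyr's handler, and it leaves out everything a real one would need: the write-one-to-clear behavior of the status bit, the barriers after changing the CPACR, the FPCCR and FPSCR setup, and a check that the faulting instruction really targeted the FPU.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the System Control Block registers involved. */
struct sim_scb {
    uint32_t cpacr; /* Coprocessor Access Control Register */
    uint16_t ufsr;  /* Usage Fault Status Register */
};

#define SIM_UFSR_NOCP            (1u << 3)
#define SIM_CPACR_CP10_CP11_MSK  (0xFu << 20)
#define SIM_CPACR_CP10_PRIV      (0x1u << 20)
#define SIM_CPACR_CP11_PRIV      (0x1u << 22)

/*
 * Returns true if the fault was a NOCP fault and the FPU has now been
 * enabled, meaning the caller may return to the faulting instruction.
 * Returns false for any other usage fault, which should be handled normally.
 */
bool sim_usage_fault_try_enable_fpu(struct sim_scb *scb)
{
    if ((scb->ufsr & SIM_UFSR_NOCP) == 0) {
        return false;
    }

    /* Same clear-then-set sequence used at boot. */
    scb->cpacr &= ~SIM_CPACR_CP10_CP11_MSK;
    scb->cpacr |= SIM_CPACR_CP10_PRIV | SIM_CPACR_CP11_PRIV;

    /*
     * Acknowledge the fault. In this model the bit is simply cleared;
     * on hardware the sticky bit is cleared by writing 1 to it.
     */
    scb->ufsr &= (uint16_t)~SIM_UFSR_NOCP;

    return true;
}
```

The boolean return keeps the policy decision visible: only a NOCP fault is treated as recoverable, and every other usage fault still falls through to the normal fatal path.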
Turning the FPU on and off in this manner can lead to issues in many ways and should be done with extreme caution, but there are legitimate use cases for limiting the time in which floating point hardware is enabled. We’ll explore these scenarios and some of the trade-offs between hardware and software floating point in a future post.