Floating Point Fun on Cortex-M Processors

In my recent post on the PSA Crypto API, I demonstrated the use of the API on two different MCUs: the nRF52840 and the ESP32-S3. In the case of the former, the ECDSA signature operation was eventually executed in a closed source library that manages communication between the Arm Cortex-M4 processor and the Arm TrustZone CryptoCell 310 security subsystem. Readers who ventured down the rabbit hole of links in the post may have noticed that there are variants of the nrf_cc310_mbedcrypto libraries for hard-float and soft-float. If you have ever hit a linker error of the following style, you know exactly why.

ld.bfd: error: X uses VFP register arguments, Y does not
ld.bfd: failed to merge target specific data of file

Arm defines three floating point Application Binary Interface (ABI) options, which are controlled by the -mfloat-abi compiler flag.

soft: Soft ABI without FPU hardware: All floating-point operations are handled by runtime library functions. Values are passed through the integer register bank.
softfp: Soft ABI with FPU hardware: The compiler may generate code that directly accesses the FPU. But, if a calculation needs to use a runtime library function, a soft-float calling convention is used. Values are passed through the integer register bank.
hard: Hard ABI: The compiler may generate code that directly accesses the FPU and uses FPU-specific calling conventions when calling runtime library functions.

Arm, like most Instruction Set Architectures (ISAs), passes arguments to subroutines in general purpose registers (GPRs), specifically r0-r3. When the number or size of arguments exceeds the available GPRs, the remaining arguments are “spilled” to the stack, where they can be accessed by the callee. However, when a processor includes a Floating Point Unit (FPU) (more specifically for Armv7-M processors, the CP10 and CP11 coprocessors), and thus the floating point extension, there is an additional register bank with 32 floating point registers (s0-s31).

Side note: you may see the term Vector Floating Point (VFP) when referring to floating point on Armv7-M processors, such as the Cortex-M4. The reference manual explains why this is the case: “In the ARMv7-A and ARMv7-R architecture profiles, floating point instructions are called VFP instructions and have mnemonics starting with V. Because ARM assembler is highly consistent across architecture versions and profiles, ARMv7-M retains these mnemonics, but normally describes the instructions as floating point instructions, or FP instructions.”

When using the hard ABI, the s0-s15 registers can be used for passing arguments to subroutines. The use of hard also indicates that floating point instructions (load and store, register transfer, data processing) may be used within routines.
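To make the difference in calling conventions concrete, here is a minimal host-side sketch that models where each argument of a call would land. It is deliberately simplified: every argument is assumed to be either a 32-bit integer or a single-precision float, and it ignores doubles, structures, variadic functions, and the other details of the real procedure call standard. All of the names (assign_args, struct arg_loc, and so on) are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified model: every argument is a 32-bit int or a 32-bit float. */
enum arg_kind { ARG_INT, ARG_FLOAT };
enum float_abi { ABI_SOFT, ABI_SOFTFP, ABI_HARD };
enum loc_kind { LOC_CORE_REG, LOC_VFP_REG, LOC_STACK };

struct arg_loc {
    enum loc_kind kind;
    int index; /* register number (r<index> or s<index>), -1 for the stack */
};

#define NUM_CORE_ARG_REGS 4  /* r0-r3 */
#define NUM_VFP_ARG_REGS  16 /* s0-s15 */

/*
 * Fills out[i] with the location of argument i. Only the hard ABI places
 * floats in s registers; soft and softfp share the integer-register
 * convention, so they produce identical results here.
 */
void assign_args(enum float_abi abi, const enum arg_kind *args, size_t n,
                 struct arg_loc *out)
{
    int next_core = 0;
    int next_vfp = 0;

    assert(args != NULL && out != NULL);

    for (size_t i = 0; i < n; i++) {
        if (abi == ABI_HARD && args[i] == ARG_FLOAT) {
            if (next_vfp < NUM_VFP_ARG_REGS) {
                out[i] = (struct arg_loc){ LOC_VFP_REG, next_vfp++ };
            } else {
                out[i] = (struct arg_loc){ LOC_STACK, -1 };
            }
        } else if (next_core < NUM_CORE_ARG_REGS) {
            out[i] = (struct arg_loc){ LOC_CORE_REG, next_core++ };
        } else {
            out[i] = (struct arg_loc){ LOC_STACK, -1 };
        }
    }
}
```

For a two-float function, this model places the arguments in s0 and s1 under hard, and in r0 and r1 under soft or softfp.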

When using softfp, floating point instructions are allowed within routines, but arguments cannot be passed in floating point registers. soft uses the same calling convention as softfp, and is thus compatible, but does not allow for the use of floating point instructions. When floating point operations are performed without support for floating point instructions, they must be emulated in software. When you see the error described at the beginning of this post, you are mixing soft/softfp with hard, which the linker will refuse. It is able to determine the ABI of an object file being linked by looking at the Arm attributes section, which differs for each variant. For example, on the nRF52840, the attributes appear as follows (extracted via readelf).

hard

Attribute Section: aeabi
File Attributes
  Tag_CPU_name: "7E-M"
  Tag_CPU_arch: v7E-M
  Tag_CPU_arch_profile: Microcontroller
  Tag_THUMB_ISA_use: Thumb-2
  Tag_FP_arch: VFPv4-D16
  Tag_ABI_PCS_wchar_t: 4
  Tag_ABI_FP_denormal: Needed
  Tag_ABI_FP_exceptions: Needed
  Tag_ABI_FP_number_model: IEEE 754
  Tag_ABI_align_needed: 8-byte
  Tag_ABI_align_preserved: 8-byte, except leaf SP
  Tag_ABI_enum_size: small
  Tag_ABI_HardFP_use: SP only
  Tag_ABI_VFP_args: VFP registers
  Tag_ABI_optimization_goals: Aggressive Speed
  Tag_CPU_unaligned_access: v6

softfp

Attribute Section: aeabi
File Attributes
  Tag_CPU_name: "7E-M"
  Tag_CPU_arch: v7E-M
  Tag_CPU_arch_profile: Microcontroller
  Tag_THUMB_ISA_use: Thumb-2
  Tag_FP_arch: VFPv4-D16
  Tag_ABI_PCS_wchar_t: 4
  Tag_ABI_FP_rounding: Needed
  Tag_ABI_FP_denormal: Needed
  Tag_ABI_FP_exceptions: Needed
  Tag_ABI_FP_user_exceptions: Needed
  Tag_ABI_FP_number_model: IEEE 754
  Tag_ABI_align_needed: 8-byte
  Tag_ABI_enum_size: small
  Tag_ABI_HardFP_use: SP only
  Tag_ABI_optimization_goals: Aggressive Size
  Tag_CPU_unaligned_access: v6
  Tag_ABI_FP_16bit_format: IEEE 754

soft

Attribute Section: aeabi
File Attributes
  Tag_CPU_name: "7E-M"
  Tag_CPU_arch: v7E-M
  Tag_CPU_arch_profile: Microcontroller
  Tag_THUMB_ISA_use: Thumb-2
  Tag_ABI_PCS_wchar_t: 4
  Tag_ABI_FP_denormal: Needed
  Tag_ABI_FP_exceptions: Needed
  Tag_ABI_FP_number_model: IEEE 754
  Tag_ABI_align_needed: 8-byte
  Tag_ABI_align_preserved: 8-byte, except leaf SP
  Tag_ABI_enum_size: small
  Tag_ABI_optimization_goals: Aggressive Speed
  Tag_CPU_unaligned_access: v6
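
Comparing the three dumps, only the hard variant carries Tag_ABI_VFP_args, and only hard and softfp carry Tag_FP_arch. The compatibility rule described earlier can be modeled with a small sketch built on those two observations. This is not how ld is implemented, and the real check considers more attribute values than these; struct obj_attrs and can_link are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* The two attributes that differ between the dumps above. */
struct obj_attrs {
    bool has_fp_arch; /* Tag_FP_arch present (hard, softfp) */
    bool vfp_args;    /* Tag_ABI_VFP_args: VFP registers (hard only) */
};

/*
 * Two objects can be linked in this model when they agree on how float
 * arguments are passed. Whether FP instructions are used inside a routine
 * does not matter, which is why soft and softfp objects can be mixed.
 */
bool can_link(struct obj_attrs a, struct obj_attrs b)
{
    /* Passing arguments in VFP registers implies an FP architecture. */
    assert(!a.vfp_args || a.has_fp_arch);
    assert(!b.vfp_args || b.has_fp_arch);

    return a.vfp_args == b.vfp_args;
}
```

With hard as {true, true}, softfp as {true, false}, and soft as {false, false}, the model accepts soft with softfp and rejects either of them paired with hard.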

Floating Point ABIs in Practice

The output of compiling with each ABI can be observed with a simple function like the following addf with optimizations turned off to ensure that it is not inlined or stripped entirely.

float __attribute__((optimize("O0"))) addf(float a, float b)
{
    return a + b;
}

If invoking the compiler directly, you can pass the -mfloat-abi option, but if you are using a build system you may need to find the platform-specific configuration that controls it. For example, when using Zephyr, the FPU is enabled by setting CONFIG_FPU=y. The option depends on CONFIG_CPU_HAS_FPU, but it is disabled by default even when an FPU is present.

config FPU
	bool "Floating point unit (FPU)"
	depends on CPU_HAS_FPU
	help
	  This option enables the hardware Floating Point Unit (FPU), in order to
	  support using the floating point registers and instructions.

	  When this option is enabled, by default, threads may use the floating
	  point registers only in an exclusive manner, and this usually means that
	  only one thread may perform floating point operations.

	  If it is necessary for multiple threads to perform concurrent floating
	  point operations, the "FPU register sharing" option must be enabled to
	  preserve the floating point registers across context switches.

	  Note that this option cannot be selected for the platforms that do not
	  include a hardware floating point unit; the floating point support for
	  those platforms is dependent on the availability of the toolchain-
	  provided software floating point library.

Because the nRF52840 has an FPU, CONFIG_CPU_HAS_FPU is selected.

config SOC_NRF52840
	select CPU_CORTEX_M_HAS_DWT
	select CPU_HAS_FPU

With the FPU being off by default, west build will result in the use of the soft ABI.

west build -p -b nrf52840dk/nrf52840 .

As expected, arguments are passed in GPRs, and the floating point addition operation makes a call to the runtime library function __addsf3, which implements the operation in software.

0001a860 <addf>:
   1a860:	b580      	push	{r7, lr}
   1a862:	b082      	sub	sp, #8
   1a864:	af00      	add	r7, sp, #0
   1a866:	6078      	str	r0, [r7, #4]
   1a868:	6039      	str	r1, [r7, #0]
   1a86a:	6839      	ldr	r1, [r7, #0]
   1a86c:	6878      	ldr	r0, [r7, #4]
   1a86e:	f7e5 fc4d 	bl	10c <__addsf3>
   1a872:	4603      	mov	r3, r0
   1a874:	4618      	mov	r0, r3
   1a876:	3708      	adds	r7, #8
   1a878:	46bd      	mov	sp, r7
   1a87a:	bd80      	pop	{r7, pc}
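
Passing a float in a GPR does not convert the value; the register simply carries the same 32 bits. The following host-side sketch shows that reinterpretation, assuming a platform where float is an IEEE 754 single-precision value. The function names are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Returns the 32-bit pattern that would travel in a core register such as
 * r0 when a float is passed under the soft or softfp calling convention.
 */
uint32_t float_to_reg_bits(float value)
{
    uint32_t bits;

    assert(sizeof(bits) == sizeof(value));
    memcpy(&bits, &value, sizeof(bits)); /* copy bits, no numeric conversion */
    return bits;
}

/* Inverse operation: what the callee sees when it treats r0 as a float. */
float reg_bits_to_float(uint32_t bits)
{
    float value;

    assert(sizeof(bits) == sizeof(value));
    memcpy(&value, &bits, sizeof(value));
    return value;
}
```

For example, 1.0f travels as 0x3F800000. A routine such as __addsf3 receives two such patterns in r0 and r1 and returns the pattern of the sum in r0.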

When setting CONFIG_FPU=y and building again, the hard float ABI will be used by default, and the floating point arguments will instead be passed to addf in floating point registers, then vadd.f32 will be used to perform the addition.

0001b5ae <addf>:
   1b5ae:	b480      	push	{r7}
   1b5b0:	b083      	sub	sp, #12
   1b5b2:	af00      	add	r7, sp, #0
   1b5b4:	ed87 0a01 	vstr	s0, [r7, #4]
   1b5b8:	edc7 0a00 	vstr	s1, [r7]
   1b5bc:	ed97 7a01 	vldr	s14, [r7, #4]
   1b5c0:	edd7 7a00 	vldr	s15, [r7]
   1b5c4:	ee77 7a27 	vadd.f32	s15, s14, s15
   1b5c8:	eeb0 0a67 	vmov.f32	s0, s15
   1b5cc:	370c      	adds	r7, #12
   1b5ce:	46bd      	mov	sp, r7
   1b5d0:	f85d 7b04 	ldr.w	r7, [sp], #4
   1b5d4:	4770      	bx	lr

Setting CONFIG_FP_SOFTABI=y will instead result in the use of softfp.

choice
	prompt "Floating point ABI"
	default FP_HARDABI
	depends on FPU

config FP_HARDABI
	bool "Floating point Hard ABI"
	help
	  This option selects the Floating point ABI in which hardware floating
	  point instructions are generated and uses FPU-specific calling
	  conventions.

config FP_SOFTABI
	bool "Floating point Soft ABI"
	help
	  This option selects the Floating point ABI in which hardware floating
	  point instructions are generated but soft-float calling conventions.

endchoice

The resulting -mfloat-abi=softfp compiler flag can be observed by passing the -v flag to west (i.e. west -v build).

-mcpu=cortex-m4  -mthumb  -mabi=aapcs  -mfpu=fpv4-sp-d16  -mfloat-abi=softfp  -mfp16-format=ieee

Finally, the observed output includes the soft calling convention (arguments passed in GPRs), but still utilizes the floating point registers and instructions within the routine.

0001b59a <addf>:
   1b59a:	b480      	push	{r7}
   1b59c:	b083      	sub	sp, #12
   1b59e:	af00      	add	r7, sp, #0
   1b5a0:	6078      	str	r0, [r7, #4]
   1b5a2:	6039      	str	r1, [r7, #0]
   1b5a4:	ed97 7a01 	vldr	s14, [r7, #4]
   1b5a8:	edd7 7a00 	vldr	s15, [r7]
   1b5ac:	ee77 7a27 	vadd.f32	s15, s14, s15
   1b5b0:	ee17 3a90 	vmov	r3, s15
   1b5b4:	4618      	mov	r0, r3
   1b5b6:	370c      	adds	r7, #12
   1b5b8:	46bd      	mov	sp, r7
   1b5ba:	f85d 7b04 	ldr.w	r7, [sp], #4
   1b5be:	4770      	bx	lr

Bonus Round: Dynamically Enabling the FPU

If you read the description of CONFIG_FPU above, you’ll notice that it does more than allow selection of the softfp and hard ABIs: it also enables the FPU at reset. z_arm_floating_point_init() is called from z_prep_c() whenever a processor has an FPU (CONFIG_CPU_HAS_FPU), whether it is configured to be enabled or not.

FUNC_NORETURN void z_prep_c(void)
{
	soc_prep_hook();

	relocate_vector_table();
#if defined(CONFIG_CPU_HAS_FPU)
	z_arm_floating_point_init();
#endif
	arch_bss_zero();
	arch_data_copy();
#if defined(CONFIG_ARM_CUSTOM_INTERRUPT_CONTROLLER)
	/* Invoke SoC-specific interrupt controller initialization */
	z_soc_irq_init();
#else
	z_arm_interrupt_init();
#endif /* CONFIG_ARM_CUSTOM_INTERRUPT_CONTROLLER */
#if CONFIG_ARCH_CACHE
	arch_cache_init();
#endif

#ifdef CONFIG_NULL_POINTER_EXCEPTION_DETECTION_DWT
	z_arm_debug_enable_null_pointer_detection();
#endif
	z_cstart();
	CODE_UNREACHABLE;
}

Depending on the value of CONFIG_FPU, as well as other configuration, z_arm_floating_point_init() will set up the FPU accordingly. This is accomplished by first clearing the CP10 and CP11 fields of the Coprocessor Access Control Register (CPACR), then, if CONFIG_FPU=y, setting those fields to enable access to the floating point coprocessor: privileged access only (CPACR_CP10_PRIV_ACCESS and CPACR_CP11_PRIV_ACCESS) by default, or full access when CONFIG_USERSPACE is set. Next, the Floating-Point Context Control Register (FPCCR) flags for context state stacking (ASPEN) and lazy context save (LSPEN) are configured, then the Floating-point Status and Control Register (FPSCR) is initialized.

#if defined(CONFIG_CPU_HAS_FPU)
static inline void z_arm_floating_point_init(void)
{
	/*
	 * Upon reset, the Co-Processor Access Control Register is, normally,
	 * 0x00000000. However, it might be left un-cleared by firmware running
	 * before Zephyr boot.
	 */
	SCB->CPACR &= (~(CPACR_CP10_Msk | CPACR_CP11_Msk));

#if defined(CONFIG_FPU)
	/*
	 * Enable CP10 and CP11 Co-Processors to enable access to floating
	 * point registers.
	 */
#if defined(CONFIG_USERSPACE)
	/* Full access */
	SCB->CPACR |= CPACR_CP10_FULL_ACCESS | CPACR_CP11_FULL_ACCESS;
#else
	/* Privileged access only */
	SCB->CPACR |= CPACR_CP10_PRIV_ACCESS | CPACR_CP11_PRIV_ACCESS;
#endif  /* CONFIG_USERSPACE */
	/*
	 * Upon reset, the FPU Context Control Register is 0xC0000000
	 * (both Automatic and Lazy state preservation is enabled).
	 */
#if defined(CONFIG_MULTITHREADING) && !defined(CONFIG_FPU_SHARING)
	/* Unshared FP registers (multithreading) mode. We disable the
	 * automatic stacking of FP registers (automatic setting of
	 * FPCA bit in the CONTROL register), upon exception entries,
	 * as the FP registers are to be used by a single context (and
	 * the use of FP registers in ISRs is not supported). This
	 * configuration improves interrupt latency and decreases the
	 * stack memory requirement for the (single) thread that makes
	 * use of the FP co-processor.
	 */
	FPU->FPCCR &= (~(FPU_FPCCR_ASPEN_Msk | FPU_FPCCR_LSPEN_Msk));
#else
	/*
	 * FP register sharing (multithreading) mode or single-threading mode.
	 *
	 * Enable both automatic and lazy state preservation of the FP context.
	 * The FPCA bit of the CONTROL register will be automatically set, if
	 * the thread uses the floating point registers. Because of lazy state
	 * preservation the volatile FP registers will not be stacked upon
	 * exception entry, however, the required area in the stack frame will
	 * be reserved for them. This configuration improves interrupt latency.
	 * The registers will eventually be stacked when the thread is swapped
	 * out during context-switch or if an ISR attempts to execute floating
	 * point instructions.
	 */
	FPU->FPCCR = FPU_FPCCR_ASPEN_Msk | FPU_FPCCR_LSPEN_Msk;
#endif /* CONFIG_FPU_SHARING */

	/* Make the side-effects of modifying the FPCCR be realized
	 * immediately.
	 */
	barrier_dsync_fence_full();
	barrier_isync_fence_full();

	/* Initialize the Floating Point Status and Control Register. */
#if defined(CONFIG_ARMV8_1_M_MAINLINE)
	/*
	 * For ARMv8.1-M with FPU, the FPSCR[18:16] LTPSIZE field must be set
	 * to 0b100 for "Tail predication not applied" as it's reset value
	 */
	__set_FPSCR(4 << FPU_FPDSCR_LTPSIZE_Pos);
#else
	__set_FPSCR(0);
#endif

	/*
	 * Note:
	 * The use of the FP register bank is enabled, however the FP context
	 * will be activated (FPCA bit on the CONTROL register) in the presence
	 * of floating point instructions.
	 */

#endif /* CONFIG_FPU */

	/*
	 * Upon reset, the CONTROL.FPCA bit is, normally, cleared. However,
	 * it might be left un-cleared by firmware running before Zephyr boot.
	 * We must clear this bit to prevent errors in exception unstacking.
	 *
	 * Note:
	 * In Sharing FP Registers mode CONTROL.FPCA is cleared before switching
	 * to main, so it may be skipped here (saving few boot cycles).
	 *
	 * If CONFIG_INIT_ARCH_HW_AT_BOOT is set, CONTROL is cleared at reset.
	 */
#if (!defined(CONFIG_FPU) || !defined(CONFIG_FPU_SHARING)) &&                                      \
	(!defined(CONFIG_INIT_ARCH_HW_AT_BOOT))

	__set_CONTROL(__get_CONTROL() & (~(CONTROL_FPCA_Msk)));
#endif
}
#endif /* CONFIG_CPU_HAS_FPU */
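
The register arithmetic in this function can be tried out on a host machine by treating the registers as plain integers. The sketch below is only a model: the macro and function names are my own rather than the CMSIS or Zephyr definitions, and the field positions (CP10 in bits 21:20, CP11 in bits 23:22, LSPEN in bit 30, ASPEN in bit 31) are taken from the Armv7-M architecture.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Names local to this sketch; positions per the Armv7-M architecture. */
#define CP10_SHIFT       20u
#define CP11_SHIFT       22u
#define CP_FIELD_MASK    0x3u
#define CP_ACCESS_DENIED 0x0u
#define CP_ACCESS_PRIV   0x1u /* privileged access only */
#define CP_ACCESS_FULL   0x3u /* privileged and unprivileged access */

#define FPCCR_LSPEN (1u << 30)
#define FPCCR_ASPEN (1u << 31)

/* Clears the CP10/CP11 fields, then applies the requested access level. */
uint32_t cpacr_set_fpu_access(uint32_t cpacr, uint32_t access)
{
    assert(access == CP_ACCESS_DENIED || access == CP_ACCESS_PRIV ||
           access == CP_ACCESS_FULL);

    cpacr &= ~((CP_FIELD_MASK << CP10_SHIFT) | (CP_FIELD_MASK << CP11_SHIFT));
    cpacr |= (access << CP10_SHIFT) | (access << CP11_SHIFT);
    return cpacr;
}

/* Mirrors the FPCCR branch: unshared multithreading disables stacking. */
uint32_t fpccr_configure(uint32_t fpccr, bool multithreading, bool fpu_sharing)
{
    if (multithreading && !fpu_sharing) {
        return fpccr & ~(FPCCR_ASPEN | FPCCR_LSPEN);
    }
    return FPCCR_ASPEN | FPCCR_LSPEN;
}
```

Starting from a cleared CPACR, privileged access yields 0x00500000 and full access yields 0x00F00000. On hardware, the writes would also need the barriers shown in the Zephyr code before the first floating point instruction executes.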

Because z_arm_floating_point_init() ensures that the FPU is enabled whenever CONFIG_FP_HARDABI=y or CONFIG_FP_SOFTABI=y, executing floating point instructions will not trigger an exception. However, if the FPU is not enabled before a floating point instruction executes, an NOCP (No Coprocessor) Usage Fault is generated. This can be demonstrated by not setting CONFIG_FPU, then adding the following lines to your CMakeLists.txt.

list(APPEND TOOLCHAIN_C_FLAGS -mfloat-abi=softfp)
list(APPEND TOOLCHAIN_CXX_FLAGS -mfloat-abi=softfp)

Stepping through the program with GDB, the fault can be observed.

(gdb) b addf
Breakpoint 1 at 0x1a662: file main.c.
(gdb) c
Continuing.

Breakpoint 1, addf (a=1.87308763e-40, b=1.09258461e-19) at main.c
141	{
(gdb) s
142	    return a + b;
(gdb) x/i $pc
=> 0x1a66c <addf+10>:	vldr	s14, [r7, #4]
(gdb) s
z_arm_usage_fault () at zephyr/arch/arm/core/cortex_m/fault_s.S:80
80		mrs r0, MSP
(gdb) x/1xh 0xe000ed2a
0xe000ed2a:	0x0008

As expected, the Usage Fault Status Register (0xe000ed2a) has bit 3 asserted (0b1000 = 0x0008), which corresponds to the NOCP usage fault. However, there is no specific reason why the FPU must be enabled at reset, or even enabled continuously. The same sequence of operations could be added to the addf function to enable the FPU “just in time”.

float __attribute__((optimize("O0"))) addf(float a, float b)
{
    SCB->CPACR &= (~(CPACR_CP10_Msk | CPACR_CP11_Msk));
    SCB->CPACR |= CPACR_CP10_PRIV_ACCESS | CPACR_CP11_PRIV_ACCESS;
    FPU->FPCCR = FPU_FPCCR_ASPEN_Msk | FPU_FPCCR_LSPEN_Msk;
    barrier_dsync_fence_full();
    barrier_isync_fence_full();
    __set_FPSCR(0);
    return a + b;
}

Recompiling and flashing the nRF52840 results in successful execution of the floating point operations in the function. Similar behavior could be achieved by adjusting the usage fault handler to enable the FPU if the NOCP bit is set, then returning execution to the instruction that generated the fault.
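
The decision such a handler would make can be sketched on a host by standing in a plain structure for the memory-mapped registers. Everything here is hypothetical (struct fake_scb, handle_usage_fault, and the macros are not Zephyr or CMSIS names), and a real handler would additionally need barriers after writing the CPACR and would clear the fault status bit by writing a 1 to it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define UFSR_NOCP      (1u << 3)                 /* No Coprocessor fault */
#define CPACR_FPU_MASK ((3u << 20) | (3u << 22)) /* CP10 and CP11 fields */
#define CPACR_FPU_PRIV ((1u << 20) | (1u << 22)) /* privileged access */

/* Stand-in for the registers a real handler would reach through SCB. */
struct fake_scb {
    uint32_t cpacr;
    uint16_t ufsr;
};

/* Returns true when the faulting instruction should be retried. */
bool handle_usage_fault(struct fake_scb *scb)
{
    assert(scb != NULL);

    if ((scb->ufsr & UFSR_NOCP) == 0) {
        return false; /* a different usage fault: not ours to fix */
    }
    if ((scb->cpacr & CPACR_FPU_MASK) != 0) {
        return false; /* FPU access already configured: NOCP has another cause */
    }

    /* Grant privileged access to CP10/CP11 and acknowledge the fault. */
    scb->cpacr = (scb->cpacr & ~CPACR_FPU_MASK) | CPACR_FPU_PRIV;
    scb->ufsr &= (uint16_t)~UFSR_NOCP; /* model only; hardware is write-1-to-clear */
    return true;
}
```

Feeding the model the state from the GDB session above (CPACR cleared, UFSR of 0x0008) returns true and leaves the CPACR at 0x00500000, while any other usage fault is left untouched.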

Turning the FPU on and off in this manner can lead to issues for many reasons and should be done with extreme caution, but there are legitimate use cases for limiting the time in which floating point hardware is enabled. We’ll explore these scenarios and some of the trade-offs between hardware and software floating point in a future post.