The buzz around “edge AI”, a term that means something slightly different to almost everyone you talk to, has long since reached a fever pitch. Regardless of what edge AI means to you, the one commonality is typically that the hardware on which inference is performed is constrained in one or more dimensions, whether that be compute, memory, or network bandwidth. Perhaps the most constrained of these platforms are microcontrollers.
I have found that, while there is much discourse around “running AI” (i.e. performing inference) on microcontrollers, there is a general lack of information about what these systems are actually capable of, and how new hardware advancements impact that equation. It is my hope with this series to peel back some of the layers of terminology and explore what actually happens between supplying inputs to a model and receiving outputs. Along the way, we’ll ground our exploration in performing inference with real models on real constrained hardware.
While “weights” get the majority of the attention with AI models, they alone are not sufficient for performing inference. Depending on how a model is distributed and what runtime is used, additional data or metadata may be supplied alongside the model, or may be defined explicitly in software that interacts with the weights. The most popular runtime for microcontrollers is Tensorflow Lite for Microcontrollers (tflite-micro), which is an optimized version of Tensorflow Lite.
Note: Google recently rebranded Tensorflow Lite to LiteRT, and tflite-micro to LiteRT for Microcontrollers.
tflite-micro uses the .tflite file format, which encodes data using FlatBuffers. Unlike some other model file formats, .tflite files include not only the tensors that encapsulate model weights, but also the computation graph, which informs the runtime of what operations to use when performing inference. In order to do so, there needs to be a defined set of operators. This is somewhat analogous to the instructions defined in an instruction set architecture (ISA) for a processor. With an ISA, a compiler will take a higher-level programming language and map the behavior onto instructions available in the ISA. Tensorflow supports an extensive set of built-in operators, while Tensorflow Lite, and thus tflite-micro, supports only a subset.
Continuing the analogy, many processors implement specific versions of the Arm architecture, but that doesn’t mean that processors implementing the same ISA are equivalent. Every instruction that is supported has to be implemented in hardware, and decisions about how the processor is designed can impact performance along multiple dimensions. Similarly, while Tensorflow Lite defines a set of operators, the implementations of those operators, which are referred to as kernels, may vary. Kernels are implemented in software, but depending on the underlying hardware, a kernel might take many instructions to execute, or may be optimized to leverage dedicated hardware support.
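To make the relationship between operators and kernels concrete, here is a minimal sketch of how an application using tflite-micro registers kernel implementations for the operators referenced by its model’s computation graph. The model data and arena size are hypothetical placeholders, and exact signatures vary slightly between versions of the library.
#include <cstdint>

#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/micro/micro_mutable_op_resolver.h"
#include "tensorflow/lite/schema/schema_generated.h"

// Hypothetical: the bytes of a .tflite FlatBuffer compiled into the firmware.
extern const unsigned char g_model_data[];

// Hypothetical arena size; real sizing depends on the model.
constexpr int kTensorArenaSize = 16 * 1024;
static uint8_t tensor_arena[kTensorArenaSize];

TfLiteStatus RunInferenceOnce() {
  const tflite::Model* model = tflite::GetModel(g_model_data);

  // The resolver maps operators in the computation graph to the kernels that
  // implement them; only registered operators can be executed.
  static tflite::MicroMutableOpResolver<2> resolver;
  resolver.AddAdd();
  resolver.AddFullyConnected();

  static tflite::MicroInterpreter interpreter(model, resolver, tensor_arena,
                                              kTensorArenaSize);
  if (interpreter.AllocateTensors() != kTfLiteOk) {
    return kTfLiteError;
  }

  // ... populate interpreter.input(0)->data here ...
  return interpreter.Invoke();
}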
A simple example is the addition operator (TFL::AddOp). We’ll cover how operators and kernels are registered and invoked in a future post, but let’s start by taking a look at the default tflite-micro addition operator logic.
tensorflow/lite/micro/kernels/add.cc
TfLiteStatus AddEval(TfLiteContext* context, TfLiteNode* node) {
auto* params = reinterpret_cast<TfLiteAddParams*>(node->builtin_data);
TFLITE_DCHECK(node->user_data != nullptr);
const OpDataAdd* data = static_cast<const OpDataAdd*>(node->user_data);
const TfLiteEvalTensor* input1 =
tflite::micro::GetEvalInput(context, node, kAddInputTensor1);
const TfLiteEvalTensor* input2 =
tflite::micro::GetEvalInput(context, node, kAddInputTensor2);
TfLiteEvalTensor* output =
tflite::micro::GetEvalOutput(context, node, kAddOutputTensor);
if (output->type == kTfLiteFloat32 || output->type == kTfLiteInt32) {
TF_LITE_ENSURE_OK(
context, EvalAdd(context, node, params, data, input1, input2, output));
} else if (output->type == kTfLiteInt8 || output->type == kTfLiteInt16) {
TF_LITE_ENSURE_OK(context, EvalAddQuantized(context, node, params, data,
input1, input2, output));
} else {
MicroPrintf("Type %s (%d) not supported.", TfLiteTypeGetName(output->type),
output->type);
return kTfLiteError;
}
return kTfLiteOk;
}
TFLMRegistration Register_ADD() {
return tflite::micro::RegisterOp(AddInit, AddPrepare, AddEval);
}
As can be observed in AddEval(), the type of output we are expecting influences the implementation of the operator. To illustrate how the underlying hardware impacts performance, let’s focus on the case in which we expect kTfLiteInt8 (signed 8-bit integer) or kTfLiteInt16 (signed 16-bit integer) output, meaning that we’ll call EvalAddQuantized().
tensorflow/lite/micro/kernels/add.cc
TfLiteStatus EvalAddQuantized(TfLiteContext* context, TfLiteNode* node,
TfLiteAddParams* params, const OpDataAdd* data,
const TfLiteEvalTensor* input1,
const TfLiteEvalTensor* input2,
TfLiteEvalTensor* output) {
tflite::ArithmeticParams op_params = {};
op_params.left_shift = data->left_shift;
op_params.input1_offset = data->input1_offset;
op_params.input1_multiplier = data->input1_multiplier;
op_params.input1_shift = data->input1_shift;
op_params.input2_offset = data->input2_offset;
op_params.input2_multiplier = data->input2_multiplier;
op_params.input2_shift = data->input2_shift;
op_params.output_offset = data->output_offset;
op_params.output_multiplier = data->output_multiplier;
op_params.output_shift = data->output_shift;
SetActivationParams(data->output_activation_min, data->output_activation_max,
&op_params);
bool need_broadcast = reference_ops::ProcessBroadcastShapes(
tflite::micro::GetTensorShape(input1),
tflite::micro::GetTensorShape(input2), &op_params);
switch (output->type) {
case kTfLiteInt8: {
if (need_broadcast) {
reference_integer_ops::BroadcastAdd4DSlow(
op_params, tflite::micro::GetTensorShape(input1),
tflite::micro::GetTensorData<int8_t>(input1),
tflite::micro::GetTensorShape(input2),
tflite::micro::GetTensorData<int8_t>(input2),
tflite::micro::GetTensorShape(output),
tflite::micro::GetTensorData<int8_t>(output));
} else {
reference_integer_ops::Add(
op_params, tflite::micro::GetTensorShape(input1),
tflite::micro::GetTensorData<int8_t>(input1),
tflite::micro::GetTensorShape(input2),
tflite::micro::GetTensorData<int8_t>(input2),
tflite::micro::GetTensorShape(output),
tflite::micro::GetTensorData<int8_t>(output));
}
break;
}
case kTfLiteInt16: {
if (need_broadcast) {
reference_ops::BroadcastAdd4DSlow(
op_params, tflite::micro::GetTensorShape(input1),
tflite::micro::GetTensorData<int16_t>(input1),
tflite::micro::GetTensorShape(input2),
tflite::micro::GetTensorData<int16_t>(input2),
tflite::micro::GetTensorShape(output),
tflite::micro::GetTensorData<int16_t>(output));
} else {
reference_ops::Add(op_params, tflite::micro::GetTensorShape(input1),
tflite::micro::GetTensorData<int16_t>(input1),
tflite::micro::GetTensorShape(input2),
tflite::micro::GetTensorData<int16_t>(input2),
tflite::micro::GetTensorShape(output),
tflite::micro::GetTensorData<int16_t>(output),
false);
}
break;
}
default:
MicroPrintf("Type %s (%d) not supported.",
TfLiteTypeGetName(output->type), output->type);
return kTfLiteError;
}
return kTfLiteOk;
}
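The offsets, multipliers, and shifts copied into op_params come from TFLite’s affine quantization scheme, in which each quantized tensor represents real values as real = scale * (quantized - zero_point). Because the two inputs and the output generally have different scales and zero points, each input is re-centered and rescaled to a common resolution before the integer addition, and the sum is then rescaled and clamped for the output; the integer multipliers and shifts are fixed-point approximations of the relevant scale ratios. The following float-math sketch (with parameters passed in directly rather than taken from a real model) shows the arithmetic that the quantized kernel approximates with integer operations.
#include <algorithm>
#include <cmath>
#include <cstdint>

// Float-math illustration of int8 quantized addition: recover the real values,
// add them, then requantize into the output tensor's scale and zero point.
int8_t QuantizedAddScalar(int8_t q1, float scale1, int32_t zero_point1,
                          int8_t q2, float scale2, int32_t zero_point2,
                          float out_scale, int32_t out_zero_point) {
  const float real1 = scale1 * static_cast<float>(q1 - zero_point1);
  const float real2 = scale2 * static_cast<float>(q2 - zero_point2);
  const int32_t q_out =
      static_cast<int32_t>(std::lround((real1 + real2) / out_scale)) +
      out_zero_point;
  // The real kernel additionally clamps to the fused activation range.
  return static_cast<int8_t>(std::clamp(q_out, int32_t{-128}, int32_t{127}));
}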
For kTfLiteInt8 output when broadcast is not required, we make a call to reference_integer_ops::Add().
“Broadcasting” is the process of expanding arrays so that they have compatible shapes for arithmetic operations. For example, matrix addition requires that the two inputs have the same dimensions, so adding a length-3 row vector to a 2x3 matrix requires the vector to be (conceptually) repeated across both rows, as sketched below.
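As a minimal illustration (hand-written, not the tflite-micro implementation, which handles general 4-D shapes), here is what broadcasting a per-column bias across the rows of a matrix looks like.
#include <cstdint>

// Add a length-`cols` bias vector to every row of a rows x cols matrix.
// The bias is conceptually "broadcast" to the matrix's shape. Saturation and
// requantization, which the real kernels handle, are ignored here.
void BroadcastAddRows(const int8_t* matrix, const int8_t* bias, int8_t* output,
                      int rows, int cols) {
  for (int r = 0; r < rows; ++r) {
    for (int c = 0; c < cols; ++c) {
      // The same bias element is reused for every row.
      output[r * cols + c] =
          static_cast<int8_t>(matrix[r * cols + c] + bias[c]);
    }
  }
}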
tensorflow/lite/kernels/internal/reference/add.h
template <typename T>
inline void Add(const ArithmeticParams& params,
const RuntimeShape& input1_shape, const T* input1_data,
const RuntimeShape& input2_shape, const T* input2_data,
const RuntimeShape& output_shape, T* output_data) {
T activation_min, activation_max;
GetActivationParams(params, &activation_min, &activation_max);
const int flat_size =
MatchingElementsSize(input1_shape, input2_shape, output_shape);
for (int i = 0; i < flat_size; ++i) {
output_data[i] = ActivationFunctionWithMinMax(
input1_data[i] + input2_data[i], activation_min, activation_max);
}
}
As you might expect, this implementation effectively boils down to iterating through the two input tensors and calling input1_data[i] + input2_data[i]. This can be thought of as a lowest-common-denominator implementation in that it doesn’t leverage any hardware-specific functionality; any processor can perform sequential addition. However, as evidenced by the effectively unlimited demand in the Graphics Processing Unit (GPU) market, there are significant performance gains to be had by parallelizing operations in hardware. Fortunately, many of the operations that are necessary for performing inference are “embarrassingly parallel”. For example, rather than iterating through tensors to perform sequential addition, which may take many processor cycles, we could point a processor at the two inputs and, if supported by the hardware, the entire matrix addition operation could be completed in a “single” cycle.
It is unlikely that the operation would literally take one cycle given the complexity of modern processors, but the point is that the runtime of the entire operation could be reduced to the same order of magnitude of cycles as one step of the sequential addition implementation.
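To make the idea of data parallelism concrete, the sketch below processes sixteen int8 elements per loop iteration using GCC/Clang vector extensions; on a target with SIMD support, the compiler can lower the vector addition to a single wide add instruction. This is purely illustrative and is not how the tflite-micro kernels are written.
#include <cstdint>
#include <cstring>

// Sixteen int8 lanes treated as a single value; "+" applies element-wise.
typedef int8_t int8x16 __attribute__((vector_size(16)));

void AddInt8(const int8_t* a, const int8_t* b, int8_t* out, int n) {
  int i = 0;
  for (; i + 16 <= n; i += 16) {
    int8x16 va, vb;
    std::memcpy(&va, a + i, 16);
    std::memcpy(&vb, b + i, 16);
    const int8x16 vc = va + vb;  // 16 additions expressed as one operation
    std::memcpy(out + i, &vc, 16);
  }
  for (; i < n; ++i) {
    out[i] = static_cast<int8_t>(a[i] + b[i]);  // scalar tail
  }
}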
Obviously microcontrollers don’t have massive GPUs like those installed in cloud provider datacenters. However, many do implement architecture extensions that enable these common operations to be accelerated. Because Tensorflow Lite enables different kernel implementations, this hardware acceleration can be leveraged when supported.
Many microcontrollers implement Arm Cortex-M cores. For example, chips like the Raspberry Pi RP2350 and Nordic Semiconductor nRF54H20 implement multiple Arm Cortex-M33 cores. The former implements the Armv8-M Digital Signal Processing (DSP) Extension, which adds support for Single Instruction Multiple Data (SIMD) instructions. More capable chips, like the Alif Ensemble E3, implement Cortex-M55 cores, with support for the Arm M-Profile Vector Extension (MVE), also referred to as Arm Helium. The E3 also includes dedicated accelerators in the form of Arm’s Ethos-U Neural Processing Units (NPUs).
Arm provides software that allows hardware supporting one or more of these extensions to accelerate Tensorflow Lite kernel implementations. For example, the CMSIS-NN library offers kernel implementations that leverage no extensions (i.e. pure C), just the DSP extension, or the MVE extension (which requires that the DSP extension also be implemented). tflite-micro has kernel “ports” that integrate the CMSIS-NN functionality. Let’s take a look at how the add operation differs when using the CMSIS-NN kernels.
The setup looks largely the same as the first add operation kernel we observed. However, when we reach EvalAddQuantizedInt8(), we can start to see where hardware acceleration is leveraged.
tensorflow/lite/micro/kernels/cmsis_nn/add.cc
TfLiteStatus EvalAddQuantizedInt8(TfLiteContext* context, TfLiteNode* node,
TfLiteAddParams* params, const OpData* data,
const TfLiteEvalTensor* input1,
const TfLiteEvalTensor* input2,
TfLiteEvalTensor* output) {
tflite::ArithmeticParams op_params;
UpdateOpParams(&op_params, data);
bool need_broadcast = reference_ops::ProcessBroadcastShapes(
tflite::micro::GetTensorShape(input1),
tflite::micro::GetTensorShape(input2), &op_params);
if (need_broadcast) {
reference_integer_ops::BroadcastAdd4DSlow(
op_params, tflite::micro::GetTensorShape(input1),
tflite::micro::GetTensorData<int8_t>(input1),
tflite::micro::GetTensorShape(input2),
tflite::micro::GetTensorData<int8_t>(input2),
tflite::micro::GetTensorShape(output),
tflite::micro::GetTensorData<int8_t>(output));
} else {
arm_elementwise_add_s8(
tflite::micro::GetTensorData<int8_t>(input1),
tflite::micro::GetTensorData<int8_t>(input2), op_params.input1_offset,
op_params.input1_multiplier, op_params.input1_shift,
op_params.input2_offset, op_params.input2_multiplier,
op_params.input2_shift, op_params.left_shift,
tflite::micro::GetTensorData<int8_t>(output), op_params.output_offset,
op_params.output_multiplier, op_params.output_shift,
op_params.quantized_activation_min, op_params.quantized_activation_max,
MatchingElementsSize(tflite::micro::GetTensorShape(input1),
tflite::micro::GetTensorShape(input2),
tflite::micro::GetTensorShape(output)));
}
return kTfLiteOk;
}
The arm_elementwise_add_s8() function is provided by CMSIS-NN, and the implementation leverages different hardware functionality depending on what extensions are available.
Source/BasicMathFunctions/arm_elementwise_add_s8.c
arm_cmsis_nn_status arm_elementwise_add_s8(const int8_t *input_1_vect,
const int8_t *input_2_vect,
const int32_t input_1_offset,
const int32_t input_1_mult,
const int32_t input_1_shift,
const int32_t input_2_offset,
const int32_t input_2_mult,
const int32_t input_2_shift,
const int32_t left_shift,
int8_t *output,
const int32_t out_offset,
const int32_t out_mult,
const int32_t out_shift,
const int32_t out_activation_min,
const int32_t out_activation_max,
const int32_t block_size)
{
#if defined(ARM_MATH_MVEI)
int32_t count = block_size;
while (count > 0)
{
int32x4_t vect_1;
int32x4_t vect_2;
mve_pred16_t p = vctp32q((uint32_t)count);
vect_1 = vldrbq_z_s32(input_1_vect, p);
vect_2 = vldrbq_z_s32(input_2_vect, p);
vect_1 = vaddq_s32(vect_1, vdupq_n_s32(input_1_offset));
vect_2 = vaddq_s32(vect_2, vdupq_n_s32(input_2_offset));
vect_1 = vshlq_r_s32(vect_1, left_shift);
vect_2 = vshlq_r_s32(vect_2, left_shift);
vect_1 = arm_requantize_mve(vect_1, input_1_mult, input_1_shift);
vect_2 = arm_requantize_mve(vect_2, input_2_mult, input_2_shift);
vect_1 = vaddq_s32(vect_1, vect_2);
vect_1 = arm_requantize_mve(vect_1, out_mult, out_shift);
vect_1 = vaddq_n_s32(vect_1, out_offset);
vect_1 = vmaxq_s32(vect_1, vdupq_n_s32(out_activation_min));
vect_1 = vminq_s32(vect_1, vdupq_n_s32(out_activation_max));
input_1_vect += 4;
input_2_vect += 4;
vstrbq_p_s32(output, vect_1, p);
output += 4;
count -= 4;
}
#else
int32_t loop_count;
int32_t input_1;
int32_t input_2;
int32_t sum;
#if defined(ARM_MATH_DSP)
int32_t a_1, b_1, a_2, b_2;
int32_t offset_1_packed, offset_2_packed;
int8_t r1, r2, r3, r4;
offset_1_packed = (input_1_offset << 16U) | (input_1_offset & 0x0FFFFL);
offset_2_packed = (input_2_offset << 16U) | (input_2_offset & 0x0FFFFL);
loop_count = block_size >> 2;
while (loop_count > 0)
{
/* 4 outputs are calculated in one loop. The order of calculation is follows the order of output sign extension
intrinsic */
input_1_vect = read_and_pad_reordered(input_1_vect, &b_1, &a_1);
input_2_vect = read_and_pad_reordered(input_2_vect, &b_2, &a_2);
a_1 = SADD16(a_1, offset_1_packed);
b_1 = SADD16(b_1, offset_1_packed);
a_2 = SADD16(a_2, offset_2_packed);
b_2 = SADD16(b_2, offset_2_packed);
/* Sum 1 */
input_1 = (b_1 & 0x0FFFF) << left_shift;
input_1 = arm_nn_requantize(input_1, input_1_mult, input_1_shift);
input_2 = (b_2 & 0x0FFFF) << left_shift;
input_2 = arm_nn_requantize(input_2, input_2_mult, input_2_shift);
sum = input_1 + input_2;
sum = arm_nn_requantize(sum, out_mult, out_shift);
sum += out_offset;
sum = MAX(sum, out_activation_min);
sum = MIN(sum, out_activation_max);
r1 = (int8_t)sum;
/* Sum 3 */
input_1 = ((b_1 >> 16) & 0x0FFFF) << left_shift;
input_1 = arm_nn_requantize(input_1, input_1_mult, input_1_shift);
input_2 = ((b_2 >> 16) & 0x0FFFF) << left_shift;
input_2 = arm_nn_requantize(input_2, input_2_mult, input_2_shift);
sum = input_1 + input_2;
sum = arm_nn_requantize(sum, out_mult, out_shift);
sum += out_offset;
sum = MAX(sum, out_activation_min);
sum = MIN(sum, out_activation_max);
r3 = (int8_t)sum;
/* Sum 2 */
input_1 = (a_1 & 0x0FFFF) << left_shift;
input_1 = arm_nn_requantize(input_1, input_1_mult, input_1_shift);
input_2 = (a_2 & 0x0FFFF) << left_shift;
input_2 = arm_nn_requantize(input_2, input_2_mult, input_2_shift);
sum = input_1 + input_2;
sum = arm_nn_requantize(sum, out_mult, out_shift);
sum += out_offset;
sum = MAX(sum, out_activation_min);
sum = MIN(sum, out_activation_max);
r2 = (int8_t)sum;
/* Sum 4 */
input_1 = ((a_1 >> 16) & 0x0FFFF) << left_shift;
input_1 = arm_nn_requantize(input_1, input_1_mult, input_1_shift);
input_2 = ((a_2 >> 16) & 0x0FFFF) << left_shift;
input_2 = arm_nn_requantize(input_2, input_2_mult, input_2_shift);
sum = input_1 + input_2;
sum = arm_nn_requantize(sum, out_mult, out_shift);
sum += out_offset;
sum = MAX(sum, out_activation_min);
sum = MIN(sum, out_activation_max);
r4 = (int8_t)sum;
arm_nn_write_s8x4_ia(&output, PACK_S8x4_32x1(r1, r2, r3, r4));
loop_count--;
}
loop_count = block_size & 0x3;
#else
loop_count = block_size;
#endif
while (loop_count > 0)
{
/* C = A + B */
input_1 = (*input_1_vect++ + input_1_offset) << left_shift;
input_2 = (*input_2_vect++ + input_2_offset) << left_shift;
input_1 = arm_nn_requantize(input_1, input_1_mult, input_1_shift);
input_2 = arm_nn_requantize(input_2, input_2_mult, input_2_shift);
sum = input_1 + input_2;
sum = arm_nn_requantize(sum, out_mult, out_shift);
sum += out_offset;
sum = MAX(sum, out_activation_min);
sum = MIN(sum, out_activation_max);
*output++ = (int8_t)sum;
/* Decrement loop counter */
loop_count--;
}
#endif /* ARM_MATH_MVEI */
return (ARM_CMSIS_NN_SUCCESS);
}
For example, if the DSP extension is present, the parallel signed 16-bit addition (SADD16) instruction provided by the extension is used to reduce the number of loop iterations required: 8-bit signed integers are packed into 16-bit halfwords, and 4 outputs are calculated in a single iteration. If the MVE extension is present, vector addition instructions (VADD) can be used directly, making the calculation even more efficient.
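For reference, SADD16 performs two independent signed 16-bit additions within a single 32-bit register, which is how the DSP path above adds the packed zero-point offsets to two elements at a time. A plain-C++ emulation of the instruction’s result (illustrative only, not the CMSIS implementation) looks like this.
#include <cstdint>

// Emulation of the result of the SADD16 instruction: two independent signed
// 16-bit additions packed into one 32-bit register. Results wrap modulo 2^16;
// the real instruction also sets the APSR.GE flags.
uint32_t Sadd16Emulated(uint32_t a, uint32_t b) {
  const uint16_t lo = static_cast<uint16_t>(
      static_cast<int16_t>(a & 0xFFFF) + static_cast<int16_t>(b & 0xFFFF));
  const uint16_t hi = static_cast<uint16_t>(
      static_cast<int16_t>(a >> 16) + static_cast<int16_t>(b >> 16));
  return (static_cast<uint32_t>(hi) << 16) | lo;
}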
These optimizations are made available via configuration when compiling tflite-micro. They can be applied to any model that utilizes the relevant operations, without the need to modify the model when moving from one architecture to another. Some optimizations do require modifying a model. For example, when using microcontrollers that include Arm’s Ethos-U NPUs, such as the previously mentioned Alif Ensemble E3, you can run .tflite models through the Vela compiler. Converted models replace sequences of built-in operators with a custom ETHOSU operator and a command stream. The application processor notifies the NPU of the address of the command stream and other relevant data, then triggers it to perform inference.
Unlike the addition operator, there is no fallback kernel implementation; models converted via the Vela compiler cannot run on microcontrollers that do not have Ethos-U NPUs. For those that do, we can see the previously described logic in the Eval() implementation for the ETHOSU custom operator.
tensorflow/lite/micro/kernels/ethos_u/ethosu.cc
TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
TFLITE_DCHECK(node->user_data != nullptr);
TFLITE_DCHECK(context != nullptr);
TFLITE_DCHECK(context->GetScratchBuffer != nullptr);
// Get base addresses.
TfLiteEvalTensor* tensor;
int i = 0;
int num_tensors = 0;
void* cms_data;
uint8_t co_type;
int result;
const OpData* data = static_cast<const OpData*>(node->user_data);
uint64_t* base_addrs = static_cast<uint64_t*>(
context->GetScratchBuffer(context, data->base_addr_idx));
size_t* base_addrs_size = static_cast<size_t*>(
context->GetScratchBuffer(context, data->base_addr_size_idx));
const uint8_t* custom_data =
static_cast<uint8_t const*>(node->custom_initial_data);
auto root = flexbuffers::GetRoot(custom_data, node->custom_initial_data_size);
co_type = root.AsInt8();
if (co_type != CO_TYPE_ETHOSU) {
MicroPrintf("CO_TYPE != ETHOSU");
return kTfLiteError;
}
// Get command stream data address.
tensor = context->GetEvalTensor(context, node->inputs->data[0]);
cms_data = reinterpret_cast<void*>(tensor->data.uint8);
// Get addresses to weights/scratch/input data.
for (i = 1; i < node->inputs->size; ++i) {
tensor = context->GetEvalTensor(context, node->inputs->data[i]);
base_addrs[num_tensors] =
static_cast<uint64_t>(reinterpret_cast<uintptr_t>(tensor->data.uint8));
size_t byte_size = 1;
for (int k = 0; k < tensor->dims->size; k++) {
byte_size = byte_size * tensor->dims->data[k];
}
base_addrs_size[num_tensors] = byte_size;
num_tensors++;
}
// Get addresses to output data.
for (i = 0; i < node->outputs->size; ++i) {
tensor = context->GetEvalTensor(context, node->outputs->data[i]);
base_addrs[num_tensors] =
static_cast<uint64_t>(reinterpret_cast<uintptr_t>(tensor->data.uint8));
size_t byte_size = 1;
for (int k = 0; k < tensor->dims->size; k++) {
byte_size = byte_size * tensor->dims->data[k];
}
base_addrs_size[num_tensors] = byte_size;
num_tensors++;
}
// When Vela optimizes a tflite file it will assign the tensors like this:
//
// +-------+------------------------+ +--------+-------------+
// | INPUT | Description | | OUTPUT | Description |
// +-------+------------------------+ +--------+-------------+
// | 0 | Ethos-U command stream | | 0..m | Outputs |
// | 1 | TFLM model | +--------+-------------+
// | 2 | TFLM arena |
// | 3 | Ethos-U fast scratch |
// | 4..n | Inputs |
// +-------+------------------------+
//
// This code will assign the NPU base addresses like this:
//
// +--------------+----------------------+
// | Base address | Description |
// +--------------+----------------------+
// | 0 | TFLM model |
// | 1 | TFLM arena |
// | 2 | Ethos-U fast scratch |
// | 3..n | Input tensors |
// | n..m | Output tensors |
// +--------------+----------------------+
//
// The number of base address will be limited to 8.
//
// NOTE! The command stream produced by Vela will access the IFM and OFM
// buffers using base address 1. This means that it is not possible to point
// the input and output tensors outside of the TFLM arena.
num_tensors = std::min(num_tensors, 8);
struct ethosu_driver* drv = ethosu_reserve_driver();
result = ethosu_invoke_v3(drv, cms_data, data->cms_data_size, base_addrs,
base_addrs_size, num_tensors,
GetMicroContext(context)->external_context());
ethosu_release_driver(drv);
if (-1 == result) {
return kTfLiteError;
} else {
return kTfLiteOk;
}
}
We’ve now seen the full spectrum of operator optimization, from kernels that are implemented purely in C, to those that leverage hardware instructions provided by architecture extensions, and finally to those that offload inference to a wholly separate processor. In future posts, we’ll explore how operators are encoded in .tflite files, and how the runtime ultimately invokes the underlying kernels.