The buzz around “edge AI”, a term that means something slightly different to almost everyone you talk to, has long since reached a fever pitch. Regardless of what edge AI means to you, the one commonality is typically that the hardware on which inference is performed is constrained in one or more dimensions, whether that be compute, memory, or network bandwidth. Perhaps the most constrained of these platforms are microcontrollers.

I have found that, while there is much discourse around “running AI” (i.e. performing inference) on microcontrollers, there is a general lack of information about what these systems are actually capable of, and how new hardware advancements impact that equation. It is my hope with this series to peel back some of the layers of terminology and explore what actually happens between supplying inputs to a model and receiving outputs. Along the way, we’ll ground our exploration in performing inference with real models on real constrained hardware.

While “weights” get the majority of the attention with AI models, they alone are not sufficient for performing inference. Depending on how a model is distributed and what runtime is used, additional data or metadata may be supplied alongside the model, or may be defined explicitly in the software that interacts with the weights. The most popular runtime for microcontrollers is TensorFlow Lite for Microcontrollers (tflite-micro), an optimized version of TensorFlow Lite.

Note: Google recently rebranded TensorFlow Lite to LiteRT, and tflite-micro to LiteRT for Microcontrollers.
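
To make the “defined explicitly in software” point concrete, the sketch below shows a minimal, hypothetical tflite-micro setup. The model weights and graph arrive as a .tflite flatbuffer (here a made-up g_model_data symbol compiled into the binary), while the scratch memory (the “tensor arena”) and the set of kernels the runtime is allowed to use (the op resolver) are supplied by the application code.

#include <cstddef>
#include <cstdint>

#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/micro/micro_mutable_op_resolver.h"
#include "tensorflow/lite/schema/schema_generated.h"

// Hypothetical symbol containing the .tflite flatbuffer, typically produced
// by converting the model file into a C array at build time.
extern const unsigned char g_model_data[];

// Statically allocated scratch memory for tensors and runtime bookkeeping.
constexpr size_t kArenaSize = 16 * 1024;
alignas(16) static uint8_t tensor_arena[kArenaSize];

int RunInference() {
  const tflite::Model* model = tflite::GetModel(g_model_data);

  // Only the operators registered here are available at inference time.
  tflite::MicroMutableOpResolver<2> resolver;
  resolver.AddAdd();
  resolver.AddFullyConnected();

  tflite::MicroInterpreter interpreter(model, resolver, tensor_arena,
                                       kArenaSize);
  if (interpreter.AllocateTensors() != kTfLiteOk) {
    return -1;
  }

  TfLiteTensor* input = interpreter.input(0);
  // ... populate input->data ...
  (void)input;

  if (interpreter.Invoke() != kTfLiteOk) {
    return -1;
  }

  TfLiteTensor* output = interpreter.output(0);
  // ... consume output->data ...
  (void)output;
  return 0;
}

The weights alone say nothing about which kernels will execute them; that mapping comes from the graph stored in the .tflite file combined with the operator registrations made by the application.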

tflite-micro uses the .tflite file format, which encodes data using FlatBuffers. Unlike some other model file formats, .tflite files include not only the tensors that encapsulate model weights, but also the computation graph, which tells the runtime which operations to use when performing inference. For this to work, there needs to be a defined set of operators. This is somewhat analogous to the instructions defined in an instruction set architecture (ISA) for a processor. With an ISA, a compiler takes a higher-level programming language and maps its behavior onto the instructions available in the ISA. TensorFlow supports an extensive set of built-in operators, while TensorFlow Lite, and thus tflite-micro, supports only a subset.
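
As a quick, hypothetical illustration of the graph travelling alongside the weights, the sketch below uses the generated FlatBuffers schema header to list the operators referenced by a model buffer; we’ll dig into how this encoding actually works in a future post.

#include <cstdio>

#include "tensorflow/lite/schema/schema_generated.h"
#include "tensorflow/lite/schema/schema_utils.h"

// model_buffer is assumed to point at the raw bytes of a .tflite file.
void ListOperators(const void* model_buffer) {
  const tflite::Model* model = tflite::GetModel(model_buffer);
  const auto* op_codes = model->operator_codes();
  const auto* subgraph = model->subgraphs()->Get(0);

  for (const tflite::Operator* op : *subgraph->operators()) {
    const tflite::OperatorCode* code = op_codes->Get(op->opcode_index());
    // GetBuiltinCode() abstracts over the legacy and current schema fields.
    printf("%s\n",
           tflite::EnumNameBuiltinOperator(tflite::GetBuiltinCode(code)));
  }
}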

Continuing the analogy, many processors implement specific versions of the Arm architecture, but that doesn’t mean that processors implementing the same ISA are equivalent. Every supported instruction has to be implemented in hardware, and decisions about how the processor is designed can impact performance along multiple dimensions. Similarly, while TensorFlow Lite defines a set of operators, the implementations of those operators, referred to as kernels, may vary. Kernels are implemented in software, but depending on the underlying hardware, a kernel might take many instructions to execute, or it may be optimized to leverage dedicated hardware support.

A simple example is the addition operator (TFL::AddOp). We’ll cover how operators and kernels are registered and invoked in a future post, but let’s start by taking a look at the default tflite-micro addition operator logic.

tensorflow/lite/micro/kernels/add.cc

TfLiteStatus AddEval(TfLiteContext* context, TfLiteNode* node) {
  auto* params = reinterpret_cast<TfLiteAddParams*>(node->builtin_data);

  TFLITE_DCHECK(node->user_data != nullptr);
  const OpDataAdd* data = static_cast<const OpDataAdd*>(node->user_data);

  const TfLiteEvalTensor* input1 =
      tflite::micro::GetEvalInput(context, node, kAddInputTensor1);
  const TfLiteEvalTensor* input2 =
      tflite::micro::GetEvalInput(context, node, kAddInputTensor2);
  TfLiteEvalTensor* output =
      tflite::micro::GetEvalOutput(context, node, kAddOutputTensor);

  if (output->type == kTfLiteFloat32 || output->type == kTfLiteInt32) {
    TF_LITE_ENSURE_OK(
        context, EvalAdd(context, node, params, data, input1, input2, output));
  } else if (output->type == kTfLiteInt8 || output->type == kTfLiteInt16) {
    TF_LITE_ENSURE_OK(context, EvalAddQuantized(context, node, params, data,
                                                input1, input2, output));
  } else {
    MicroPrintf("Type %s (%d) not supported.", TfLiteTypeGetName(output->type),
                output->type);
    return kTfLiteError;
  }

  return kTfLiteOk;
}

TFLMRegistration Register_ADD() {
  return tflite::micro::RegisterOp(AddInit, AddPrepare, AddEval);
}

As can be seen in AddEval(), the expected output type determines which implementation of the operator is invoked. To illustrate how the underlying hardware impacts performance, let’s focus on the case in which we expect kTfLiteInt8 (signed 8-bit integer) or kTfLiteInt16 (signed 16-bit integer) output, meaning that we’ll call EvalAddQuantized().

tensorflow/lite/micro/kernels/add.cc

TfLiteStatus EvalAddQuantized(TfLiteContext* context, TfLiteNode* node,
                              TfLiteAddParams* params, const OpDataAdd* data,
                              const TfLiteEvalTensor* input1,
                              const TfLiteEvalTensor* input2,
                              TfLiteEvalTensor* output) {
  tflite::ArithmeticParams op_params = {};
  op_params.left_shift = data->left_shift;
  op_params.input1_offset = data->input1_offset;
  op_params.input1_multiplier = data->input1_multiplier;
  op_params.input1_shift = data->input1_shift;
  op_params.input2_offset = data->input2_offset;
  op_params.input2_multiplier = data->input2_multiplier;
  op_params.input2_shift = data->input2_shift;
  op_params.output_offset = data->output_offset;
  op_params.output_multiplier = data->output_multiplier;
  op_params.output_shift = data->output_shift;
  SetActivationParams(data->output_activation_min, data->output_activation_max,
                      &op_params);
  bool need_broadcast = reference_ops::ProcessBroadcastShapes(
      tflite::micro::GetTensorShape(input1),
      tflite::micro::GetTensorShape(input2), &op_params);

  switch (output->type) {
    case kTfLiteInt8: {
      if (need_broadcast) {
        reference_integer_ops::BroadcastAdd4DSlow(
            op_params, tflite::micro::GetTensorShape(input1),
            tflite::micro::GetTensorData<int8_t>(input1),
            tflite::micro::GetTensorShape(input2),
            tflite::micro::GetTensorData<int8_t>(input2),
            tflite::micro::GetTensorShape(output),
            tflite::micro::GetTensorData<int8_t>(output));
      } else {
        reference_integer_ops::Add(
            op_params, tflite::micro::GetTensorShape(input1),
            tflite::micro::GetTensorData<int8_t>(input1),
            tflite::micro::GetTensorShape(input2),
            tflite::micro::GetTensorData<int8_t>(input2),
            tflite::micro::GetTensorShape(output),
            tflite::micro::GetTensorData<int8_t>(output));
      }
      break;
    }
    case kTfLiteInt16: {
      if (need_broadcast) {
        reference_ops::BroadcastAdd4DSlow(
            op_params, tflite::micro::GetTensorShape(input1),
            tflite::micro::GetTensorData<int16_t>(input1),
            tflite::micro::GetTensorShape(input2),
            tflite::micro::GetTensorData<int16_t>(input2),
            tflite::micro::GetTensorShape(output),
            tflite::micro::GetTensorData<int16_t>(output));
      } else {
        reference_ops::Add(op_params, tflite::micro::GetTensorShape(input1),
                           tflite::micro::GetTensorData<int16_t>(input1),
                           tflite::micro::GetTensorShape(input2),
                           tflite::micro::GetTensorData<int16_t>(input2),
                           tflite::micro::GetTensorShape(output),
                           tflite::micro::GetTensorData<int16_t>(output),
                           false);
      }
      break;
    }
    default:
      MicroPrintf("Type %s (%d) not supported.",
                  TfLiteTypeGetName(output->type), output->type);
      return kTfLiteError;
  }

  return kTfLiteOk;
}

For kTfLiteInt8 output, when broadcasting is not required, we make a call to reference_integer_ops::Add().

“Broadcasting” is the process of expanding arrays so that they have compatible shapes for arithmetic operations. For example, matrix addition requires that the two inputs have the same dimensions, so adding a vector to every row of a matrix requires the vector to first be (conceptually) replicated to match the matrix’s shape.
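
As a toy illustration (plain C++, not TFLite code), broadcasting allows a shape-[3] bias to be added to a shape-[2, 3] matrix by conceptually repeating the bias for each row:

#include <array>
#include <cstddef>
#include <cstdio>

int main() {
  // matrix has shape [2, 3]; bias has shape [3].
  std::array<std::array<int, 3>, 2> matrix = {{{1, 2, 3}, {4, 5, 6}}};
  std::array<int, 3> bias = {10, 20, 30};

  // Broadcasting: the bias behaves as if it were replicated across rows so
  // that both operands effectively have shape [2, 3].
  for (std::size_t r = 0; r < matrix.size(); ++r) {
    for (std::size_t c = 0; c < matrix[r].size(); ++c) {
      matrix[r][c] += bias[c];
    }
  }

  for (const auto& row : matrix) {
    printf("%d %d %d\n", row[0], row[1], row[2]);  // 11 22 33 / 14 25 36
  }
  return 0;
}

BroadcastAdd4DSlow() generalizes this idea to arbitrary 4D shapes, which is why it is used only when ProcessBroadcastShapes() reports that broadcasting is needed.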

tensorflow/lite/kernels/internal/reference/add.h

template <typename T>
inline void Add(const ArithmeticParams& params,
                const RuntimeShape& input1_shape, const T* input1_data,
                const RuntimeShape& input2_shape, const T* input2_data,
                const RuntimeShape& output_shape, T* output_data) {
  T activation_min, activation_max;
  GetActivationParams(params, &activation_min, &activation_max);

  const int flat_size =
      MatchingElementsSize(input1_shape, input2_shape, output_shape);
  for (int i = 0; i < flat_size; ++i) {
    output_data[i] = ActivationFunctionWithMinMax(
        input1_data[i] + input2_data[i], activation_min, activation_max);
  }
}

As you might expect, this implementation effectively boils down to iterating through the two input tensors and computing input1_data[i] + input2_data[i]. This can be thought of as a lowest common denominator implementation in that it doesn’t leverage any hardware-specific functionality; any processor can perform sequential addition. However, as evidenced by the effectively unlimited demand in the Graphics Processing Unit (GPU) market, there are significant performance gains to be had by parallelizing operations in hardware. Fortunately, many of the operations necessary for performing inference are “embarrassingly parallel”. For example, rather than iterating through tensors to perform sequential addition, which may take many processor cycles, we could point a processor at the two inputs and, if supported by the hardware, the entire matrix addition could be completed in a “single” cycle.

It is unlikely that the operation would literally take one cycle given the complexity of modern processors, but the point is that the runtime of the entire operation could be reduced to the same order of magnitude of cycles as one step of the sequential addition implementation.
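
As a rough sketch of the idea (not tflite-micro code, and assuming a GCC or Clang toolchain), the vector extensions below express sixteen 8-bit additions as a single expression; on a core with SIMD support the compiler can lower that expression to one or a handful of vector instructions instead of sixteen scalar ones.

#include <cstdint>

// Scalar version: one addition per loop iteration.
void add_scalar(const int8_t* a, const int8_t* b, int8_t* out, int n) {
  for (int i = 0; i < n; ++i) {
    out[i] = a[i] + b[i];
  }
}

// Vector version: GCC/Clang vector types express sixteen lanes at once.
typedef int8_t v16i8 __attribute__((vector_size(16)));

void add_vector(const v16i8* a, const v16i8* b, v16i8* out, int n_vectors) {
  for (int i = 0; i < n_vectors; ++i) {
    out[i] = a[i] + b[i];  // sixteen element-wise additions per iteration
  }
}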

Obviously, microcontrollers don’t have massive GPUs like those installed in cloud provider datacenters. However, many do implement architecture extensions that allow these common operations to be accelerated. Because TensorFlow Lite allows kernel implementations to be swapped out, this hardware acceleration can be leveraged when supported.

Many microcontrollers implement Arm Cortex-M cores. For example, chips like the Raspberry Pi RP2350 and Nordic Semiconductor nRF54H20 implement multiple Arm Cortex-M33 cores. The former implements the Armv8-M Digital Signal Processing (DSP) Extension, which adds support for Single Instruction Multiple Data (SIMD) instructions. More capable chips, like the Alif Ensemble E3, implement Cortex-M55 cores with support for the Arm M-Profile Vector Extension (MVE), also referred to as Arm Helium. The E3 also includes dedicated accelerators in the form of Arm’s Ethos-U Neural Processing Units (NPUs).
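
Software can detect these extensions at compile time through the Arm C Language Extensions (ACLE) feature macros, which is roughly how libraries decide which implementation to build. Below is a minimal sketch, assuming an Arm toolchain; the ARM_MATH_DSP and ARM_MATH_MVEI defines that appear in the kernel source later in this post are typically derived from these macros or from equivalent build flags.

// Reports which add implementation this translation unit would be built with.
const char* AddImplementation() {
#if defined(__ARM_FEATURE_MVE) && (__ARM_FEATURE_MVE & 1)
  // Bit 0 of __ARM_FEATURE_MVE indicates integer MVE (Helium) support.
  return "MVE (Helium) vector instructions";
#elif defined(__ARM_FEATURE_DSP)
  return "DSP extension SIMD instructions";
#else
  return "plain scalar C";
#endif
}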

Arm provides software that allows hardware supporting one or more of these extensions to accelerate TensorFlow Lite kernel implementations. For example, the CMSIS-NN library offers kernel implementations that use no extensions (i.e. pure C), that leverage just the DSP extension, or that leverage the MVE extension (which requires the DSP extension also be implemented). tflite-micro has kernel “ports” that integrate the CMSIS-NN functionality. Let’s take a look at how the add operation differs when using the CMSIS-NN kernels.

The setup looks largely the same as the first add operation kernel we observed. However, when we reach EvalAddQuantizedInt8(), we can start to see where hardware acceleration is leveraged.

tensorflow/lite/micro/kernels/cmsis_nn/add.cc

TfLiteStatus EvalAddQuantizedInt8(TfLiteContext* context, TfLiteNode* node,
                                  TfLiteAddParams* params, const OpData* data,
                                  const TfLiteEvalTensor* input1,
                                  const TfLiteEvalTensor* input2,
                                  TfLiteEvalTensor* output) {
  tflite::ArithmeticParams op_params;
  UpdateOpParams(&op_params, data);

  bool need_broadcast = reference_ops::ProcessBroadcastShapes(
      tflite::micro::GetTensorShape(input1),
      tflite::micro::GetTensorShape(input2), &op_params);

  if (need_broadcast) {
    reference_integer_ops::BroadcastAdd4DSlow(
        op_params, tflite::micro::GetTensorShape(input1),
        tflite::micro::GetTensorData<int8_t>(input1),
        tflite::micro::GetTensorShape(input2),
        tflite::micro::GetTensorData<int8_t>(input2),
        tflite::micro::GetTensorShape(output),
        tflite::micro::GetTensorData<int8_t>(output));
  } else {
    arm_elementwise_add_s8(
        tflite::micro::GetTensorData<int8_t>(input1),
        tflite::micro::GetTensorData<int8_t>(input2), op_params.input1_offset,
        op_params.input1_multiplier, op_params.input1_shift,
        op_params.input2_offset, op_params.input2_multiplier,
        op_params.input2_shift, op_params.left_shift,
        tflite::micro::GetTensorData<int8_t>(output), op_params.output_offset,
        op_params.output_multiplier, op_params.output_shift,
        op_params.quantized_activation_min, op_params.quantized_activation_max,
        MatchingElementsSize(tflite::micro::GetTensorShape(input1),
                             tflite::micro::GetTensorShape(input2),
                             tflite::micro::GetTensorShape(output)));
  }

  return kTfLiteOk;
}

The arm_elementwise_add_s8() function is provided by CMSIS-NN, and the implementation leverages different hardware functionality depending on what extensions are available.

Source/BasicMathFunctions/arm_elementwise_add_s8.c

arm_cmsis_nn_status arm_elementwise_add_s8(const int8_t *input_1_vect,
                                           const int8_t *input_2_vect,
                                           const int32_t input_1_offset,
                                           const int32_t input_1_mult,
                                           const int32_t input_1_shift,
                                           const int32_t input_2_offset,
                                           const int32_t input_2_mult,
                                           const int32_t input_2_shift,
                                           const int32_t left_shift,
                                           int8_t *output,
                                           const int32_t out_offset,
                                           const int32_t out_mult,
                                           const int32_t out_shift,
                                           const int32_t out_activation_min,
                                           const int32_t out_activation_max,
                                           const int32_t block_size)
{
#if defined(ARM_MATH_MVEI)
    int32_t count = block_size;

    while (count > 0)
    {
        int32x4_t vect_1;
        int32x4_t vect_2;

        mve_pred16_t p = vctp32q((uint32_t)count);

        vect_1 = vldrbq_z_s32(input_1_vect, p);
        vect_2 = vldrbq_z_s32(input_2_vect, p);

        vect_1 = vaddq_s32(vect_1, vdupq_n_s32(input_1_offset));
        vect_2 = vaddq_s32(vect_2, vdupq_n_s32(input_2_offset));

        vect_1 = vshlq_r_s32(vect_1, left_shift);
        vect_2 = vshlq_r_s32(vect_2, left_shift);

        vect_1 = arm_requantize_mve(vect_1, input_1_mult, input_1_shift);
        vect_2 = arm_requantize_mve(vect_2, input_2_mult, input_2_shift);

        vect_1 = vaddq_s32(vect_1, vect_2);
        vect_1 = arm_requantize_mve(vect_1, out_mult, out_shift);

        vect_1 = vaddq_n_s32(vect_1, out_offset);

        vect_1 = vmaxq_s32(vect_1, vdupq_n_s32(out_activation_min));
        vect_1 = vminq_s32(vect_1, vdupq_n_s32(out_activation_max));

        input_1_vect += 4;
        input_2_vect += 4;
        vstrbq_p_s32(output, vect_1, p);

        output += 4;
        count -= 4;
    }
#else
    int32_t loop_count;
    int32_t input_1;
    int32_t input_2;
    int32_t sum;

    #if defined(ARM_MATH_DSP)
    int32_t a_1, b_1, a_2, b_2;

    int32_t offset_1_packed, offset_2_packed;

    int8_t r1, r2, r3, r4;

    offset_1_packed = (input_1_offset << 16U) | (input_1_offset & 0x0FFFFL);
    offset_2_packed = (input_2_offset << 16U) | (input_2_offset & 0x0FFFFL);

    loop_count = block_size >> 2;

    while (loop_count > 0)
    {
        /* 4 outputs are calculated in one loop. The order of calculation is follows the order of output sign extension
           intrinsic */
        input_1_vect = read_and_pad_reordered(input_1_vect, &b_1, &a_1);
        input_2_vect = read_and_pad_reordered(input_2_vect, &b_2, &a_2);

        a_1 = SADD16(a_1, offset_1_packed);
        b_1 = SADD16(b_1, offset_1_packed);

        a_2 = SADD16(a_2, offset_2_packed);
        b_2 = SADD16(b_2, offset_2_packed);

        /* Sum 1 */
        input_1 = (b_1 & 0x0FFFF) << left_shift;

        input_1 = arm_nn_requantize(input_1, input_1_mult, input_1_shift);

        input_2 = (b_2 & 0x0FFFF) << left_shift;
        input_2 = arm_nn_requantize(input_2, input_2_mult, input_2_shift);

        sum = input_1 + input_2;
        sum = arm_nn_requantize(sum, out_mult, out_shift);
        sum += out_offset;
        sum = MAX(sum, out_activation_min);
        sum = MIN(sum, out_activation_max);
        r1 = (int8_t)sum;

        /* Sum 3 */
        input_1 = ((b_1 >> 16) & 0x0FFFF) << left_shift;
        input_1 = arm_nn_requantize(input_1, input_1_mult, input_1_shift);

        input_2 = ((b_2 >> 16) & 0x0FFFF) << left_shift;
        input_2 = arm_nn_requantize(input_2, input_2_mult, input_2_shift);

        sum = input_1 + input_2;
        sum = arm_nn_requantize(sum, out_mult, out_shift);
        sum += out_offset;
        sum = MAX(sum, out_activation_min);
        sum = MIN(sum, out_activation_max);
        r3 = (int8_t)sum;

        /* Sum 2 */
        input_1 = (a_1 & 0x0FFFF) << left_shift;
        input_1 = arm_nn_requantize(input_1, input_1_mult, input_1_shift);

        input_2 = (a_2 & 0x0FFFF) << left_shift;
        input_2 = arm_nn_requantize(input_2, input_2_mult, input_2_shift);

        sum = input_1 + input_2;
        sum = arm_nn_requantize(sum, out_mult, out_shift);
        sum += out_offset;
        sum = MAX(sum, out_activation_min);
        sum = MIN(sum, out_activation_max);
        r2 = (int8_t)sum;

        /* Sum 4 */
        input_1 = ((a_1 >> 16) & 0x0FFFF) << left_shift;
        input_1 = arm_nn_requantize(input_1, input_1_mult, input_1_shift);

        input_2 = ((a_2 >> 16) & 0x0FFFF) << left_shift;
        input_2 = arm_nn_requantize(input_2, input_2_mult, input_2_shift);

        sum = input_1 + input_2;
        sum = arm_nn_requantize(sum, out_mult, out_shift);
        sum += out_offset;
        sum = MAX(sum, out_activation_min);
        sum = MIN(sum, out_activation_max);
        r4 = (int8_t)sum;

        arm_nn_write_s8x4_ia(&output, PACK_S8x4_32x1(r1, r2, r3, r4));

        loop_count--;
    }

    loop_count = block_size & 0x3;
    #else
    loop_count = block_size;
    #endif

    while (loop_count > 0)
    {
        /* C = A + B */

        input_1 = (*input_1_vect++ + input_1_offset) << left_shift;
        input_2 = (*input_2_vect++ + input_2_offset) << left_shift;

        input_1 = arm_nn_requantize(input_1, input_1_mult, input_1_shift);
        input_2 = arm_nn_requantize(input_2, input_2_mult, input_2_shift);

        sum = input_1 + input_2;
        sum = arm_nn_requantize(sum, out_mult, out_shift);
        sum += out_offset;

        sum = MAX(sum, out_activation_min);
        sum = MIN(sum, out_activation_max);

        *output++ = (int8_t)sum;

        /* Decrement loop counter */
        loop_count--;
    }

#endif /* ARM_MATH_MVEI */

    return (ARM_CMSIS_NN_SUCCESS);
}

For example, if the DSP extension is present, the parallel signed 16-bit addition (SADD16) instruction provided by the extension is used to reduce the number of loop iterations by packing 8-bit signed integers into 16-bit halves of 32-bit arguments, then calculating 4 outputs in a single iteration. If the MVE extension is present, vector addition instructions (VADD) can be used directly, making the calculation even more efficient.
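
To see why the packing helps, consider what SADD16 does: it treats a 32-bit register as two independent signed 16-bit lanes and adds each lane in a single instruction. The sketch below is a scalar emulation of that behavior (ignoring the APSR.GE flags the real instruction sets); with two sign-extended 8-bit inputs packed per register, the offset addition that the scalar fallback loop performs one element at a time happens for two elements at once.

#include <cstdint>

// Emulates the lane-wise behavior of SADD16: adds the low halfwords and the
// high halfwords of x and y independently, with no carry between lanes.
static uint32_t sadd16_emulated(uint32_t x, uint32_t y) {
  const uint16_t lo = static_cast<uint16_t>(x) + static_cast<uint16_t>(y);
  const uint16_t hi =
      static_cast<uint16_t>(x >> 16) + static_cast<uint16_t>(y >> 16);
  return (static_cast<uint32_t>(hi) << 16) | lo;
}

// Packs two sign-extended int8 values into the halfword lanes of a uint32_t,
// mirroring (at half the width) what read_and_pad_reordered() does for four
// values across two registers in the CMSIS-NN kernel above.
static uint32_t pack2_s8(int8_t a, int8_t b) {
  const uint16_t lo = static_cast<uint16_t>(static_cast<int16_t>(a));
  const uint16_t hi = static_cast<uint16_t>(static_cast<int16_t>(b));
  return (static_cast<uint32_t>(hi) << 16) | lo;
}

For example, sadd16_emulated(pack2_s8(3, -5), pack2_s8(10, 7)) yields the two sums 13 and 2 in a single operation.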

These optimizations are made available via configuration when compiling tflite-micro. They can be applied to any model that uses the relevant operations, without the need to modify the model when moving from one architecture to another. Some optimizations do require modifying the model. For example, when using microcontrollers that include Arm’s Ethos-U NPUs, such as the previously mentioned Alif Ensemble E3, you can run .tflite models through the Vela compiler. Converted models replace sequences of built-in operators with a custom ETHOSU operator and a command stream. The application processor notifies the NPU of the address of the command stream and other relevant data, then triggers it to perform inference.
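
After Vela conversion, the application side changes very little: the model still loads through the same tflite-micro interpreter, and the op resolver just needs the ETHOSU custom operator registered, along with CPU kernels for any operators Vela was unable to map to the NPU. Below is a minimal sketch, assuming the resolver’s AddEthosU() helper is available in your tflite-micro build.

#include "tensorflow/lite/micro/micro_mutable_op_resolver.h"

void RegisterVelaOps(tflite::MicroMutableOpResolver<1>& resolver) {
  // Registers the ETHOSU custom operator, whose Eval() hands the embedded
  // command stream to the Ethos-U driver (shown below).
  resolver.AddEthosU();
}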

Unlike with the addition operator, there is no fallback kernel implementation; models converted by the Vela compiler cannot run on microcontrollers that lack an Ethos-U NPU. For those that have one, we can see the previously described logic in the Eval() implementation for the ETHOSU custom operator.

tensorflow/lite/micro/kernels/ethos_u/ethosu.cc

TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
  TFLITE_DCHECK(node->user_data != nullptr);
  TFLITE_DCHECK(context != nullptr);
  TFLITE_DCHECK(context->GetScratchBuffer != nullptr);

  // Get base addresses.
  TfLiteEvalTensor* tensor;
  int i = 0;
  int num_tensors = 0;
  void* cms_data;
  uint8_t co_type;
  int result;
  const OpData* data = static_cast<const OpData*>(node->user_data);
  uint64_t* base_addrs = static_cast<uint64_t*>(
      context->GetScratchBuffer(context, data->base_addr_idx));
  size_t* base_addrs_size = static_cast<size_t*>(
      context->GetScratchBuffer(context, data->base_addr_size_idx));

  const uint8_t* custom_data =
      static_cast<uint8_t const*>(node->custom_initial_data);
  auto root = flexbuffers::GetRoot(custom_data, node->custom_initial_data_size);
  co_type = root.AsInt8();
  if (co_type != CO_TYPE_ETHOSU) {
    MicroPrintf("CO_TYPE != ETHOSU");
    return kTfLiteError;
  }

  // Get command stream data address.
  tensor = context->GetEvalTensor(context, node->inputs->data[0]);
  cms_data = reinterpret_cast<void*>(tensor->data.uint8);

  // Get addresses to weights/scratch/input data.
  for (i = 1; i < node->inputs->size; ++i) {
    tensor = context->GetEvalTensor(context, node->inputs->data[i]);
    base_addrs[num_tensors] =
        static_cast<uint64_t>(reinterpret_cast<uintptr_t>(tensor->data.uint8));
    size_t byte_size = 1;
    for (int k = 0; k < tensor->dims->size; k++) {
      byte_size = byte_size * tensor->dims->data[k];
    }
    base_addrs_size[num_tensors] = byte_size;
    num_tensors++;
  }

  // Get addresses to output data.
  for (i = 0; i < node->outputs->size; ++i) {
    tensor = context->GetEvalTensor(context, node->outputs->data[i]);
    base_addrs[num_tensors] =
        static_cast<uint64_t>(reinterpret_cast<uintptr_t>(tensor->data.uint8));
    size_t byte_size = 1;
    for (int k = 0; k < tensor->dims->size; k++) {
      byte_size = byte_size * tensor->dims->data[k];
    }
    base_addrs_size[num_tensors] = byte_size;
    num_tensors++;
  }

  // When Vela optimizes a tflite file it will assign the tensors like this:
  //
  // +-------+------------------------+  +--------+-------------+
  // | INPUT | Description            |  | OUTPUT | Description |
  // +-------+------------------------+  +--------+-------------+
  // |     0 | Ethos-U command stream |  |   0..m | Outputs     |
  // |     1 | TFLM model             |  +--------+-------------+
  // |     2 | TFLM arena             |
  // |     3 | Ethos-U fast scratch   |
  // |  4..n | Inputs                 |
  // +-------+------------------------+
  //
  // This code will assign the NPU base addresses like this:
  //
  // +--------------+----------------------+
  // | Base address | Description          |
  // +--------------+----------------------+
  // |            0 | TFLM model           |
  // |            1 | TFLM arena           |
  // |            2 | Ethos-U fast scratch |
  // |         3..n | Input tensors        |
  // |         n..m | Output tensors       |
  // +--------------+----------------------+
  //
  // The number of base address will be limited to 8.
  //
  // NOTE! The command stream produced by Vela will access the IFM and OFM
  // buffers using base address 1. This means that it is not possible to point
  // the input and output tensors outside of the TFLM arena.
  num_tensors = std::min(num_tensors, 8);

  struct ethosu_driver* drv = ethosu_reserve_driver();
  result = ethosu_invoke_v3(drv, cms_data, data->cms_data_size, base_addrs,
                            base_addrs_size, num_tensors,
                            GetMicroContext(context)->external_context());
  ethosu_release_driver(drv);

  if (-1 == result) {
    return kTfLiteError;
  } else {
    return kTfLiteOk;
  }
}

We’ve now seen the full spectrum of operator optimization, from kernels that are implemented purely in C, to those that leverage hardware instructions provided in architecture extensions, and finally to those that offload inference to a wholly separate processor. In future posts, we’ll explore how operators are encoded in .tflite files, and how the runtime ultimately invokes the underlying kernels.