In our last post we explored operators and kernels in TensorFlow Lite, and how the ability to swap kernels based on the available hardware capabilities can lead to dramatic performance improvements when performing inference. We drew an analogy between operators and instruction set architectures (ISAs), and between kernels and the hardware implementations of instructions in a processor.

Just as with traditional computer programs, the sequence of instructions in a model needs to be encoded and distributed in some type of file, such as the Executable and Linkable Format (ELF) on Unix-like systems or the Portable Executable (PE) format on Windows. As mentioned in the last post, there are multiple file formats used for distributing models, each with different trade-offs. Perhaps the most significant bifurcation among model file formats is between those that include the computation graph and those that don't.

The .tflite file format used by tflite-micro, the most popular framework for performing inference on microcontrollers, falls into the former category, along with the popular ONNX format. Encoding the computation graph means that models can be distributed as a single file, and any runtime that supports the operators used in the graph can perform inference with the model. In contrast, file formats such as GGUF, Safetensors, and PyTorch's pickle format require accompanying code for each model architecture in order to perform inference.

llamafile is another interesting project in this space: it distributes models as cross-platform executables by combining llama.cpp with Justine Tunney's Cosmopolitan Libc.

To illustrate this point, we can take a look at an example of using a GGUF model with ggml, the tensor library that powers llama.cpp and whisper.cpp. YOLO ("You Only Look Once") models are a popular choice for real-time object detection. The ggml repository includes an example of object detection with YOLOv3-tiny, which requires converting the model to GGUF, loading it, and building the computation graph before performing inference. The part we are interested in for this post is the construction of the computation graph (ggml_cgraph).

examples/yolo/yolov3-tiny.cpp

static struct ggml_cgraph * build_graph(struct ggml_context * ctx_cgraph, const yolo_model & model) {
    struct ggml_cgraph * gf = ggml_new_graph(ctx_cgraph);

    struct ggml_tensor * input = ggml_new_tensor_4d(ctx_cgraph, GGML_TYPE_F32, model.width, model.height, 3, 1);
    ggml_set_name(input, "input");
    struct ggml_tensor * result = apply_conv2d(ctx_cgraph, input, model.conv2d_layers[0]);
    print_shape(0, result);
    result = ggml_pool_2d(ctx_cgraph, result, GGML_OP_POOL_MAX, 2, 2, 2, 2, 0, 0);
    print_shape(1, result);
    result = apply_conv2d(ctx_cgraph, result, model.conv2d_layers[1]);
    print_shape(2, result);
    result = ggml_pool_2d(ctx_cgraph, result, GGML_OP_POOL_MAX, 2, 2, 2, 2, 0, 0);
    print_shape(3, result);
    result = apply_conv2d(ctx_cgraph, result, model.conv2d_layers[2]);
    print_shape(4, result);
    result = ggml_pool_2d(ctx_cgraph, result, GGML_OP_POOL_MAX, 2, 2, 2, 2, 0, 0);
    print_shape(5, result);
    result = apply_conv2d(ctx_cgraph, result, model.conv2d_layers[3]);
    print_shape(6, result);
    result = ggml_pool_2d(ctx_cgraph, result, GGML_OP_POOL_MAX, 2, 2, 2, 2, 0, 0);
    print_shape(7, result);
    result = apply_conv2d(ctx_cgraph, result, model.conv2d_layers[4]);
    struct ggml_tensor * layer_8 = result;
    print_shape(8, result);
    result = ggml_pool_2d(ctx_cgraph, result, GGML_OP_POOL_MAX, 2, 2, 2, 2, 0, 0);
    print_shape(9, result);
    result = apply_conv2d(ctx_cgraph, result, model.conv2d_layers[5]);
    print_shape(10, result);
    result = ggml_pool_2d(ctx_cgraph, result, GGML_OP_POOL_MAX, 2, 2, 1, 1, 0.5, 0.5);
    print_shape(11, result);
    result = apply_conv2d(ctx_cgraph, result, model.conv2d_layers[6]);
    print_shape(12, result);
    result = apply_conv2d(ctx_cgraph, result, model.conv2d_layers[7]);
    struct ggml_tensor * layer_13 = result;
    print_shape(13, result);
    result = apply_conv2d(ctx_cgraph, result, model.conv2d_layers[8]);
    print_shape(14, result);
    result = apply_conv2d(ctx_cgraph, result, model.conv2d_layers[9]);
    struct ggml_tensor * layer_15 = result;
    ggml_set_output(layer_15);
    ggml_set_name(layer_15, "layer_15");

    print_shape(15, result);
    result = apply_conv2d(ctx_cgraph, layer_13, model.conv2d_layers[10]);
    print_shape(18, result);
    result = ggml_upscale(ctx_cgraph, result, 2, GGML_SCALE_MODE_NEAREST);
    print_shape(19, result);
    result = ggml_concat(ctx_cgraph, result, layer_8, 2);
    print_shape(20, result);
    result = apply_conv2d(ctx_cgraph, result, model.conv2d_layers[11]);
    print_shape(21, result);
    result = apply_conv2d(ctx_cgraph, result, model.conv2d_layers[12]);
    struct ggml_tensor * layer_22 = result;
    ggml_set_output(layer_22);
    ggml_set_name(layer_22, "layer_22");
    print_shape(22, result);

    ggml_build_forward_expand(gf, layer_15);
    ggml_build_forward_expand(gf, layer_22);
    return gf;
}
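
Once the graph has been built, inference amounts to filling the input tensor and evaluating the graph. Here is a minimal CPU-only sketch, assuming the image has already been preprocessed into a hypothetical image_data buffer (the actual example drives this through ggml's backend API and applies YOLO-specific post-processing to the two output layers):

struct ggml_cgraph * gf = build_graph(ctx_cgraph, model);

// Copy the preprocessed image into the input tensor named in build_graph().
// image_data is assumed to hold width * height * 3 floats.
struct ggml_tensor * input = ggml_graph_get_tensor(gf, "input");
memcpy(input->data, image_data, ggml_nbytes(input));

// Evaluate every node in the graph on the CPU.
ggml_graph_compute_with_ctx(ctx_cgraph, gf, /*n_threads=*/4);

// The two detection heads can now be read back by the names set above.
struct ggml_tensor * layer_15 = ggml_graph_get_tensor(gf, "layer_15");
struct ggml_tensor * layer_22 = ggml_graph_get_tensor(gf, "layer_22");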

Each node in the graph has to be defined explicitly, with the relevant weights from the model passed in, before inference can be performed. We can compare this to the tflite-micro person detection example. That example uses a different computer vision model, MobileNetV1, which is better suited to constrained environments than YOLO models. However, since both models are convolutional neural networks (CNNs), the operations used will be familiar even though the architectures differ. The setup() function performs the steps needed to load the model prior to performing inference.

tensorflow/lite/micro/examples/person_detection/main_functions.cc

void setup() {
  tflite::InitializeTarget();

  // Map the model into a usable data structure. This doesn't involve any
  // copying or parsing, it's a very lightweight operation.
  model = tflite::GetModel(g_person_detect_model_data);
  if (model->version() != TFLITE_SCHEMA_VERSION) {
    MicroPrintf(
        "Model provided is schema version %d not equal "
        "to supported version %d.",
        model->version(), TFLITE_SCHEMA_VERSION);
    return;
  }

  // Pull in only the operation implementations we need.
  // This relies on a complete list of all the ops needed by this graph.

  // NOLINTNEXTLINE(runtime-global-variables)
  static tflite::MicroMutableOpResolver<5> micro_op_resolver;
  micro_op_resolver.AddAveragePool2D(tflite::Register_AVERAGE_POOL_2D_INT8());
  micro_op_resolver.AddConv2D(tflite::Register_CONV_2D_INT8());
  micro_op_resolver.AddDepthwiseConv2D(
      tflite::Register_DEPTHWISE_CONV_2D_INT8());
  micro_op_resolver.AddReshape();
  micro_op_resolver.AddSoftmax(tflite::Register_SOFTMAX_INT8());

  // Build an interpreter to run the model with.
  // NOLINTNEXTLINE(runtime-global-variables)
  static tflite::MicroInterpreter static_interpreter(
      model, micro_op_resolver, tensor_arena, kTensorArenaSize);
  interpreter = &static_interpreter;

  // Allocate memory from the tensor_arena for the model's tensors.
  TfLiteStatus allocate_status = interpreter->AllocateTensors();
  if (allocate_status != kTfLiteOk) {
    MicroPrintf("AllocateTensors() failed");
    return;
  }

  // Get information about the memory area to use for the model's input.
  input = interpreter->input(0);
}

Rather than defining each node in the model, we only need to register the operators that the model uses (AveragePool2D, Conv2D, and so on). The sequence of these operators, along with their parameters, is defined in the model itself, which is provided as a .tflite file and converted to a C array for inclusion in the compiled program. This is necessary because many firmware applications targeting microcontrollers will not include a filesystem.
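
The conversion to a C array can be done with a standard tool such as xxd (the array checked into the tflite-micro repository is additionally renamed and aligned, but the approach is the same):

xxd -i person_detect.tflite > person_detect_model_data.cc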

In the last post, we mentioned that the .tflite format is based on FlatBuffers, an efficient binary encoding format commonly used as an alternative to formats such as Protobuf and JSON. FlatBuffers provides an Interface Definition Language (IDL) for defining a schema, and looking at the .tflite schema gives us an idea of the information encoded in these files. FlatBuffers objects are defined as tables with fields, and the .tflite schema, as expected, defines Model as the root type.

tensorflow/compiler/mlir/lite/schema/schema.fbs

table Model {
  // Version of the schema.
  version:uint;

  // A list of all operator codes used in this model. This is
  // kept in order because operators carry an index into this
  // vector.
  operator_codes:[OperatorCode];

  // All the subgraphs of the model. The 0th is assumed to be the main
  // model.
  subgraphs:[SubGraph];

  // A description of the model.
  description:string;

  // Buffers of the model.
  // Note the 0th entry of this array must be an empty buffer (sentinel).
  // This is a convention so that tensors without a buffer can provide 0 as
  // their buffer.
  buffers:[Buffer];

  // Metadata about the model. Indirects into the existings buffers list.
  // Deprecated, prefer to use metadata field.
  metadata_buffer:[int];

  // Metadata about the model.
  metadata:[Metadata];

  // Optional SignatureDefs for the model.
  signature_defs:[SignatureDef];
}

root_type Model;

The operator_codes define the operators that must be registered to perform inference with this model. In the setup() code we saw earlier, we registered five operators, so we should expect to see the same count in the person_detect.tflite file. The OperatorCode format is also defined in the schema.

tensorflow/compiler/mlir/lite/schema/schema.fbs

// An OperatorCode can be an enum value (BuiltinOperator) if the operator is a
// builtin, or a string if the operator is custom.
table OperatorCode {
  // This field is for backward compatibility. This field will be used when
  // the value of the extended builtin_code field has less than
  // BulitinOperator_PLACEHOLDER_FOR_GREATER_OP_CODES.
  deprecated_builtin_code:byte;
  custom_code:string;

  // The version of the operator. The version need to be bumped whenever new
  // parameters are introduced into an op.
  version:int = 1;

  // This field is introduced for resolving op builtin code shortage problem
  // (the original BuiltinOperator enum field was represented as a byte).
  // This field will be used when the value of the extended builtin_code field
  // has greater than BulitinOperator_PLACEHOLDER_FOR_GREATER_OP_CODES.
  builtin_code:BuiltinOperator;
}
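
The interplay between these two fields is handled by a small helper, GetBuiltinCode(), defined in TensorFlow's schema_utils. Its logic is essentially the following (a sketch written against the flatc-generated accessors):

#include <algorithm>

// Older models store the operator code in the byte-sized deprecated field;
// codes too large for a byte are stored in the extended builtin_code field
// instead. Taking the larger of the two values handles both cases.
tflite::BuiltinOperator GetBuiltinCode(const tflite::OperatorCode* op_code) {
  return std::max(
      op_code->builtin_code(),
      static_cast<tflite::BuiltinOperator>(
          op_code->deprecated_builtin_code()));
}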

To more easily extract information from the model file, we can use flatc, the FlatBuffers compiler, to convert the file into JSON.

flatc --json --raw-binary schema.fbs -- person_detect.tflite

At the top of the file, we can see the list of operator codes, which correspond to entries in the BuiltinOperator enum.

{
  version: 3,
  operator_codes: [
    {
      deprecated_builtin_code: 1,
      version: 2
    },
    {
      deprecated_builtin_code: 3,
      version: 2
    },
    {
      deprecated_builtin_code: 4,
      version: 3
    },
    {
      deprecated_builtin_code: 22
    },
    {
      deprecated_builtin_code: 25,
      version: 2
    }
  ],
...
}

Sure enough, AVERAGE_POOL_2D (1), CONV_2D (3), DEPTHWISE_CONV_2D (4), RESHAPE (22), and SOFTMAX (25) are all listed.

tensorflow/compiler/mlir/lite/schema/schema.fbs

// A list of builtin operators. Builtin operators are slightly faster than custom
// ones, but not by much. Moreover, while custom operators accept an opaque
// object containing configuration parameters, builtins have a predetermined
// set of acceptable options.
// LINT.IfChange
enum BuiltinOperator : int32 {
  ADD = 0,
  AVERAGE_POOL_2D = 1,
  CONCATENATION = 2,
  CONV_2D = 3,
  DEPTHWISE_CONV_2D = 4,
  ...
  RESHAPE = 22,
  RESIZE_BILINEAR = 23,
  RNN = 24,
  SOFTMAX = 25,
  ...
}

The nodes that use these operators are defined in the subgraphs. For MobileNetV1, there is only a single subgraph.

tensorflow/compiler/mlir/lite/schema/schema.fbs

// The root type, defining a subgraph, which typically represents an entire
// model.
table SubGraph {
  // A list of all tensors used in this subgraph.
  tensors:[Tensor];

  // Indices of the tensors that are inputs into this subgraph. Note this is
  // the list of non-static tensors that feed into the subgraph for inference.
  inputs:[int];

  // Indices of the tensors that are outputs out of this subgraph. Note this is
  // the list of output tensors that are considered the product of the
  // subgraph's inference.
  outputs:[int];

  // All operators, in execution order.
  operators:[Operator];

  // Name of this subgraph (used for debugging).
  name:string;

  // Index into subgraphs_debug_metadata list.
  debug_metadata_index: int = -1;
}

The vast majority of the .tflite file is dedicated to the weights (stored in buffers and described by tensors), but given that there is only a single subgraph for MobileNetV1, the operators array defines all of the nodes in the model's compute graph. The inputs and outputs of each operator are indices into the tensors array. The nodes execute sequentially, as evidenced by the fact that the output tensor index of each operator is supplied as one of the inputs to the next.

      operators: [
        {
          opcode_index: 2,
          inputs: [
            88,
            0,
            33
          ],
          outputs: [
            34
          ],
          builtin_options_type: "DepthwiseConv2DOptions",
          builtin_options: {
            stride_w: 2,
            stride_h: 2,
            depth_multiplier: 8,
            fused_activation_function: "RELU6"
          }
        },
        {
          opcode_index: 2,
          inputs: [
            34,
            9,
            52
          ],
          outputs: [
            51
          ],
          builtin_options_type: "DepthwiseConv2DOptions",
          builtin_options: {
            stride_w: 1,
            stride_h: 1,
            depth_multiplier: 1,
            fused_activation_function: "RELU6"
          }
        },
        {
          opcode_index: 1,
          inputs: [
            51,
            10,
            53
          ],
          outputs: [
            54
          ],
          builtin_options_type: "Conv2DOptions",
          builtin_options: {
            stride_w: 1,
            stride_h: 1,
            fused_activation_function: "RELU6"
          }
        },
        ...
        {
          opcode_index: 3,
          inputs: [
            28,
            32
          ],
          outputs: [
            31
          ],
          builtin_options_type: "ReshapeOptions",
          builtin_options: {
            new_shape: [
              1,
              2
            ]
          }
        },
        {
          opcode_index: 4,
          inputs: [
            31
          ],
          outputs: [
            87
          ],
          builtin_options_type: "SoftmaxOptions",
          builtin_options: {
            beta: 1.0
          }
        }
      ]
    }
  ]

If we look at the third operator, the opcode_index is 1, which points to the second entry in the operator_codes array (deprecated_builtin_code 3): the CONV_2D operator. This node is effectively the same as one of the apply_conv2d() calls in the ggml example. In that case, the weights were loaded from the GGUF file and passed directly to the apply_conv2d() function in the form of model.conv2d_layers[1].
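
Since a .tflite file is just a FlatBuffer, we don't need the tflite-micro runtime to inspect it programmatically. Here is a short sketch that walks the main subgraph in execution order, assuming a schema_generated.h produced by flatc --cpp and the model bytes already in memory (for instance, the same C array from earlier):

#include <cstdio>

#include "schema_generated.h"  // generated with: flatc --cpp schema.fbs

void print_graph(const void* model_data) {
  const tflite::Model* model = tflite::GetModel(model_data);
  // The 0th subgraph is assumed to be the main model.
  const tflite::SubGraph* graph = model->subgraphs()->Get(0);
  for (const tflite::Operator* op : *graph->operators()) {
    std::printf("opcode_index=%u input0=%d output0=%d\n",
                op->opcode_index(),
                op->inputs()->Get(0),
                op->outputs()->Get(0));
  }
}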

At the end of the last post we took a look at a custom operator (ETHOSU), which is used on microcontrollers with Arm's Ethos-U Neural Processing Units (NPUs). We also mentioned that taking advantage of the performance improvements offered by the NPU requires passing a .tflite model through the Vela compiler, but we stopped short of exploring the result of doing so.
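
For reference, Vela is an ordinary command-line tool distributed as a Python package. An invocation along the following lines compiles the model for a particular NPU configuration (the exact flags vary between Vela versions, and the accelerator configuration must match the target hardware):

vela --accelerator-config ethos-u55-256 person_detect.tflite

By default, Vela writes the optimized model, person_detect_vela.tflite in this case, to an output directory.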

The network tester example uses person_detect_vela.tflite, which is the same MobileNetV1 model used in the person detection example, but passed through the Vela compiler. When the NPU is available, the single ETHOSU operator is registered.

tensorflow/lite/micro/examples/network_tester/network_tester_test.cc

#ifdef ETHOS_U
  const tflite::Model* model = ::tflite::GetModel(g_person_detect_model_data);
#else
  const tflite::Model* model = ::tflite::GetModel(network_model);
#endif
  if (model->version() != TFLITE_SCHEMA_VERSION) {
    MicroPrintf(
        "Model provided is schema version %d not equal "
        "to supported version %d.\n",
        model->version(), TFLITE_SCHEMA_VERSION);
    return kTfLiteError;
  }
#ifdef ETHOS_U
  tflite::MicroMutableOpResolver<1> resolver;
  resolver.AddEthosU();

#else
  tflite::MicroMutableOpResolver<5> resolver;
  resolver.AddAveragePool2D(tflite::Register_AVERAGE_POOL_2D_INT8());
  resolver.AddConv2D(tflite::Register_CONV_2D_INT8());
  resolver.AddDepthwiseConv2D(tflite::Register_DEPTHWISE_CONV_2D_INT8());
  resolver.AddReshape();
  resolver.AddSoftmax(tflite::Register_SOFTMAX_INT8());

#endif

The remainder of the application looks largely the same regardless of whether the NPU is used. To see how this manifests in the model, we can once again use flatc to convert it to JSON.

flatc --json --raw-binary schema.fbs -- person_detect_vela.tflite

The first thing to notice is that there is, as expected, a single operator listed in the operator_codes array.

{
  version: 3,
  operator_codes: [
    {
      deprecated_builtin_code: 32,
      custom_code: "ethos-u",
      builtin_code: "CUSTOM"
    }
  ],
...
}

Perhaps slightly more surprising is that there is also just a single node in the model.

      operators: [
        {
          inputs: [
            1,
            2,
            3,
            4,
            5
          ],
          outputs: [
            0
          ],
          custom_options: [
            1,
            4,
            1
          ],
          mutating_variable_inputs: [

          ],
          intermediates: [

          ]
        }
      ]

How could the same model be reduced to such a simplified form? When we consider that the ETHOSU kernel effectively just points the NPU at the location of commands and data in memory, it makes sense that the work done directly by the tflite-micro runtime is minimal. However, all of the data that is passed to the NPU still needs to be available. The top-level buffers array in the Model table contains all of the relevant data, and the tensors passed as inputs to the ETHOSU operator reference the appropriate buffer by index.

      tensors: [
        {
          shape: [
            1,
            2
          ],
          type: "INT8",
          name: "MobilenetV1/Predictions/Reshape_1",
          quantization: {
            scale: [
              0.003906
            ],
            zero_point: [
              -128
            ]
          }
        },
        {
          shape: [
            8168
          ],
          type: "UINT8",
          buffer: 1,
          name: "_split_1_command_stream"
        },
        {
          shape: [
            260480
          ],
          type: "UINT8",
          buffer: 2,
          name: "_split_1_flash"
        },
        {
          shape: [
            74464
          ],
          type: "UINT8",
          name: "_split_1_scratch"
        },
        {
          shape: [
            74464
          ],
          type: "UINT8",
          name: "_split_1_scratch_fast"
        },
        {
          shape: [
            1,
            96,
            96,
            1
          ],
          type: "INT8",
          name: "input",
          quantization: {
            min: [
              -1.0
            ],
            max: [
              1.0
            ],
            scale: [
              0.007843
            ],
            zero_point: [
              -1
            ]
          }
        }
      ]
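
To make the division of labor concrete, here is a deliberately simplified sketch of what an ETHOSU-style eval step amounts to. The npu_run() function is a hypothetical stand-in for the entry point of Arm's NPU driver, and the real kernel (tensorflow/lite/micro/kernels/ethos_u/ethosu.cc) handles considerably more detail:

#include <cstdint>

#include "tensorflow/lite/micro/kernels/kernel_util.h"

// Hypothetical stand-in for the NPU driver's invoke entry point.
extern "C" int npu_run(const char* cmd_stream, const uint64_t* base_addrs,
                       int num_base_addrs);

TfLiteStatus EthosUEvalSketch(TfLiteContext* context, TfLiteNode* node) {
  // Input 0 is the Vela-generated command stream ("_split_1_command_stream").
  const TfLiteEvalTensor* cms = tflite::micro::GetEvalInput(context, node, 0);

  // The remaining inputs (flash, scratch, activations) and the outputs
  // provide the base addresses that commands in the stream reference.
  uint64_t base_addrs[8] = {0};
  int num = 0;
  for (int i = 1; i < node->inputs->size; ++i) {
    base_addrs[num++] = reinterpret_cast<uintptr_t>(
        tflite::micro::GetEvalInput(context, node, i)->data.raw);
  }
  for (int i = 0; i < node->outputs->size; ++i) {
    base_addrs[num++] = reinterpret_cast<uintptr_t>(
        tflite::micro::GetEvalOutput(context, node, i)->data.raw);
  }

  // The CPU's involvement ends here; the NPU walks the command stream itself.
  return npu_run(cms->data.raw, base_addrs, num) == 0 ? kTfLiteOk
                                                      : kTfLiteError;
}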

This example demonstrates the flexibility of the .tflite model format, as well as the general portability offered by formats that include the compute graph in the model file. We'll continue exploring the trade-offs offered by formats and runtimes, and dive deeper into how operator registration and kernel execution work, in future posts.