In our last post we explored operators and kernels in TensorFlow Lite, and how the ability to swap out kernels depending on the hardware capabilities available can lead to dramatic performance improvements when performing inference. We drew an analogy between operators and instruction set architectures (ISAs), and between kernels and the hardware implementations of instructions in a processor.
Just as traditional computer programs are encoded and distributed in files such as the Executable and Linkable Format (ELF) on Unix-based systems or the Portable Executable (PE) format on Windows, the sequence of operations in a model needs to be encoded and distributed in some type of file. As mentioned in the last post, there are multiple file formats used for distributing models, each with different trade-offs. Perhaps the most significant bifurcation of model file formats is between those that include the computation graph and those that don't.
The .tflite file format used by tflite-micro, the most popular framework for performing inference on microcontrollers, falls into the former category, along with the popular ONNX format. Encoding the computation graph means that models can be distributed as a single file, and any runtime that supports the operators used in the compute graph can perform inference with the model. In contrast, file formats such as GGUF, Safetensors, and PyTorch's pickle format require accompanying code to be distributed in order to perform inference with each model architecture.
llamafile is another interesting project, which distributes models as cross-platform executables by combining llama.cpp and Justine Tunney's Cosmopolitan Libc.
To illustrate this point, we can take a look at an example of using a GGUF model with ggml, the tensor library that powers llama.cpp and whisper.cpp. YOLO ("You Only Look Once") models are a popular choice for real-time object detection. ggml includes an example showing how to perform object detection with YOLOv3-tiny, which requires converting the model to GGUF, loading it, and building the computation graph before performing inference. The part we are interested in for this post is the construction of the computation graph (ggml_cgraph).
static struct ggml_cgraph * build_graph(struct ggml_context * ctx_cgraph, const yolo_model & model) {
    struct ggml_cgraph * gf = ggml_new_graph(ctx_cgraph);

    struct ggml_tensor * input = ggml_new_tensor_4d(ctx_cgraph, GGML_TYPE_F32, model.width, model.height, 3, 1);
    ggml_set_name(input, "input");

    struct ggml_tensor * result = apply_conv2d(ctx_cgraph, input, model.conv2d_layers[0]);
    print_shape(0, result);
    result = ggml_pool_2d(ctx_cgraph, result, GGML_OP_POOL_MAX, 2, 2, 2, 2, 0, 0);
    print_shape(1, result);
    result = apply_conv2d(ctx_cgraph, result, model.conv2d_layers[1]);
    print_shape(2, result);
    result = ggml_pool_2d(ctx_cgraph, result, GGML_OP_POOL_MAX, 2, 2, 2, 2, 0, 0);
    print_shape(3, result);
    result = apply_conv2d(ctx_cgraph, result, model.conv2d_layers[2]);
    print_shape(4, result);
    result = ggml_pool_2d(ctx_cgraph, result, GGML_OP_POOL_MAX, 2, 2, 2, 2, 0, 0);
    print_shape(5, result);
    result = apply_conv2d(ctx_cgraph, result, model.conv2d_layers[3]);
    print_shape(6, result);
    result = ggml_pool_2d(ctx_cgraph, result, GGML_OP_POOL_MAX, 2, 2, 2, 2, 0, 0);
    print_shape(7, result);
    result = apply_conv2d(ctx_cgraph, result, model.conv2d_layers[4]);
    struct ggml_tensor * layer_8 = result;
    print_shape(8, result);
    result = ggml_pool_2d(ctx_cgraph, result, GGML_OP_POOL_MAX, 2, 2, 2, 2, 0, 0);
    print_shape(9, result);
    result = apply_conv2d(ctx_cgraph, result, model.conv2d_layers[5]);
    print_shape(10, result);
    result = ggml_pool_2d(ctx_cgraph, result, GGML_OP_POOL_MAX, 2, 2, 1, 1, 0.5, 0.5);
    print_shape(11, result);
    result = apply_conv2d(ctx_cgraph, result, model.conv2d_layers[6]);
    print_shape(12, result);
    result = apply_conv2d(ctx_cgraph, result, model.conv2d_layers[7]);
    struct ggml_tensor * layer_13 = result;
    print_shape(13, result);
    result = apply_conv2d(ctx_cgraph, result, model.conv2d_layers[8]);
    print_shape(14, result);
    result = apply_conv2d(ctx_cgraph, result, model.conv2d_layers[9]);
    struct ggml_tensor * layer_15 = result;
    ggml_set_output(layer_15);
    ggml_set_name(layer_15, "layer_15");
    print_shape(15, result);

    result = apply_conv2d(ctx_cgraph, layer_13, model.conv2d_layers[10]);
    print_shape(18, result);
    result = ggml_upscale(ctx_cgraph, result, 2, GGML_SCALE_MODE_NEAREST);
    print_shape(19, result);
    result = ggml_concat(ctx_cgraph, result, layer_8, 2);
    print_shape(20, result);
    result = apply_conv2d(ctx_cgraph, result, model.conv2d_layers[11]);
    print_shape(21, result);
    result = apply_conv2d(ctx_cgraph, result, model.conv2d_layers[12]);
    struct ggml_tensor * layer_22 = result;
    ggml_set_output(layer_22);
    ggml_set_name(layer_22, "layer_22");
    print_shape(22, result);

    ggml_build_forward_expand(gf, layer_15);
    ggml_build_forward_expand(gf, layer_22);
    return gf;
}
We can see how each node in the graph has to be defined, passing the relevant weights from the model, prior to performing inference. Building the graph is also only part of the work: the graph's tensors still need to be allocated and the graph executed, roughly as sketched below.
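The following sketch is not part of the example; it assumes ggml's backend and graph-allocator APIs (which vary slightly between ggml versions), and run_graph() and its arguments are made-up names for illustration.

// A rough sketch of executing the graph built by build_graph(); not the
// example's actual code. Assumes ggml's backend and graph-allocator APIs
// (CPU backend shown) and a caller-supplied, preprocessed image buffer.
#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"

static void run_graph(struct ggml_cgraph * gf, const float * image_data, size_t image_bytes) {
    // Allocate backend buffers for every tensor in the graph.
    ggml_backend_t backend = ggml_backend_cpu_init();
    ggml_gallocr_t allocr  = ggml_gallocr_new(ggml_backend_get_default_buffer_type(backend));
    ggml_gallocr_alloc_graph(allocr, gf);

    // Copy the preprocessed image into the tensor named "input" in build_graph().
    struct ggml_tensor * input = ggml_graph_get_tensor(gf, "input");
    ggml_backend_tensor_set(input, image_data, 0, image_bytes);

    // Execute every node in the graph, then fetch the two detection heads by name.
    ggml_backend_graph_compute(backend, gf);
    struct ggml_tensor * layer_15 = ggml_graph_get_tensor(gf, "layer_15");
    struct ggml_tensor * layer_22 = ggml_graph_get_tensor(gf, "layer_22");
    // ... decode YOLO detections from layer_15 and layer_22 ...
    (void) layer_15; (void) layer_22;

    ggml_gallocr_free(allocr);
    ggml_backend_free(backend);
}

The key point is that the structure of the network lives in the code that builds the graph, while the GGUF file supplies only the tensors.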
We can compare this to the tflite-micro person detection example. The example uses a different computer vision model, MobileNetV1, which is better suited to constrained environments than YOLO models. However, with both models being Convolutional Neural Networks (CNNs), the operations used will be familiar even if the architectures differ. The setup() function performs the steps needed to load the model prior to performing inference.
tensorflow/lite/micro/examples/person_detection/main_functions.cc
void setup() {
  tflite::InitializeTarget();

  // Map the model into a usable data structure. This doesn't involve any
  // copying or parsing, it's a very lightweight operation.
  model = tflite::GetModel(g_person_detect_model_data);
  if (model->version() != TFLITE_SCHEMA_VERSION) {
    MicroPrintf(
        "Model provided is schema version %d not equal "
        "to supported version %d.",
        model->version(), TFLITE_SCHEMA_VERSION);
    return;
  }

  // Pull in only the operation implementations we need.
  // This relies on a complete list of all the ops needed by this graph.
  // NOLINTNEXTLINE(runtime-global-variables)
  static tflite::MicroMutableOpResolver<5> micro_op_resolver;
  micro_op_resolver.AddAveragePool2D(tflite::Register_AVERAGE_POOL_2D_INT8());
  micro_op_resolver.AddConv2D(tflite::Register_CONV_2D_INT8());
  micro_op_resolver.AddDepthwiseConv2D(
      tflite::Register_DEPTHWISE_CONV_2D_INT8());
  micro_op_resolver.AddReshape();
  micro_op_resolver.AddSoftmax(tflite::Register_SOFTMAX_INT8());

  // Build an interpreter to run the model with.
  // NOLINTNEXTLINE(runtime-global-variables)
  static tflite::MicroInterpreter static_interpreter(
      model, micro_op_resolver, tensor_arena, kTensorArenaSize);
  interpreter = &static_interpreter;

  // Allocate memory from the tensor_arena for the model's tensors.
  TfLiteStatus allocate_status = interpreter->AllocateTensors();
  if (allocate_status != kTfLiteOk) {
    MicroPrintf("AllocateTensors() failed");
    return;
  }

  // Get information about the memory area to use for the model's input.
  input = interpreter->input(0);
}
Rather than defining each node in the model, we only need to register the operators that the model uses (AveragePool2D, Conv2D, and so on). The sequence of these operators, along with their parameters, is defined in the model itself, which is provided as a .tflite file and converted to a C array for inclusion in the compiled program. This is necessary because many firmware applications targeting microcontrollers do not include a filesystem.
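Once setup() has run, performing inference is just a matter of filling the input tensor and invoking the interpreter. The following is a condensed sketch rather than the example's actual loop() function: the fill_input() helper and the output class indices are stand-ins, since the real example reads frames from a camera image provider and defines named constants for the classes.

// A condensed sketch of the inference step that follows setup(); not the
// example's actual loop() function. fill_input() is a hypothetical helper
// standing in for the example's camera image provider.
void loop() {
  // Copy an int8-quantized 96x96 grayscale image into the input tensor.
  fill_input(input->data.int8, input->bytes);

  // Walk the operators encoded in the .tflite file, dispatching each one to
  // the kernels registered with the op resolver in setup().
  if (interpreter->Invoke() != kTfLiteOk) {
    MicroPrintf("Invoke() failed");
    return;
  }

  // The output tensor holds a quantized score per class (indices assumed here).
  TfLiteTensor* output = interpreter->output(0);
  int8_t person_score = output->data.int8[1];
  int8_t no_person_score = output->data.int8[0];
  MicroPrintf("person: %d, no person: %d", person_score, no_person_score);
}

Everything about which operators run, in what order, and with what parameters comes from the model file itself.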
In the last post, we mentioned that the .tflite format is based on FlatBuffers, an efficient binary encoding format commonly used as an alternative to formats such as Protobuf and JSON. FlatBuffers provides an Interface Definition Language (IDL) for defining a schema, and looking at the .tflite schema gives us an idea of the information encoded in the files. FlatBuffers objects are defined as tables with fields, and the .tflite schema, as expected, defines Model as the root type.
tensorflow/compiler/mlir/lite/schema/schema.fbs
table Model {
  // Version of the schema.
  version:uint;

  // A list of all operator codes used in this model. This is
  // kept in order because operators carry an index into this
  // vector.
  operator_codes:[OperatorCode];

  // All the subgraphs of the model. The 0th is assumed to be the main
  // model.
  subgraphs:[SubGraph];

  // A description of the model.
  description:string;

  // Buffers of the model.
  // Note the 0th entry of this array must be an empty buffer (sentinel).
  // This is a convention so that tensors without a buffer can provide 0 as
  // their buffer.
  buffers:[Buffer];

  // Metadata about the model. Indirects into the existings buffers list.
  // Deprecated, prefer to use metadata field.
  metadata_buffer:[int];

  // Metadata about the model.
  metadata:[Metadata];

  // Optional SignatureDefs for the model.
  signature_defs:[SignatureDef];
}

root_type Model;
The operator_codes define the operators that must be registered to perform inference with this model. In the setup() code we saw earlier, we registered five operators, so we should expect to see the same count in the person_detect.tflite file. The OperatorCode format is also defined in the schema.
tensorflow/compiler/mlir/lite/schema/schema.fbs
// An OperatorCode can be an enum value (BuiltinOperator) if the operator is a
// builtin, or a string if the operator is custom.
table OperatorCode {
  // This field is for backward compatibility. This field will be used when
  // the value of the extended builtin_code field has less than
  // BulitinOperator_PLACEHOLDER_FOR_GREATER_OP_CODES.
  deprecated_builtin_code:byte;
  custom_code:string;

  // The version of the operator. The version need to be bumped whenever new
  // parameters are introduced into an op.
  version:int = 1;

  // This field is introduced for resolving op builtin code shortage problem
  // (the original BuiltinOperator enum field was represented as a byte).
  // This field will be used when the value of the extended builtin_code field
  // has greater than BulitinOperator_PLACEHOLDER_FOR_GREATER_OP_CODES.
  builtin_code:BuiltinOperator;
}
To more easily extract information from the model file, we can use flatc, the FlatBuffers compiler, to convert the file into JSON.
flatc --json --raw-binary schema.fbs -- person_detect.tflite
At the top of the file, we can see the list of the operator codes, which correspond to entries in the BuiltinOperator enum.
{
  version: 3,
  operator_codes: [
    {
      deprecated_builtin_code: 1,
      version: 2
    },
    {
      deprecated_builtin_code: 3,
      version: 2
    },
    {
      deprecated_builtin_code: 4,
      version: 3
    },
    {
      deprecated_builtin_code: 22
    },
    {
      deprecated_builtin_code: 25,
      version: 2
    }
  ],
  ...
}
Sure enough, AVERAGE_POOL_2D (1), CONV_2D (3), DEPTHWISE_CONV_2D (4), RESHAPE (22), and SOFTMAX (25) are all listed.
tensorflow/compiler/mlir/lite/schema/schema.fbs
// A list of builtin operators. Builtin operators are slightly faster than custom
// ones, but not by much. Moreover, while custom operators accept an opaque
// object containing configuration parameters, builtins have a predetermined
// set of acceptable options.
// LINT.IfChange
enum BuiltinOperator : int32 {
  ADD = 0,
  AVERAGE_POOL_2D = 1,
  CONCATENATION = 2,
  CONV_2D = 3,
  DEPTHWISE_CONV_2D = 4,
  ...
  RESHAPE = 22,
  RESIZE_BILINEAR = 23,
  RNN = 24,
  SOFTMAX = 25,
  ...
}
The nodes that use these operators are defined in the subgraphs. For MobileNetV1, there is only a single subgraph.
tensorflow/compiler/mlir/lite/schema/schema.fbs
// The root type, defining a subgraph, which typically represents an entire
// model.
table SubGraph {
  // A list of all tensors used in this subgraph.
  tensors:[Tensor];

  // Indices of the tensors that are inputs into this subgraph. Note this is
  // the list of non-static tensors that feed into the subgraph for inference.
  inputs:[int];

  // Indices of the tensors that are outputs out of this subgraph. Note this is
  // the list of output tensors that are considered the product of the
  // subgraph's inference.
  outputs:[int];

  // All operators, in execution order.
  operators:[Operator];

  // Name of this subgraph (used for debugging).
  name:string;

  // Index into subgraphs_debug_metadata list.
  debug_metadata_index: int = -1;
}
The vast majority of the .tflite file is dedicated to the weights (tensors), but given that there is only a single subgraph for MobileNetV1, the array of operators defines all of the nodes in the compute graph for the model. The inputs and outputs for each operator correspond to tensors in the tensors array. All nodes are executed sequentially, as evidenced by the fact that the index of the output tensor from the previous operator is always supplied as one of the inputs to the next operator.
    operators: [
      {
        opcode_index: 2,
        inputs: [
          88,
          0,
          33
        ],
        outputs: [
          34
        ],
        builtin_options_type: "DepthwiseConv2DOptions",
        builtin_options: {
          stride_w: 2,
          stride_h: 2,
          depth_multiplier: 8,
          fused_activation_function: "RELU6"
        }
      },
      {
        opcode_index: 2,
        inputs: [
          34,
          9,
          52
        ],
        outputs: [
          51
        ],
        builtin_options_type: "DepthwiseConv2DOptions",
        builtin_options: {
          stride_w: 1,
          stride_h: 1,
          depth_multiplier: 1,
          fused_activation_function: "RELU6"
        }
      },
      {
        opcode_index: 1,
        inputs: [
          51,
          10,
          53
        ],
        outputs: [
          54
        ],
        builtin_options_type: "Conv2DOptions",
        builtin_options: {
          stride_w: 1,
          stride_h: 1,
          fused_activation_function: "RELU6"
        }
      },
      ...
      {
        opcode_index: 3,
        inputs: [
          28,
          32
        ],
        outputs: [
          31
        ],
        builtin_options_type: "ReshapeOptions",
        builtin_options: {
          new_shape: [
            1,
            2
          ]
        }
      },
      {
        opcode_index: 4,
        inputs: [
          31
        ],
        outputs: [
          87
        ],
        builtin_options_type: "SoftmaxOptions",
        builtin_options: {
          beta: 1.0
        }
      }
    ]
  }
]
If we look at the third operator, the opcode_index is 1, which corresponds to the CONV_2D operator. This node is effectively the same as one of the apply_conv2d() calls in the ggml example. In that case, the input weights were loaded from the GGUF file and passed directly to the apply_conv2d() function in the form of model.conv2d_layers[1].
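To make the comparison concrete, a helper like apply_conv2d() ultimately just chains ggml's built-in operators over tensors loaded from the GGUF file. The simplified sketch below is not the example's actual helper (which also handles batch normalization and configurable padding); the conv2d_block() name and the assumed bias shape are illustrative.

// A simplified stand-in for a helper like apply_conv2d(); the real example
// also applies batch normalization and configurable padding.
static struct ggml_tensor * conv2d_block(struct ggml_context * ctx,
                                         struct ggml_tensor * input,
                                         struct ggml_tensor * weights,   // loaded from the GGUF file
                                         struct ggml_tensor * biases) {  // assumed shape {1, 1, out_channels}
    // 2D convolution: stride 1, padding 1, no dilation.
    struct ggml_tensor * result = ggml_conv_2d(ctx, weights, input, 1, 1, 1, 1, 1, 1);
    // Broadcast the per-channel biases across the spatial dimensions and add them.
    result = ggml_add(ctx, result, ggml_repeat(ctx, biases, result));
    // Leaky ReLU activation, as used by YOLO's convolutional layers.
    return ggml_leaky_relu(ctx, result, 0.1f, true);
}

In the .tflite case, by contrast, the stride and fused activation appear as builtin_options on the operator itself, while the weight and bias tensors are simply two of the operator's inputs.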
At the end of the last post we took a look at a custom operator (ETHOSU), which is used on microcontrollers with Arm's Ethos-U Neural Processing Units (NPUs). We also mentioned that making use of the performance improvements offered by the NPU requires passing a .tflite model through the Vela compiler, but we stopped short of exploring the result of doing so.
The network tester example uses person_detect_vela.tflite, which is the same MobileNetV1 model used in the person detection example, but passed through the Vela compiler. When the NPU is available, the single ETHOSU operator is registered.
tensorflow/lite/micro/examples/network_tester/network_tester_test.cc
#ifdef ETHOS_U
  const tflite::Model* model = ::tflite::GetModel(g_person_detect_model_data);
#else
  const tflite::Model* model = ::tflite::GetModel(network_model);
#endif

  if (model->version() != TFLITE_SCHEMA_VERSION) {
    MicroPrintf(
        "Model provided is schema version %d not equal "
        "to supported version %d.\n",
        model->version(), TFLITE_SCHEMA_VERSION);
    return kTfLiteError;
  }

#ifdef ETHOS_U
  tflite::MicroMutableOpResolver<1> resolver;
  resolver.AddEthosU();
#else
  tflite::MicroMutableOpResolver<5> resolver;
  resolver.AddAveragePool2D(tflite::Register_AVERAGE_POOL_2D_INT8());
  resolver.AddConv2D(tflite::Register_CONV_2D_INT8());
  resolver.AddDepthwiseConv2D(tflite::Register_DEPTHWISE_CONV_2D_INT8());
  resolver.AddReshape();
  resolver.AddSoftmax(tflite::Register_SOFTMAX_INT8());
#endif
The remainder of the application looks largely the same whether or not the NPU is used. To see how this manifests in the model, we can once again use flatc to convert it to JSON.
flatc --json --raw-binary schema.fbs -- person_detect_vela.tflite
The first thing to notice is that there is, as expected, a single operator listed in the operator_codes array.
{
  version: 3,
  operator_codes: [
    {
      deprecated_builtin_code: 32,
      custom_code: "ethos-u",
      builtin_code: "CUSTOM"
    }
  ],
  ...
}
Perhaps slightly more surprising is that there is also just a single node in the model.
    operators: [
      {
        inputs: [
          1,
          2,
          3,
          4,
          5
        ],
        outputs: [
          0
        ],
        custom_options: [
          1,
          4,
          1
        ],
        mutating_variable_inputs: [
        ],
        intermediates: [
        ]
      }
    ]
How could the same model be reduced to such a simplified form? When we consider that the ETHOSU kernel effectively just points the NPU at the location of commands and data in memory, it makes sense that the work done directly by the tflite-micro runtime would be quite minimal. However, all of the data that is passed to the NPU still needs to be available. The top-level buffers array in the Model table contains all of the relevant data, and the tensors that are passed as inputs to the ETHOSU operator reference the appropriate buffer by index.
    tensors: [
      {
        shape: [
          1,
          2
        ],
        type: "INT8",
        name: "MobilenetV1/Predictions/Reshape_1",
        quantization: {
          scale: [
            0.003906
          ],
          zero_point: [
            -128
          ]
        }
      },
      {
        shape: [
          8168
        ],
        type: "UINT8",
        buffer: 1,
        name: "_split_1_command_stream"
      },
      {
        shape: [
          260480
        ],
        type: "UINT8",
        buffer: 2,
        name: "_split_1_flash"
      },
      {
        shape: [
          74464
        ],
        type: "UINT8",
        name: "_split_1_scratch"
      },
      {
        shape: [
          74464
        ],
        type: "UINT8",
        name: "_split_1_scratch_fast"
      },
      {
        shape: [
          1,
          96,
          96,
          1
        ],
        type: "INT8",
        name: "input",
        quantization: {
          min: [
            -1.0
          ],
          max: [
            1.0
          ],
          scale: [
            0.007843
          ],
          zero_point: [
            -1
          ]
        }
      }
    ]
This example demonstrates the flexibility of the .tflite model format, as well as the general portability offered by formats that include the compute graph in the model file. We'll continue exploring the trade-offs offered by different formats and runtimes, and dive deeper into how operator registration and kernel execution work, in future posts.