We started this series with a look at operators and kernels, the “instructions” used by models and the implementation of those instructions on the available hardware. We then explored the computation graph, which defines the sequence of operators for a given model, and examined how different model formats opt to include the explicit computation graph in the distributed file or defer it to the inference application. With tflite-micro and the .tflite FlatBuffers file format sitting in the former category, the computation graph does not have to be defined in the inference application, but the operators used in the computation graph do have to be made available to the TensorFlow Lite interpreter that performs inference.
When running in less constrained environments, where the literal code size of an application is not a primary concern, ensuring that all supported operators are always available can be a simpler approach. For example, the default in TensorFlow is to include all operators in the compiled library. However, the SELECTIVE_REGISTRATION feature can be leveraged to include only the necessary operators if you are aware of the models and their required operators at compile time. Likewise, TensorFlow Lite applications, which are designed to run on “edge” devices, but not necessarily microcontrollers, will typically use the BuiltinOpResolver for simple models.
tflite::ops::builtin::BuiltinOpResolver resolver;
When custom operators are required, they can be added to the builtins by constructing a MutableOpResolver, registering the builtins, and then registering any custom operators.
tflite::MutableOpResolver resolver;
resolver.AddAll(tflite::ops::builtin::BuiltinOpResolver());
resolver.AddCustom("Atan", AtanOpRegistration());
If only a subset of the builtin operators is desired, they can be added one-by-one by calling AddBuiltin(), or a custom OpResolver can be generated using tooling provided in the repository.
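As a rough sketch, assuming a hypothetical model that only requires the CONV_2D and FULLY_CONNECTED builtins, such a subset registration might look like the following (the Register_* factory functions ship alongside the builtin kernels).
tflite::MutableOpResolver resolver;
// Register only the two builtin operators the model actually uses.
resolver.AddBuiltin(tflite::BuiltinOperator_CONV_2D,
                    tflite::ops::builtin::Register_CONV_2D());
resolver.AddBuiltin(tflite::BuiltinOperator_FULLY_CONNECTED,
                    tflite::ops::builtin::Register_FULLY_CONNECTED());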
The MutableOpResolver is a derived class of the OpResolver interface class. Any generated custom OpResolver would be similarly derived, though it would likely not allow for mutating the registered operators. The primary role of an OpResolver is to allow the interpreter, which we’ll cover in a future post, to look up the registered operator implementations (i.e. kernels), if any exist. This is evidenced by the two public virtual FindOp() member functions on the interface class.
tensorflow/lite/core/api/op_resolver.h
/// Abstract interface that returns TfLiteRegistrations given op codes or custom
/// op names. This is the mechanism that ops being referenced in the flatbuffer
/// model are mapped to executable function pointers (TfLiteRegistrations).
///
/// The lifetime of the TfLiteRegistration object whose address is
/// returned by FindOp must exceed the lifetime of any InterpreterBuilder or
/// Interpreter created with this OpResolver.
/// Likewise the lifetime of the TfLiteOperator object referenced
/// from the TfLiteRegistration object, if any, must exceed the lifetime of
/// any InterpreterBuilder or Interpreter created with this OpResolver.
class OpResolver {
 public:
  /// Finds the op registration for a builtin operator by enum code.
  virtual const TfLiteRegistration* FindOp(tflite::BuiltinOperator op,
                                           int version) const = 0;

  /// Finds the op registration of a custom operator by op name.
  virtual const TfLiteRegistration* FindOp(const char* op,
                                           int version) const = 0;
  …
};
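To make the lookup concrete, here is a minimal sketch, not taken from the interpreter sources, of how a caller might resolve the builtin ADD operator through this interface.
// `resolver` is any concrete OpResolver (e.g. a MutableOpResolver).
const tflite::OpResolver& op_resolver = resolver;
const TfLiteRegistration* registration =
    op_resolver.FindOp(tflite::BuiltinOperator_ADD, /*version=*/1);
if (registration == nullptr) {
  // No kernel was registered for this op/version; the interpreter
  // would fail to build the graph.
}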
In the MutableOpResolver derived class, the FindOp member functions are declared, along with additional member functions to register operators. The registered operators are stored in the std::unordered_map members builtins_ and custom_ops_, respectively.
tensorflow/lite/mutable_op_resolver.h
/// An OpResolver that is mutable, also used as the op in gen_op_registration.
/// A typical usage:
/// MutableOpResolver resolver;
/// resolver.AddBuiltin(BuiltinOperator_ADD, Register_ADD());
/// resolver.AddCustom("CustomOp", Register_CUSTOM_OP());
/// InterpreterBuilder(model, resolver)(&interpreter);
class MutableOpResolver : public OpResolver {
 public:
  const TfLiteRegistration* FindOp(tflite::BuiltinOperator op,
                                   int version) const override;
  const TfLiteRegistration* FindOp(const char* op, int version) const override;

  /// Registers the specified `version` of the specified builtin operator `op`.
  /// Replaces any previous registration for the same operator version.
  void AddBuiltin(tflite::BuiltinOperator op,
                  const TfLiteRegistration* registration, int version = 1);

  /// Registers the specified version range (versions `min_version` to
  /// `max_version`, inclusive) of the specified builtin operator `op`.
  /// Replaces any previous registration for the same operator version.
  void AddBuiltin(tflite::BuiltinOperator op,
                  const TfLiteRegistration* registration, int min_version,
                  int max_version);

  /// Registers the specified `version` of the specified custom operator
  /// `name`. Replaces any previous registration for the same operator version.
  /// Warning: use of this method in new code is discouraged: for new code,
  /// we recommend using tflite::AddOp (from mutable_op_resolver_utils.h)
  /// rather than tflite::MutableOpResolver::AddCustom.
  void AddCustom(const char* name, const TfLiteRegistration* registration,
                 int version = 1);

  /// Registers the specified version range (versions `min_version` to
  /// `max_version`, inclusive) of the specified custom operator `name`.
  /// Replaces any previous registration for the same operator version.
  /// Warning: use of this method in new code is discouraged: for new code,
  /// we recommend using tflite::AddOp (from mutable_op_resolver_utils.h)
  /// rather than tflite::MutableOpResolver::AddCustom.
  void AddCustom(const char* name, const TfLiteRegistration* registration,
                 int min_version, int max_version);

  /// Registers all operator versions supported by another MutableOpResolver.
  /// Replaces any previous registrations for the same operator versions,
  /// except that registrations made with `AddOp`, `AddBuiltin` or `AddCustom`
  /// always take precedence over registrations made with `ChainOpResolver`.
  void AddAll(const MutableOpResolver& other);

  …

 private:
  bool MayContainUserDefinedOps() const override;

  typedef std::pair<tflite::BuiltinOperator, int> BuiltinOperatorKey;
  typedef std::pair<std::string, int> CustomOperatorKey;

  std::unordered_map<BuiltinOperatorKey, TfLiteRegistration,
                     op_resolver_hasher::OperatorKeyHasher<BuiltinOperatorKey> >
      builtins_;
  std::unordered_map<CustomOperatorKey, TfLiteRegistration,
                     op_resolver_hasher::OperatorKeyHasher<CustomOperatorKey> >
      custom_ops_;
  std::vector<const OpResolver*> other_op_resolvers_;
};
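For example, a sketch of registering a version range, assuming a hypothetical model that requires versions 1 through 3 of the builtin ADD operator:
tflite::MutableOpResolver resolver;
// Register versions 1 through 3 (inclusive) of ADD with a single call.
resolver.AddBuiltin(tflite::BuiltinOperator_ADD,
                    tflite::ops::builtin::Register_ADD(),
                    /*min_version=*/1, /*max_version=*/3);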
The BuiltinOpResolver is actually derived from the MutableOpResolver, and automatically adds all builtin operators.
tensorflow/lite/core/kernels/register.h
// This built-in op resolver provides a list of TfLite delegates that could be
// applied by TfLite interpreter by default.
class BuiltinOpResolver : public MutableOpResolver {
 public:
  // NOTE: we *deliberately* don't define any virtual functions here to avoid
  // behavior changes when users pass a derived instance by value or assign a
  // derived instance to a variable of this class. See "object slicing"
  // (https://en.wikipedia.org/wiki/Object_slicing) for details.
  BuiltinOpResolver();
};
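Putting it together, typical usage on a non-micro target mirrors the doc comment on MutableOpResolver above; this sketch assumes model points to a tflite::FlatBufferModel that has already been loaded.
tflite::ops::builtin::BuiltinOpResolver resolver;
std::unique_ptr<tflite::Interpreter> interpreter;
// The builder resolves every operator code in the model through the
// OpResolver before constructing the interpreter.
tflite::InterpreterBuilder(*model, resolver)(&interpreter);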
When the interpreter looks up an operation using FindOp(), it ultimately gets back a TfLiteRegistration pointer. The registration is what allows the actual kernel to eventually be invoked. The documentation on the struct members describes the point(s) at which each function pointer is accessed and called.
tensorflow/lite/core/c/common.h
/// `TfLiteRegistration` defines the implementation of an operation
/// (a built-in op, custom op, or custom delegate kernel).
///
/// It is a struct containing "methods" (C function pointers) that will be
/// invoked by the TF Lite runtime to evaluate instances of the operation.
///
/// See also `TfLiteOperator` which is a more ABI-stable equivalent.
typedef struct TfLiteRegistration {
  /// Initializes the op from serialized data.
  /// Called only *once* for the lifetime of the op, so any one-time allocations
  /// should be made here (unless they depend on tensor sizes).
  ///
  /// * If a built-in op:
  ///   * `buffer` is the op's params data (TfLiteLSTMParams*).
  ///   * `length` is zero.
  /// * If custom op:
  ///   * `buffer` is the op's `custom_options`.
  ///   * `length` is the size of the buffer.
  ///
  /// Returns a type-punned (i.e. void*) opaque data (e.g. a primitive pointer
  /// or an instance of a struct).
  ///
  /// The returned pointer will be stored with the node in the `user_data`
  /// field, accessible within prepare and invoke functions below.
  ///
  /// NOTE: if the data is already in the desired format, simply implement this
  /// function to return `nullptr` and implement the free function to be a
  /// no-op.
  void* (*init)(TfLiteContext* context, const char* buffer, size_t length);

  /// The pointer `buffer` is the data previously returned by an init
  /// invocation.
  void (*free)(TfLiteContext* context, void* buffer);

  /// prepare is called when the inputs this node depends on have been resized.
  /// `context->ResizeTensor()` can be called to request output tensors to be
  /// resized.
  /// Can be called multiple times for the lifetime of the op.
  ///
  /// Returns `kTfLiteOk` on success.
  TfLiteStatus (*prepare)(TfLiteContext* context, TfLiteNode* node);

  /// Execute the node (should read `node->inputs` and output to
  /// `node->outputs`).
  ///
  /// Returns `kTfLiteOk` on success.
  TfLiteStatus (*invoke)(TfLiteContext* context, TfLiteNode* node);

  /// `profiling_string` is called during summarization of profiling information
  /// in order to group executions together. Providing a value here will cause a
  /// given op to appear multiple times in the profiling report. This is
  /// particularly useful for custom ops that can perform significantly
  /// different calculations depending on their `user-data`.
  const char* (*profiling_string)(const TfLiteContext* context,
                                  const TfLiteNode* node);

  /// Builtin codes. If this kernel refers to a builtin this is the code
  /// of the builtin. This is so we can do marshaling to other frameworks like
  /// NN API.
  ///
  /// Note: It is the responsibility of the registration binder to set this
  /// properly.
  int32_t builtin_code;

  /// Custom op name. If the op is a builtin, this will be `null`.
  ///
  /// Note: It is the responsibility of the registration binder to set this
  /// properly.
  ///
  /// WARNING: This is an experimental interface that is subject to change.
  const char* custom_name;

  /// The version of the op.
  /// Note: It is the responsibility of the registration binder to set this
  /// properly.
  int version;

  /// The external (i.e. ABI-stable) version of `TfLiteRegistration`.
  /// Since we can't use internal types (such as `TfLiteContext`) for C API to
  /// maintain ABI stability. C API user will provide `TfLiteOperator` to
  /// implement custom ops. We keep it inside of `TfLiteRegistration` and use
  /// it to route callbacks properly.
  TfLiteOperator* registration_external;

  /// Retrieves asynchronous kernel.
  ///
  /// If the `async_kernel` field is nullptr, it means the operation described
  /// by this TfLiteRegistration object does not support asynchronous execution.
  /// Otherwise, the function that the field points to should only be called for
  /// delegate kernel nodes, i.e. `node` should be a delegate kernel node
  /// created by applying a delegate. If the function returns nullptr, that
  /// means that the underlying delegate does not support asynchronous execution
  /// for this `node`.
  struct TfLiteAsyncKernel* (*async_kernel)(TfLiteContext* context,
                                            TfLiteNode* node);

  /// Indicates if an operator's output may safely overwrite its inputs.
  /// See the comments in `TfLiteInPlaceOp`.
  uint64_t inplace_operator;
} TfLiteRegistration;
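To connect the struct back to the AtanOpRegistration() call from earlier, here is a hedged sketch of what such a custom operator implementation could look like, loosely following the TensorFlow Lite custom-operator documentation. The function names are illustrative, and a production kernel would also validate tensor types and counts.
#include <cmath>

#include "tensorflow/lite/core/c/common.h"
#include "tensorflow/lite/kernels/kernel_util.h"

namespace {

// No per-op state is needed: init returns nullptr and free is a no-op.
void* AtanInit(TfLiteContext* context, const char* buffer, size_t length) {
  return nullptr;
}

void AtanFree(TfLiteContext* context, void* buffer) {}

// Called whenever the input tensors are (re)sized: make the output
// tensor's shape match the input's.
TfLiteStatus AtanPrepare(TfLiteContext* context, TfLiteNode* node) {
  const TfLiteTensor* input = tflite::GetInput(context, node, 0);
  TfLiteTensor* output = tflite::GetOutput(context, node, 0);
  TfLiteIntArray* output_dims = TfLiteIntArrayCopy(input->dims);
  return context->ResizeTensor(context, output, output_dims);
}

// Called for every inference: compute atan element-wise over floats.
TfLiteStatus AtanEval(TfLiteContext* context, TfLiteNode* node) {
  const TfLiteTensor* input = tflite::GetInput(context, node, 0);
  TfLiteTensor* output = tflite::GetOutput(context, node, 0);
  const int64_t count = tflite::NumElements(input);
  for (int64_t i = 0; i < count; ++i) {
    output->data.f[i] = std::atan(input->data.f[i]);
  }
  return kTfLiteOk;
}

}  // namespace

TfLiteRegistration* AtanOpRegistration() {
  // Aggregate-initialize the first four members (init, free, prepare,
  // invoke); the remaining members are zero-initialized.
  static TfLiteRegistration registration = {AtanInit, AtanFree, AtanPrepare,
                                            AtanEval};
  return &registration;
}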
On microcontrollers, resources, especially flash memory (ROM) and RAM, are at a premium. While tflite-micro shares many similarities with TensorFlow Lite, there are some key differences that stand out.
Tracing the history of tflite-micro is somewhat tricky, as it was moved across locations within the tensorflow/tensorflow repository before eventually landing in the tensorflow/tflite-micro repository, where it resides today. The original commit (a7e8ad) was from Pete Warden in 2018, and it included an excellent document describing the motivations, trade-offs, and future improvements.
These motivations, such as avoiding the use of dynamic memory allocation, are manifested in the differences between tflite-micro and TensorFlow Lite. For example, tflite-micro now implements its own stripped-down version of the OpResolver interface class, named the MicroOpResolver, and a derived MicroMutableOpResolver template class. While the overuse of templates can lead to increased code size, in this case a non-type template parameter (tOpCount) is helpfully used to set the size of the array of registered operators at compile time rather than relying on a dynamically allocated std::unordered_map.
tensorflow/lite/micro/micro_mutable_op_resolver.h
namespace tflite {
TFLMRegistration* Register_DETECTION_POSTPROCESS();

template <unsigned int tOpCount>
class MicroMutableOpResolver : public MicroOpResolver {
 public:
  TF_LITE_REMOVE_VIRTUAL_DELETE

  explicit MicroMutableOpResolver() {}

  const TFLMRegistration* FindOp(tflite::BuiltinOperator op) const override {
    if (op == BuiltinOperator_CUSTOM) return nullptr;

    for (unsigned int i = 0; i < registrations_len_; ++i) {
      const TFLMRegistration& registration = registrations_[i];
      if (registration.builtin_code == op) {
        return &registration;
      }
    }
    return nullptr;
  }

  …

  // Registers a Custom Operator with the MicroOpResolver.
  //
  // Only the first call for a given name will be successful. i.e. if this
  // function is called again for a previously added Custom Operator, the
  // MicroOpResolver will be unchanged and this function will return
  // kTfLiteError.
  TfLiteStatus AddCustom(const char* name,
                         const TFLMRegistration* registration) {
    if (registrations_len_ >= tOpCount) {
      MicroPrintf(
          "Couldn't register custom op '%s', resolver size is too"
          "small (%d)",
          name, tOpCount);
      return kTfLiteError;
    }

    if (FindOp(name) != nullptr) {
      MicroPrintf("Calling AddCustom for the same op more than once ");
      MicroPrintf("is not supported (Op: %s).", name);
      return kTfLiteError;
    }

    TFLMRegistration* new_registration = &registrations_[registrations_len_];
    registrations_len_ += 1;

    *new_registration = *registration;
    new_registration->builtin_code = BuiltinOperator_CUSTOM;
    new_registration->custom_name = name;
    return kTfLiteOk;
  }

  // The Add* functions below add the various Builtin operators to the
  // MicroMutableOpResolver object.

  TfLiteStatus AddAbs() {
    return AddBuiltin(BuiltinOperator_ABS, Register_ABS(), ParseAbs);
  }

  TfLiteStatus AddAdd(const TFLMRegistration& registration = Register_ADD()) {
    return AddBuiltin(BuiltinOperator_ADD, registration, ParseAdd);
  }

  …

  TfLiteStatus AddWindow() {
    // TODO(b/286250473): change back name to "Window" and remove namespace
    return AddCustom("SignalWindow", tflite::tflm_signal::Register_WINDOW());
  }

  TfLiteStatus AddZerosLike() {
    return AddBuiltin(BuiltinOperator_ZEROS_LIKE, Register_ZEROS_LIKE(),
                      ParseZerosLike);
  }

  unsigned int GetRegistrationLength() { return registrations_len_; }

 private:
  TfLiteStatus AddBuiltin(tflite::BuiltinOperator op,
                          const TFLMRegistration& registration,
                          TfLiteBridgeBuiltinParseFunction parser) {
    if (op == BuiltinOperator_CUSTOM) {
      MicroPrintf("Invalid parameter BuiltinOperator_CUSTOM to the ");
      MicroPrintf("AddBuiltin function.");
      return kTfLiteError;
    }

    if (FindOp(op) != nullptr) {
      MicroPrintf("Calling AddBuiltin with the same op more than ");
      MicroPrintf("once is not supported (Op: #%d).", op);
      return kTfLiteError;
    }

    if (registrations_len_ >= tOpCount) {
      MicroPrintf("Couldn't register builtin op #%d, resolver size ", op);
      MicroPrintf("is too small (%d).", tOpCount);
      return kTfLiteError;
    }

    registrations_[registrations_len_] = registration;
    // Strictly speaking, the builtin_code is not necessary for TFLM but
    // filling it in regardless.
    registrations_[registrations_len_].builtin_code = op;
    registrations_len_++;

    builtin_codes_[num_buitin_ops_] = op;
    builtin_parsers_[num_buitin_ops_] = parser;
    num_buitin_ops_++;

    return kTfLiteOk;
  }

  TFLMRegistration registrations_[tOpCount];
  unsigned int registrations_len_ = 0;

  // Arrays (and counter) to store the builtin codes and their corresponding
  // parse functions as these are registered with the Op Resolver.
  BuiltinOperator builtin_codes_[tOpCount];
  TfLiteBridgeBuiltinParseFunction builtin_parsers_[tOpCount];
  unsigned int num_buitin_ops_ = 0;
};

}  // namespace tflite
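As a sketch of the intended usage pattern, sizing the template for exactly the operators a hypothetical two-operator model needs:
// tOpCount is 2, so exactly two Add* calls can succeed; a third would
// log an error and return kTfLiteError.
static tflite::MicroMutableOpResolver<2> resolver;
resolver.AddFullyConnected();
resolver.AddSoftmax();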
We’ve seen the MicroMutableOpResolver in action in previous posts when looking at registering operators in examples, such as the person_detection application. Let’s take a look at the even simpler hello_world example, which registers a single operator (FULLY_CONNECTED).
Because we actually want to (finally!) build an application that targets a real microcontroller, we’ll use the Zephyr RTOS variant of the example. The instantiation of the MicroMutableOpResolver takes place in the setup() function.
samples/modules/tflite-micro/hello_world/src/main_functions.cpp
void setup(void)
{
	/* Map the model into a usable data structure. This doesn't involve any
	 * copying or parsing, it's a very lightweight operation.
	 */
	model = tflite::GetModel(g_model);
	if (model->version() != TFLITE_SCHEMA_VERSION) {
		MicroPrintf("Model provided is schema version %d not equal "
			    "to supported version %d.",
			    model->version(), TFLITE_SCHEMA_VERSION);
		return;
	}

	/* This pulls in the operation implementations we need.
	 * NOLINTNEXTLINE(runtime-global-variables)
	 */
	static tflite::MicroMutableOpResolver<1> resolver;
	resolver.AddFullyConnected();

	/* Build an interpreter to run the model with. */
	static tflite::MicroInterpreter static_interpreter(
		model, resolver, tensor_arena, kTensorArenaSize);
	interpreter = &static_interpreter;

	/* Allocate memory from the tensor_arena for the model's tensors. */
	TfLiteStatus allocate_status = interpreter->AllocateTensors();
	if (allocate_status != kTfLiteOk) {
		MicroPrintf("AllocateTensors() failed");
		return;
	}

	/* Obtain pointers to the model's input and output tensors. */
	input = interpreter->input(0);
	output = interpreter->output(0);

	/* Keep track of how many inferences we have performed. */
	inference_count = 0;
}
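Note that the resolver is instantiated with tOpCount set to 1, matching the single AddFullyConnected() call; per the AddBuiltin() implementation shown earlier, a second Add* call on this resolver would log an error and return kTfLiteError.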
We’ll build the example for an Arm Cortex-M33 microcontroller, namely the Nordic Semiconductor nRF9160 on the nRF9160 Development Kit (DK). After following the Zephyr setup instructions, the sample can be built for our target with the following command.
west build -p -b nrf9160dk/nrf9160/ns samples/modules/tflite-micro/hello_world
While this is an extremely minimal use case, we can see that the application ROM
footprint, which includes both the tflite-micro
components and the Zephyr
kernel, is fairly small (56 KB).
Memory region Used Size Region Size %age Used
FLASH: 56604 B 192 KB 28.79%
RAM: 7608 B 168 KB 4.42%
IDT_LIST: 0 GB 32 KB 0.00%
We can use Zephyr’s rom_report target (-t rom_report) to see how much tflite-micro, and specifically operator resolution, is contributing to the code size.
Scroll right to see size and % of ROM used.
├── optional 14948 26.42% -
│ └── modules 14948 26.42% -
│ └── lib 14948 26.42% -
│ └── tflite-micro 14948 26.42% -
│ ├── tensorflow 14788 26.14% -
...
│ │ └── lite 14678 25.94% -
...
│ │ ├── kernels 1422 2.51% -
│ │ │ ├── internal 782 1.38% -
│ │ │ │ ├── common.cc 260 0.46% -
│ │ │ │ │ ├── _ZN6tflite29MultiplyByQuantizedMultiplierEiii 132 0.23% 0x00054acd text
│ │ │ │ │ └── _ZN6tflite29MultiplyByQuantizedMultiplierExii 128 0.23% 0x00054b51 text
│ │ │ │ ├── portable_tensor_utils.cc 54 0.10% -
│ │ │ │ │ └── _ZN6tflite12tensor_utils23UnpackDenseInt4IntoInt8EPKaiPa 54 0.10% 0x0005a313 text
│ │ │ │ ├── quantization_util.cc 116 0.21% -
│ │ │ │ │ └── _ZN6tflite18QuantizeMultiplierEdPiS0_ 116 0.21% 0x00054bd1 text
│ │ │ │ ├── reference 256 0.45% -
│ │ │ │ │ └── integer_ops 256 0.45% -
│ │ │ │ │ └── fully_connected.h 256 0.45% -
│ │ │ │ │ └── _ZN6tflite21reference_integer_ops14FullyConnectedIaaaiEEvRKNS_20FullyConnectedParamsERKNS_12RuntimeShapeEPKT_S7_PKT0_S7_PKT2_S7_PT1_.isra.0 256 0.45% 0x0005a3d5 text
...
│ │ ├── micro 12652 22.36% -
...
│ │ │ ├── kernels 2524 4.46% -
│ │ │ │ ├── fully_connected.cc 1858 3.28% -
│ │ │ │ │ ├── _ZN6tflite12_GLOBAL__N_118FullyConnectedEvalEP13TfLiteContextP10TfLiteNode 1380 2.44% 0x0005514d text
│ │ │ │ │ ├── _ZN6tflite12_GLOBAL__N_118FullyConnectedInitEP13TfLiteContextPKcj 18 0.03% 0x0005a395 text
│ │ │ │ │ ├── _ZN6tflite12_GLOBAL__N_121FullyConnectedPrepareEP13TfLiteContextP10TfLiteNode 420 0.74% 0x00054f81 text
│ │ │ │ │ └── _ZN6tflite24Register_FULLY_CONNECTEDEv 40 0.07% 0x00055125 text
│ │ │ │ ├── fully_connected_common.cc 502 0.89% -
│ │ │ │ │ ├── _ZN6tflite25FullyConnectedParamsFloatE21TfLiteFusedActivation 56 0.10% 0x000556b1 text
│ │ │ │ │ ├── _ZN6tflite29CalculateOpDataFullyConnectedEP13TfLiteContext21TfLiteFusedActivation10TfLiteTypePK12TfLiteTensorS6_S6_PS4_PNS_20OpDataFullyConnectedE 412 0.73% 0x000556e9 text
│ │ │ │ │ └── _ZN6tflite29FullyConnectedParamsQuantizedERKNS_20OpDataFullyConnectedE 34 0.06% 0x0005a4d5 text
│ │ │ │ ├── kernel_util.cc 152 0.27% -
│ │ │ │ │ ├── _ZN6tflite5micro10RegisterOpEPFPvP13TfLiteContextPKcjEPF12TfLiteStatusS3_P10TfLiteNodeESC_PFvS3_S1_ESE_ 24 0.04% 0x0005a4f7 text
│ │ │ │ │ ├── _ZN6tflite5micro12GetEvalInputEPK13TfLiteContextPK10TfLiteNodei 4 0.01% 0x0005a545 text
│ │ │ │ │ ├── _ZN6tflite5micro13GetEvalOutputEPK13TfLiteContextPK10TfLiteNodei 28 0.05% 0x0005a549 text
│ │ │ │ │ ├── _ZN6tflite5micro14GetTensorShapeEPK16TfLiteEvalTensor 42 0.07% 0x0005a565 text
│ │ │ │ │ └── _ZN6tflite5micro19GetMutableEvalInputEPK13TfLiteContextPK10TfLiteNodei 54 0.10% 0x0005a50f text
│ │ │ │ └── kernel_util.h 12 0.02% -
│ │ │ │ └── _ZN6tflite5micro13GetTensorDataIaEEPKT_PK16TfLiteEvalTensor 12 0.02% 0x0005a389 text
...
│ │ │ ├── micro_mutable_op_resolver.h 128 0.23% -
│ │ │ │ ├── _ZN6tflite22MicroMutableOpResolverILj1EED0Ev 14 0.02% 0x00058541 text
│ │ │ │ ├── _ZN6tflite22MicroMutableOpResolverILj1EED2Ev 2 0.00% 0x000584e7 text
│ │ │ │ ├── _ZNK6tflite22MicroMutableOpResolverILj1EE15GetOpDataParserENS_15BuiltinOperatorE 30 0.05% 0x000584e9 text
│ │ │ │ ├── _ZNK6tflite22MicroMutableOpResolverILj1EE6FindOpENS_15BuiltinOperatorE 24 0.04% 0x0005854f text
│ │ │ │ └── _ZNK6tflite22MicroMutableOpResolverILj1EE6FindOpEPKc 58 0.10% 0x00058507 text
│ │ │ ├── micro_op_resolver.cc 148 0.26% -
│ │ │ │ └── _ZN6tflite25GetRegistrationFromOpCodeEPKNS_12OperatorCodeERKNS_15MicroOpResolverEPPK16TFLMRegistration 148 0.26% 0x0005466d text
│ │ │ ├── micro_profiler.h 20 0.04% -
│ │ │ │ └── _ZN6tflite19ScopedMicroProfilerD1Ev 20 0.04% 0x00059835 text
...
The MicroMutableOpResolver member function for registering the FULLY_CONNECTED operator is structured as follows.
tensorflow/lite/micro/micro_mutable_op_resolver.h
  TfLiteStatus AddFullyConnected(
      const TFLMRegistration& registration = Register_FULLY_CONNECTED()) {
    return AddBuiltin(BuiltinOperator_FULLY_CONNECTED, registration,
                      ParseFullyConnected);
  }
If we dump the body of the setup() function, we can see that, as expected for functions defined in a header file, AddFullyConnected and AddBuiltin are inlined, and we end up calling Register_FULLY_CONNECTED directly.
~/.local/zephyr-sdk-0.17.0/arm-zephyr-eabi/bin/arm-zephyr-eabi-objdump -D build/zephyr/zephyr.elf
Portions of the <setup> dump are removed for brevity.
0005191c <setup>:
5191c: b5f0 push {r4, r5, r6, r7, lr}
5191e: 4b47 ldr r3, [pc, #284] ; (51a3c <setup+0x120>)
51920: 4f47 ldr r7, [pc, #284] ; (51a40 <setup+0x124>)
51922: 6819 ldr r1, [r3, #0]
51924: b08d sub sp, #52 ; 0x34
...
51972: a805 add r0, sp, #20
51974: f003 fbd6 bl 55124 <_ZN6tflite24Register_FULLY_CONNECTEDEv>
51978: 2109 movs r1, #9
5197a: 4834 ldr r0, [pc, #208] ; (51a4c <setup+0x130>)
5197c: f006 fde7 bl 5854e <_ZNK6tflite22MicroMutableOpResolverILj1EE6FindOpENS_15BuiltinOperatorE>
51980: b358 cbz r0, 519da <setup+0xbe>
...
519da: 6a2b ldr r3, [r5, #32]
519dc: b133 cbz r3, 519ec <setup+0xd0>
519de: 2109 movs r1, #9
519e0: 4826 ldr r0, [pc, #152] ; (51a7c <setup+0x160>)
519e2: f008 fa4e bl 59e82 <_Z11MicroPrintfPKcz>
519e6: 2101 movs r1, #1
519e8: 4825 ldr r0, [pc, #148] ; (51a80 <setup+0x164>)
519ea: e7cf b.n 5198c <setup+0x70>
519ec: 4e25 ldr r6, [pc, #148] ; (51a84 <setup+0x168>)
519ee: ac05 add r4, sp, #20
519f0: cc0f ldmia r4!, {r0, r1, r2, r3}
519f2: c60f stmia r6!, {r0, r1, r2, r3}
519f4: e894 0007 ldmia.w r4, {r0, r1, r2}
519f8: 2301 movs r3, #1
519fa: e886 0007 stmia.w r6, {r0, r1, r2}
519fe: 2209 movs r2, #9
51a00: 622b str r3, [r5, #32]
51a02: 6aeb ldr r3, [r5, #44] ; 0x2c
51a04: 61aa str r2, [r5, #24]
51a06: eb05 0183 add.w r1, r5, r3, lsl #2
51a0a: 624a str r2, [r1, #36] ; 0x24
51a0c: 491e ldr r1, [pc, #120] ; (51a88 <setup+0x16c>)
51a0e: f103 020a add.w r2, r3, #10
51a12: 3301 adds r3, #1
51a14: f845 1022 str.w r1, [r5, r2, lsl #2]
51a18: 62eb str r3, [r5, #44] ; 0x2c
51a1a: e7b9 b.n 51990 <setup+0x74>
...
51a38: b00d add sp, #52 ; 0x34
51a3a: bdf0 pop {r4, r5, r6, r7, pc}
51a3c: 0005afa8 andeq sl, r5, r8, lsr #31
51a40: 20016384 andcs r6, r1, r4, lsl #7
51a44: 0005c2e8 andeq ip, r5, r8, ror #5
51a48: 20016340 andcs r6, r1, r0, asr #6
51a4c: 20016344 andcs r6, r1, r4, asr #6
51a50: 0005bbfc strdeq fp, [r5], -ip
51a54: 20016388 andcs r6, r1, r8, lsl #7
51a58: 000584e7 andeq r8, r5, r7, ror #9
51a5c: 0005c32f andeq ip, r5, pc, lsr #6
51a60: 0005c35e andeq ip, r5, lr, asr r3
51a64: 20016278 andcs r6, r1, r8, ror r2
51a68: 2001649c mulcs r1, ip, r4
51a6c: 2001627c andcs r6, r1, ip, ror r2
51a70: 000595d9 ldrdeq r9, [r5], -r9 ; <UNPREDICTABLE>
51a74: 20016380 andcs r6, r1, r0, lsl #7
51a78: 0005c3c3 andeq ip, r5, r3, asr #7
51a7c: 0005c37f andeq ip, r5, pc, ror r3
51a80: 0005c3b0 ; <UNDEFINED> instruction: 0x0005c3b0
51a84: 20016348 andcs r6, r1, r8, asr #6
51a88: 00054e99 muleq r5, r9, lr
51a8c: 2001637c andcs r6, r1, ip, ror r3
51a90: 20016378 andcs r6, r1, r8, ror r3
51a94: 20016374 andcs r6, r1, r4, ror r3
Prior to calling Register_FULLY_CONNECTED(), we set aside a region on the stack (add r0, sp, #20) for storing the returned TFLMRegistration for the FULLY_CONNECTED operator. Register_FULLY_CONNECTED() subsequently calls the helper RegisterOp() function.
tensorflow/lite/micro/kernels/fully_connected.cc
TFLMRegistration Register_FULLY_CONNECTED() {
return tflite::micro::RegisterOp(FullyConnectedInit, FullyConnectedPrepare,
FullyConnectedEval);
}
The last three addresses in the Register_FULLY_CONNECTED() disassembly are not instructions, but rather pointers to the FullyConnectedPrepare (0x00054f81), FullyConnectedEval (0x0005514d), and FullyConnectedInit (0x0005a395) functions. They are passed in registers to the RegisterOp() helper. Note that the least significant bit in all three function addresses is 1, indicating the use of the Thumb compressed instruction set.
00055124 <_ZN6tflite24Register_FULLY_CONNECTEDEv>:
55124: 2300 movs r3, #0
55126: b513 push {r0, r1, r4, lr}
55128: 4604 mov r4, r0
5512a: e9cd 3300 strd r3, r3, [sp]
5512e: 4a04 ldr r2, [pc, #16] ; (55140 <_ZN6tflite24Register_FULLY_CONNECTEDEv+0x1c>)
55130: 4b04 ldr r3, [pc, #16] ; (55144 <_ZN6tflite24Register_FULLY_CONNECTEDEv+0x20>)
55132: 4905 ldr r1, [pc, #20] ; (55148 <_ZN6tflite24Register_FULLY_CONNECTEDEv+0x24>)
55134: f005 f9df bl 5a4f6 <_ZN6tflite5micro10RegisterOpEPFPvP13TfLiteContextPKcjEPF12TfLiteStatusS3_P10TfLiteNodeESC_PFvS3_S1_ESE_>
55138: 4620 mov r0, r4
5513a: b002 add sp, #8
5513c: bd10 pop {r4, pc}
5513e: bf00 nop
55140: 00054f81 andeq r4, r5, r1, lsl #31
55144: 0005514d andeq r5, r5, sp, asr #2
55148: 0005a395 muleq r5, r5, r3
The RegisterOp() helper simply constructs and returns the TFLMRegistration.
tensorflow/lite/micro/kernels/kernel_util.cc
TFLMRegistration RegisterOp(
    void* (*init)(TfLiteContext* context, const char* buffer, size_t length),
    TfLiteStatus (*prepare)(TfLiteContext* context, TfLiteNode* node),
    TfLiteStatus (*invoke)(TfLiteContext* context, TfLiteNode* node),
    void (*free)(TfLiteContext* context, void* buffer),
    void (*reset)(TfLiteContext* context, void* buffer)) {
  return {/*init=*/init,
          /*free=*/free,
          /*prepare=*/prepare,
          /*invoke=*/invoke,
          /*reset*/ reset,
          /*builtin_code=*/0,
          /*custom_name=*/nullptr};
}
r0 still contains the address of the region on the stack that we set aside for the TFLMRegistration prior to calling Register_FULLY_CONNECTED(), so the passed function pointers and other data are stored at offsets relative to that address.
0005a4f6 <_ZN6tflite5micro10RegisterOpEPFPvP13TfLiteContextPKcjEPF12TfLiteStatusS3_P10TfLiteNodeESC_PFvS3_S1_ESE_>:
5a4f6: b510 push {r4, lr}
5a4f8: 60c3 str r3, [r0, #12]
5a4fa: 9b03 ldr r3, [sp, #12]
5a4fc: 6001 str r1, [r0, #0]
5a4fe: 6103 str r3, [r0, #16]
5a500: 2300 movs r3, #0
5a502: 9902 ldr r1, [sp, #8]
5a504: e9c0 3305 strd r3, r3, [r0, #20]
5a508: e9c0 1201 strd r1, r2, [r0, #4]
5a50c: bd10 pop {r4, pc}
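Mapping the stores onto the TFLMRegistration layout: init lands at offset 0 (str r1, [r0, #0]), free and prepare at offsets 4 and 8 (strd r1, r2, [r0, #4]), invoke at offset 12, reset at offset 16, and the zeroed builtin_code and custom_name at offsets 20 and 24 (strd r3, r3, [r0, #20] with r3 cleared).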
When returning back out to setup(), the TFLMRegistration on the stack needs to be copied into the resolver’s registrations_ array. The resolver was declared static, giving it static storage duration, which means it is not stored on the stack. This is necessary because the OpResolver used with a tflite-micro interpreter must have a lifetime at least as long as the interpreter itself. The resolver exists in memory next to the static_interpreter.
20016278 <_ZGVZ5setupE18static_interpreter>:
20016278: 00000000 andeq r0, r0, r0
2001627c <_ZZ5setupE18static_interpreter>:
...
20016340 <_ZGVZ5setupE8resolver>:
20016340: 00000000 andeq r0, r0, r0
20016344 <_ZZ5setupE8resolver>:
...
The bottom section of the setup() dump includes the addresses of these objects. After constructing the TFLMRegistration on the stack, we subsequently copy it into the registrations_ member array of the resolver.
519ec: 4e25 ldr r6, [pc, #148] ; (51a84 <setup+0x168>)
519ee: ac05 add r4, sp, #20
519f0: cc0f ldmia r4!, {r0, r1, r2, r3}
519f2: c60f stmia r6!, {r0, r1, r2, r3}
519f4: e894 0007 ldmia.w r4, {r0, r1, r2}
519f8: 2301 movs r3, #1
519fa: e886 0007 stmia.w r6, {r0, r1, r2}
Now that we have an idea of how each operator is registered, we can see why registering only those operators that are strictly necessary can have a significant impact on whether or not performing inference on microcontrollers is feasible. For example, additionally registering the five operators (AVERAGE_POOL_2D, CONV_2D, DEPTHWISE_CONV_2D, RESHAPE, and SOFTMAX) used in the person_detection example from our last post increases ROM usage by more than 20 KB; a sketch of the expanded resolver follows the memory report below.
Memory region Used Size Region Size %age Used
FLASH: 76708 B 192 KB 39.02%
RAM: 7792 B 168 KB 4.53%
IDT_LIST: 0 GB 32 KB 0.00%
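For reference, a sketch of the expanded resolver, assuming the default registration for each operator:
// Six operators total: FULLY_CONNECTED plus the five from person_detection.
static tflite::MicroMutableOpResolver<6> resolver;
resolver.AddAveragePool2D();
resolver.AddConv2D();
resolver.AddDepthwiseConv2D();
resolver.AddFullyConnected();
resolver.AddReshape();
resolver.AddSoftmax();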
We’ve focused on optimizing the footprint of the tflite-micro framework in this post, while happily ignoring the impact of the size and structure of the model itself. Typically, models will be much larger than the 2488 bytes occupied by the simple one used in the hello_world example. In upcoming posts, we’ll start to look at model optimization techniques and see how they impact both size and runtime performance.