We started this series with a look at operators and kernels: the “instructions” used by models and the implementations of those instructions on the available hardware. We then explored the computation graph, which defines the sequence of operators for a given model, and saw how different model formats either include the explicit computation graph in the distributed file or defer it to the inference application. With tflite-micro and the .tflite FlatBuffers file format sitting in the former category, the computation graph does not have to be defined in the inference application, but the operators used in the computation graph do have to be made available to the TensorFlow Lite interpreter that performs inference.

When running in less constrained environments, where the literal code size of an application is not a primary concern, ensuring that all supported operators are always available can be a simpler approach. For example, the default in TensorFlow is to include all operators in the compiled library, though the SELECTIVE_REGISTRATION feature can be leveraged to include only the necessary operators if the models, and therefore their required operators, are known at compile time. Likewise, TensorFlow Lite applications, which are designed to run on “edge” devices but not necessarily microcontrollers, will typically use the BuiltinOpResolver for simple models.

tflite::ops::builtin::BuiltinOpResolver resolver;

When custom operators are required, they can be added to the builtins by constructing a MutableOpResolver, registering the builtins, then registering any custom operators.

tflite::MutableOpResolver resolver;
resolver.AddAll(tflite::ops::builtin::BuiltinOpResolver());
resolver.AddCustom("Atan", AtanOpRegistration());

If only a subset of the builtin operators is desired, they can be added one-by-one by calling AddBuiltin(), or a custom OpResolver can be generated using tooling provided in the repository. The MutableOpResolver is a derived class of the OpResolver interface class. Any generated custom OpResolver would be similarly derived, though it would likely not allow mutation of the registered operators. The primary role of an OpResolver is to allow the interpreter, which we’ll cover in a future post, to look up the registered operator implementations (i.e. kernels), if any exist. This is evidenced by the two public virtual FindOp() member functions on the interface class.

tensorflow/lite/core/api/op_resolver.h

/// Abstract interface that returns TfLiteRegistrations given op codes or custom
/// op names. This is the mechanism that ops being referenced in the flatbuffer
/// model are mapped to executable function pointers (TfLiteRegistrations).
///
/// The lifetime of the TfLiteRegistration object whose address is
/// returned by FindOp must exceed the lifetime of any InterpreterBuilder or
/// Interpreter created with this OpResolver.
/// Likewise the lifetime of the TfLiteOperator object referenced
/// from the TfLiteRegistration object, if any, must exceed the lifetime of
/// any InterpreterBuilder or Interpreter created with this OpResolver.
class OpResolver {
 public:
  /// Finds the op registration for a builtin operator by enum code.
  virtual const TfLiteRegistration* FindOp(tflite::BuiltinOperator op,
                                           int version) const = 0;
  /// Finds the op registration of a custom operator by op name.
  virtual const TfLiteRegistration* FindOp(const char* op,
                                           int version) const = 0;

};

In the MutableOpResolver derived class, the FindOp member functions are overridden, along with additional member functions for registering operators. The registered operators are stored in the builtins_ and custom_ops_ std::unordered_map members, respectively.

tensorflow/lite/mutable_op_resolver.h

/// An OpResolver that is mutable, also used as the op in gen_op_registration.
/// A typical usage:
///   MutableOpResolver resolver;
///   resolver.AddBuiltin(BuiltinOperator_ADD, Register_ADD());
///   resolver.AddCustom("CustomOp", Register_CUSTOM_OP());
///   InterpreterBuilder(model, resolver)(&interpreter);
class MutableOpResolver : public OpResolver {
 public:
  const TfLiteRegistration* FindOp(tflite::BuiltinOperator op,
                                   int version) const override;
  const TfLiteRegistration* FindOp(const char* op, int version) const override;

  /// Registers the specified `version` of the specified builtin operator `op`.
  /// Replaces any previous registration for the same operator version.
  void AddBuiltin(tflite::BuiltinOperator op,
                  const TfLiteRegistration* registration, int version = 1);

  /// Registers the specified version range (versions `min_version` to
  /// `max_version`, inclusive) of the specified builtin operator `op`.
  /// Replaces any previous registration for the same operator version.
  void AddBuiltin(tflite::BuiltinOperator op,
                  const TfLiteRegistration* registration, int min_version,
                  int max_version);

  /// Registers the specified `version` of the specified builtin operator `op`.
  /// Replaces any previous registration for the same operator version.
  /// Warning: use of this method in new code is discouraged: for new code,
  /// we recommend using tflite::AddOp (from mutable_op_resolver_utils.h)
  /// rather than tflite::MutableOpResolver::AddCustom.
  void AddCustom(const char* name, const TfLiteRegistration* registration,
                 int version = 1);

  /// Registers the specified version range (versions `min_version` to
  /// `max_version`, inclusive) of the specified custom operator `name`.
  /// Replaces any previous registration for the same operator version.
  /// Warning: use of this method in new code is discouraged: for new code,
  /// we recommend using tflite::AddOp (from mutable_op_resolver_utils.h)
  /// rather than tflite::MutableOpResolver::AddCustom.
  void AddCustom(const char* name, const TfLiteRegistration* registration,
                 int min_version, int max_version);

  /// Registers all operator versions supported by another MutableOpResolver.
  /// Replaces any previous registrations for the same operator versions,
  /// except that registrations made with `AddOp`, `AddBuiltin` or `AddCustom`
  /// always take precedence over registrations made with `ChainOpResolver`.
  void AddAll(const MutableOpResolver& other);

 private:
  bool MayContainUserDefinedOps() const override;

  typedef std::pair<tflite::BuiltinOperator, int> BuiltinOperatorKey;
  typedef std::pair<std::string, int> CustomOperatorKey;

  std::unordered_map<BuiltinOperatorKey, TfLiteRegistration,
                     op_resolver_hasher::OperatorKeyHasher<BuiltinOperatorKey> >
      builtins_;
  std::unordered_map<CustomOperatorKey, TfLiteRegistration,
                     op_resolver_hasher::OperatorKeyHasher<CustomOperatorKey> >
      custom_ops_;
  std::vector<const OpResolver*> other_op_resolvers_;
};

The BuiltinOpResolver is actually derived from the MutableOpResolver, and automatically adds all builtin operators.

tensorflow/lite/core/kernels/register.h

// This built-in op resolver provides a list of TfLite delegates that could be
// applied by TfLite interpreter by default.
class BuiltinOpResolver : public MutableOpResolver {
 public:
  // NOTE: we *deliberately* don't define any virtual functions here to avoid
  // behavior changes when users pass a derived instance by value or assign a
  // derived instance to a variable of this class. See "object slicing"
  // (https://en.wikipedia.org/wiki/Object_slicing)) for details.
  BuiltinOpResolver();
};

When the interpreter looks up an operation using FindOp(), it is ultimately returned a pointer to a TfLiteRegistration. The registration is what allows the actual kernel to eventually be invoked: the struct’s function-pointer members define the points at which each kernel function is accessed and called.

/// `TfLiteRegistration` defines the implementation of an operation
/// (a built-in op, custom op, or custom delegate kernel).
///
/// It is a struct containing "methods" (C function pointers) that will be
/// invoked by the TF Lite runtime to evaluate instances of the operation.
///
/// See also `TfLiteOperator` which is a more ABI-stable equivalent.
typedef struct TfLiteRegistration {
  /// Initializes the op from serialized data.
  /// Called only *once* for the lifetime of the op, so any one-time allocations
  /// should be made here (unless they depend on tensor sizes).
  ///
  /// * If a built-in op:
  ///       * `buffer` is the op's params data (TfLiteLSTMParams*).
  ///       * `length` is zero.
  /// * If custom op:
  ///       * `buffer` is the op's `custom_options`.
  ///       * `length` is the size of the buffer.
  ///
  /// Returns a type-punned (i.e. void*) opaque data (e.g. a primitive pointer
  /// or an instance of a struct).
  ///
  /// The returned pointer will be stored with the node in the `user_data`
  /// field, accessible within prepare and invoke functions below.
  ///
  /// NOTE: if the data is already in the desired format, simply implement this
  /// function to return `nullptr` and implement the free function to be a
  /// no-op.
  void* (*init)(TfLiteContext* context, const char* buffer, size_t length);

  /// The pointer `buffer` is the data previously returned by an init
  /// invocation.
  void (*free)(TfLiteContext* context, void* buffer);

  /// prepare is called when the inputs this node depends on have been resized.
  /// `context->ResizeTensor()` can be called to request output tensors to be
  /// resized.
  /// Can be called multiple times for the lifetime of the op.
  ///
  /// Returns `kTfLiteOk` on success.
  TfLiteStatus (*prepare)(TfLiteContext* context, TfLiteNode* node);

  /// Execute the node (should read `node->inputs` and output to
  /// `node->outputs`).
  ///
  /// Returns `kTfLiteOk` on success.
  TfLiteStatus (*invoke)(TfLiteContext* context, TfLiteNode* node);

  /// `profiling_string` is called during summarization of profiling information
  /// in order to group executions together. Providing a value here will cause a
  /// given op to appear multiple times is the profiling report. This is
  /// particularly useful for custom ops that can perform significantly
  /// different calculations depending on their `user-data`.
  const char* (*profiling_string)(const TfLiteContext* context,
                                  const TfLiteNode* node);

  /// Builtin codes. If this kernel refers to a builtin this is the code
  /// of the builtin. This is so we can do marshaling to other frameworks like
  /// NN API.
  ///
  /// Note: It is the responsibility of the registration binder to set this
  /// properly.
  int32_t builtin_code;

  /// Custom op name. If the op is a builtin, this will be `null`.
  ///
  /// Note: It is the responsibility of the registration binder to set this
  /// properly.
  ///
  /// WARNING: This is an experimental interface that is subject to change.
  const char* custom_name;

  /// The version of the op.
  /// Note: It is the responsibility of the registration binder to set this
  /// properly.
  int version;

  /// The external (i.e. ABI-stable) version of `TfLiteRegistration`.
  /// Since we can't use internal types (such as `TfLiteContext`) for C API to
  /// maintain ABI stability.  C API user will provide `TfLiteOperator` to
  /// implement custom ops.  We keep it inside of `TfLiteRegistration` and use
  /// it to route callbacks properly.
  TfLiteOperator* registration_external;

  /// Retrieves asynchronous kernel.
  ///
  /// If the `async_kernel` field is nullptr, it means the operation described
  /// by this TfLiteRegistration object does not support asynchronous execution.
  /// Otherwise, the function that the field points to should only be called for
  /// delegate kernel nodes, i.e. `node` should be a delegate kernel node
  /// created by applying a delegate. If the function returns nullptr, that
  /// means that the underlying delegate does not support asynchronous execution
  /// for this `node`.
  struct TfLiteAsyncKernel* (*async_kernel)(TfLiteContext* context,
                                            TfLiteNode* node);

  /// Indicates if an operator's output may safely overwrite its inputs.
  /// See the comments in `TfLiteInPlaceOp`.
  uint64_t inplace_operator;
} TfLiteRegistration;

On microcontrollers, resources, especially flash memory (ROM) and RAM, are at a premium. While tflite-micro shares many similarities with TensorFlow Lite, some key differences stand out.

Tracing the history of tflite-micro is somewhat tricky, as the code was shuffled across locations in the tensorflow/tensorflow repository before eventually landing in the tensorflow/tflite-micro repository, where it resides today. The original commit (a7e8ad) was from Pete Warden in 2018, and it included an excellent document describing the motivations, trade-offs, and future improvements.

These motivations, such as avoiding dynamic memory allocation, manifest in the differences between tflite-micro and TensorFlow Lite. For example, tflite-micro now implements its own stripped-down version of the OpResolver interface class, named MicroOpResolver, and a derived MicroMutableOpResolver template class. While overuse of templates can lead to increased code size, in this case a non-type template parameter (tOpCount) is used to set the size of the array of registered operators at compile time, rather than relying on a dynamically allocated std::unordered_map.

tensorflow/lite/micro/micro_mutable_op_resolver.h

namespace tflite {
TFLMRegistration* Register_DETECTION_POSTPROCESS();

template <unsigned int tOpCount>
class MicroMutableOpResolver : public MicroOpResolver {
 public:
  TF_LITE_REMOVE_VIRTUAL_DELETE

  explicit MicroMutableOpResolver() {}

  const TFLMRegistration* FindOp(tflite::BuiltinOperator op) const override {
    if (op == BuiltinOperator_CUSTOM) return nullptr;

    for (unsigned int i = 0; i < registrations_len_; ++i) {
      const TFLMRegistration& registration = registrations_[i];
      if (registration.builtin_code == op) {
        return &registration;
      }
    }
    return nullptr;
  }

  // Registers a Custom Operator with the MicroOpResolver.
  //
  // Only the first call for a given name will be successful. i.e. if this
  // function is called again for a previously added Custom Operator, the
  // MicroOpResolver will be unchanged and this function will return
  // kTfLiteError.
  TfLiteStatus AddCustom(const char* name,
                         const TFLMRegistration* registration) {
    if (registrations_len_ >= tOpCount) {
      MicroPrintf(
          "Couldn't register custom op '%s', resolver size is too"
          "small (%d)",
          name, tOpCount);
      return kTfLiteError;
    }

    if (FindOp(name) != nullptr) {
      MicroPrintf("Calling AddCustom for the same op more than once ");
      MicroPrintf("is not supported (Op: %s).", name);
      return kTfLiteError;
    }

    TFLMRegistration* new_registration = &registrations_[registrations_len_];
    registrations_len_ += 1;

    *new_registration = *registration;
    new_registration->builtin_code = BuiltinOperator_CUSTOM;
    new_registration->custom_name = name;
    return kTfLiteOk;
  }

  // The Add* functions below add the various Builtin operators to the
  // MicroMutableOpResolver object.

  TfLiteStatus AddAbs() {
    return AddBuiltin(BuiltinOperator_ABS, Register_ABS(), ParseAbs);
  }

  TfLiteStatus AddAdd(const TFLMRegistration& registration = Register_ADD()) {
    return AddBuiltin(BuiltinOperator_ADD, registration, ParseAdd);
  }


  TfLiteStatus AddWindow() {
    // TODO(b/286250473): change back name to "Window" and remove namespace
    return AddCustom("SignalWindow", tflite::tflm_signal::Register_WINDOW());
  }

  TfLiteStatus AddZerosLike() {
    return AddBuiltin(BuiltinOperator_ZEROS_LIKE, Register_ZEROS_LIKE(),
                      ParseZerosLike);
  }

  unsigned int GetRegistrationLength() { return registrations_len_; }

 private:
  TfLiteStatus AddBuiltin(tflite::BuiltinOperator op,
                          const TFLMRegistration& registration,
                          TfLiteBridgeBuiltinParseFunction parser) {
    if (op == BuiltinOperator_CUSTOM) {
      MicroPrintf("Invalid parameter BuiltinOperator_CUSTOM to the ");
      MicroPrintf("AddBuiltin function.");
      return kTfLiteError;
    }

    if (FindOp(op) != nullptr) {
      MicroPrintf("Calling AddBuiltin with the same op more than ");
      MicroPrintf("once is not supported (Op: #%d).", op);
      return kTfLiteError;
    }

    if (registrations_len_ >= tOpCount) {
      MicroPrintf("Couldn't register builtin op #%d, resolver size ", op);
      MicroPrintf("is too small (%d).", tOpCount);
      return kTfLiteError;
    }

    registrations_[registrations_len_] = registration;
    // Strictly speaking, the builtin_code is not necessary for TFLM but
    // filling it in regardless.
    registrations_[registrations_len_].builtin_code = op;
    registrations_len_++;

    builtin_codes_[num_buitin_ops_] = op;
    builtin_parsers_[num_buitin_ops_] = parser;
    num_buitin_ops_++;

    return kTfLiteOk;
  }

  TFLMRegistration registrations_[tOpCount];
  unsigned int registrations_len_ = 0;

  // Arrays (and counter) to store the builtin codes and their corresponding
  // parse functions as these are registered with the Op Resolver.
  BuiltinOperator builtin_codes_[tOpCount];
  TfLiteBridgeBuiltinParseFunction builtin_parsers_[tOpCount];
  unsigned int num_buitin_ops_ = 0;
};

};  // namespace tflite

We’ve seen the MicroMutableOpResolver in action in previous posts when looking at registering operators in examples, such as the person_detection application. Let’s take a look at the even simpler hello_world example, which registers a single operator (FULLY_CONNECTED). Because we actually want to (finally!) build an application that targets a real microcontroller, we’ll use the Zephyr RTOS variant of the example. The instantiation of the MicroMutableOpResolver takes place in the setup() function.

samples/modules/tflite-micro/hello_world/src/main_functions.cpp

void setup(void)
{
	/* Map the model into a usable data structure. This doesn't involve any
	 * copying or parsing, it's a very lightweight operation.
	 */
	model = tflite::GetModel(g_model);
	if (model->version() != TFLITE_SCHEMA_VERSION) {
		MicroPrintf("Model provided is schema version %d not equal "
					"to supported version %d.",
					model->version(), TFLITE_SCHEMA_VERSION);
		return;
	}

	/* This pulls in the operation implementations we need.
	 * NOLINTNEXTLINE(runtime-global-variables)
	 */
	static tflite::MicroMutableOpResolver <1> resolver;
	resolver.AddFullyConnected();

	/* Build an interpreter to run the model with. */
	static tflite::MicroInterpreter static_interpreter(
		model, resolver, tensor_arena, kTensorArenaSize);
	interpreter = &static_interpreter;

	/* Allocate memory from the tensor_arena for the model's tensors. */
	TfLiteStatus allocate_status = interpreter->AllocateTensors();
	if (allocate_status != kTfLiteOk) {
		MicroPrintf("AllocateTensors() failed");
		return;
	}

	/* Obtain pointers to the model's input and output tensors. */
	input = interpreter->input(0);
	output = interpreter->output(0);

	/* Keep track of how many inferences we have performed. */
	inference_count = 0;
}

We’ll build the example for an Arm Cortex-M33 microcontroller, namely the Nordic Semiconductor nRF9160 on the nRF9160 Development Kit (DK). After following the Zephyr setup instructions, the sample can be built for our target with the following command.

west build -p -b nrf9160dk/nrf9160/ns samples/modules/tflite-micro/hello_world

While this is an extremely minimal use case, we can see that the application ROM footprint, which includes both the tflite-micro components and the Zephyr kernel, is fairly small (56 KB).

Memory region         Used Size  Region Size  %age Used
           FLASH:       56604 B       192 KB     28.79%
             RAM:        7608 B       168 KB      4.42%
        IDT_LIST:          0 GB        32 KB      0.00%

We can use Zephyr’s rom_report target (-t rom_report) to see how much tflite-micro, and specifically operator resolution, is contributing to the code size.

Scroll right to see size and % of ROM used.

    ├── optional                                                                                14948  26.42%  - 
    │   └── modules                                                                             14948  26.42%  - 
    │       └── lib                                                                             14948  26.42%  - 
    │           └── tflite-micro                                                                14948  26.42%  - 
    │               ├── tensorflow                                                              14788  26.14%  - 
    ...
    │               │   └── lite                                                                14678  25.94%  - 
    ...
    │               │       ├── kernels                                                          1422   2.51%  - 
    │               │       │   ├── internal                                                      782   1.38%  - 
    │               │       │   │   ├── common.cc                                                 260   0.46%  - 
    │               │       │   │   │   ├── _ZN6tflite29MultiplyByQuantizedMultiplierEiii         132   0.23%  0x00054acd text
    │               │       │   │   │   └── _ZN6tflite29MultiplyByQuantizedMultiplierExii         128   0.23%  0x00054b51 text
    │               │       │   │   ├── portable_tensor_utils.cc                                   54   0.10%  - 
    │               │       │   │   │   └── _ZN6tflite12tensor_utils23UnpackDenseInt4IntoInt8EPKaiPa 54   0.10%  0x0005a313 text
    │               │       │   │   ├── quantization_util.cc                                      116   0.21%  - 
    │               │       │   │   │   └── _ZN6tflite18QuantizeMultiplierEdPiS0_                 116   0.21%  0x00054bd1 text
    │               │       │   │   ├── reference                                                 256   0.45%  - 
    │               │       │   │   │   └── integer_ops                                           256   0.45%  - 
    │               │       │   │   │       └── fully_connected.h                                 256   0.45%  - 
    │               │       │   │   │           └── _ZN6tflite21reference_integer_ops14FullyConnectedIaaaiEEvRKNS_20FullyConnectedParamsERKNS_12RuntimeShapeEPKT_S7_PKT0_S7_PKT2_S7_PT1_.isra.0 256   0.45%  0x0005a3d5 text
    ...
    │               │       ├── micro                                                           12652  22.36%  - 
    ...
    │               │       │   ├── kernels                                                      2524   4.46%  - 
    │               │       │   │   ├── fully_connected.cc                                       1858   3.28%  - 
    │               │       │   │   │   ├── _ZN6tflite12_GLOBAL__N_118FullyConnectedEvalEP13TfLiteContextP10TfLiteNode 1380   2.44%  0x0005514d text
    │               │       │   │   │   ├── _ZN6tflite12_GLOBAL__N_118FullyConnectedInitEP13TfLiteContextPKcj 18   0.03%  0x0005a395 text
    │               │       │   │   │   ├── _ZN6tflite12_GLOBAL__N_121FullyConnectedPrepareEP13TfLiteContextP10TfLiteNode 420   0.74%  0x00054f81 text
    │               │       │   │   │   └── _ZN6tflite24Register_FULLY_CONNECTEDEv                 40   0.07%  0x00055125 text
    │               │       │   │   ├── fully_connected_common.cc                                 502   0.89%  - 
    │               │       │   │   │   ├── _ZN6tflite25FullyConnectedParamsFloatE21TfLiteFusedActivation 56   0.10%  0x000556b1 text
    │               │       │   │   │   ├── _ZN6tflite29CalculateOpDataFullyConnectedEP13TfLiteContext21TfLiteFusedActivation10TfLiteTypePK12TfLiteTensorS6_S6_PS4_PNS_20OpDataFullyConnectedE 412   0.73%  0x000556e9 text
    │               │       │   │   │   └── _ZN6tflite29FullyConnectedParamsQuantizedERKNS_20OpDataFullyConnectedE 34   0.06%  0x0005a4d5 text
    │               │       │   │   ├── kernel_util.cc                                            152   0.27%  - 
    │               │       │   │   │   ├── _ZN6tflite5micro10RegisterOpEPFPvP13TfLiteContextPKcjEPF12TfLiteStatusS3_P10TfLiteNodeESC_PFvS3_S1_ESE_ 24   0.04%  0x0005a4f7 text
    │               │       │   │   │   ├── _ZN6tflite5micro12GetEvalInputEPK13TfLiteContextPK10TfLiteNodei 4   0.01%  0x0005a545 text
    │               │       │   │   │   ├── _ZN6tflite5micro13GetEvalOutputEPK13TfLiteContextPK10TfLiteNodei 28   0.05%  0x0005a549 text
    │               │       │   │   │   ├── _ZN6tflite5micro14GetTensorShapeEPK16TfLiteEvalTensor  42   0.07%  0x0005a565 text
    │               │       │   │   │   └── _ZN6tflite5micro19GetMutableEvalInputEPK13TfLiteContextPK10TfLiteNodei 54   0.10%  0x0005a50f text
    │               │       │   │   └── kernel_util.h                                              12   0.02%  - 
    │               │       │   │       └── _ZN6tflite5micro13GetTensorDataIaEEPKT_PK16TfLiteEvalTensor 12   0.02%  0x0005a389 text
    ...
    │               │       │   ├── micro_mutable_op_resolver.h                                   128   0.23%  - 
    │               │       │   │   ├── _ZN6tflite22MicroMutableOpResolverILj1EED0Ev               14   0.02%  0x00058541 text
    │               │       │   │   ├── _ZN6tflite22MicroMutableOpResolverILj1EED2Ev                2   0.00%  0x000584e7 text
    │               │       │   │   ├── _ZNK6tflite22MicroMutableOpResolverILj1EE15GetOpDataParserENS_15BuiltinOperatorE 30   0.05%  0x000584e9 text
    │               │       │   │   ├── _ZNK6tflite22MicroMutableOpResolverILj1EE6FindOpENS_15BuiltinOperatorE 24   0.04%  0x0005854f text
    │               │       │   │   └── _ZNK6tflite22MicroMutableOpResolverILj1EE6FindOpEPKc       58   0.10%  0x00058507 text
    │               │       │   ├── micro_op_resolver.cc                                          148   0.26%  - 
    │               │       │   │   └── _ZN6tflite25GetRegistrationFromOpCodeEPKNS_12OperatorCodeERKNS_15MicroOpResolverEPPK16TFLMRegistration 148   0.26%  0x0005466d text
    │               │       │   ├── micro_profiler.h                                               20   0.04%  - 
    │               │       │   │   └── _ZN6tflite19ScopedMicroProfilerD1Ev                        20   0.04%  0x00059835 text
    ...

The MicroMutableOpResolver member function for registering the FULLY_CONNECTED operator is structured as follows.

tensorflow/lite/micro/micro_mutable_op_resolver.h

  TfLiteStatus AddFullyConnected(
      const TFLMRegistration& registration = Register_FULLY_CONNECTED()) {
    return AddBuiltin(BuiltinOperator_FULLY_CONNECTED, registration,
                      ParseFullyConnected);
  }

If we dump the body of the setup() function, we can see that, as expected for functions defined in a header file, AddFullyConnected and AddBuiltin are inlined, and we end up calling Register_FULLY_CONNECTED directly.

~/.local/zephyr-sdk-0.17.0/arm-zephyr-eabi/bin/arm-zephyr-eabi-objdump -D build/zephyr/zephyr.elf

Portions of the <setup> dump are removed for brevity.

0005191c <setup>:
   5191c:	b5f0      	push	{r4, r5, r6, r7, lr}
   5191e:	4b47      	ldr	r3, [pc, #284]	; (51a3c <setup+0x120>)
   51920:	4f47      	ldr	r7, [pc, #284]	; (51a40 <setup+0x124>)
   51922:	6819      	ldr	r1, [r3, #0]
   51924:	b08d      	sub	sp, #52	; 0x34
   ...
   51972:	a805      	add	r0, sp, #20
   51974:	f003 fbd6 	bl	55124 <_ZN6tflite24Register_FULLY_CONNECTEDEv>
   51978:	2109      	movs	r1, #9
   5197a:	4834      	ldr	r0, [pc, #208]	; (51a4c <setup+0x130>)
   5197c:	f006 fde7 	bl	5854e <_ZNK6tflite22MicroMutableOpResolverILj1EE6FindOpENS_15BuiltinOperatorE>
   51980:	b358      	cbz	r0, 519da <setup+0xbe>
   ...
   519da:	6a2b      	ldr	r3, [r5, #32]
   519dc:	b133      	cbz	r3, 519ec <setup+0xd0>
   519de:	2109      	movs	r1, #9
   519e0:	4826      	ldr	r0, [pc, #152]	; (51a7c <setup+0x160>)
   519e2:	f008 fa4e 	bl	59e82 <_Z11MicroPrintfPKcz>
   519e6:	2101      	movs	r1, #1
   519e8:	4825      	ldr	r0, [pc, #148]	; (51a80 <setup+0x164>)
   519ea:	e7cf      	b.n	5198c <setup+0x70>
   519ec:	4e25      	ldr	r6, [pc, #148]	; (51a84 <setup+0x168>)
   519ee:	ac05      	add	r4, sp, #20
   519f0:	cc0f      	ldmia	r4!, {r0, r1, r2, r3}
   519f2:	c60f      	stmia	r6!, {r0, r1, r2, r3}
   519f4:	e894 0007 	ldmia.w	r4, {r0, r1, r2}
   519f8:	2301      	movs	r3, #1
   519fa:	e886 0007 	stmia.w	r6, {r0, r1, r2}
   519fe:	2209      	movs	r2, #9
   51a00:	622b      	str	r3, [r5, #32]
   51a02:	6aeb      	ldr	r3, [r5, #44]	; 0x2c
   51a04:	61aa      	str	r2, [r5, #24]
   51a06:	eb05 0183 	add.w	r1, r5, r3, lsl #2
   51a0a:	624a      	str	r2, [r1, #36]	; 0x24
   51a0c:	491e      	ldr	r1, [pc, #120]	; (51a88 <setup+0x16c>)
   51a0e:	f103 020a 	add.w	r2, r3, #10
   51a12:	3301      	adds	r3, #1
   51a14:	f845 1022 	str.w	r1, [r5, r2, lsl #2]
   51a18:	62eb      	str	r3, [r5, #44]	; 0x2c
   51a1a:	e7b9      	b.n	51990 <setup+0x74>
   ...
   51a38:	b00d      	add	sp, #52	; 0x34
   51a3a:	bdf0      	pop	{r4, r5, r6, r7, pc}
   51a3c:	0005afa8 	andeq	sl, r5, r8, lsr #31
   51a40:	20016384 	andcs	r6, r1, r4, lsl #7
   51a44:	0005c2e8 	andeq	ip, r5, r8, ror #5
   51a48:	20016340 	andcs	r6, r1, r0, asr #6
   51a4c:	20016344 	andcs	r6, r1, r4, asr #6
   51a50:	0005bbfc 	strdeq	fp, [r5], -ip
   51a54:	20016388 	andcs	r6, r1, r8, lsl #7
   51a58:	000584e7 	andeq	r8, r5, r7, ror #9
   51a5c:	0005c32f 	andeq	ip, r5, pc, lsr #6
   51a60:	0005c35e 	andeq	ip, r5, lr, asr r3
   51a64:	20016278 	andcs	r6, r1, r8, ror r2
   51a68:	2001649c 	mulcs	r1, ip, r4
   51a6c:	2001627c 	andcs	r6, r1, ip, ror r2
   51a70:	000595d9 	ldrdeq	r9, [r5], -r9	; <UNPREDICTABLE>
   51a74:	20016380 	andcs	r6, r1, r0, lsl #7
   51a78:	0005c3c3 	andeq	ip, r5, r3, asr #7
   51a7c:	0005c37f 	andeq	ip, r5, pc, ror r3
   51a80:	0005c3b0 			; <UNDEFINED> instruction: 0x0005c3b0
   51a84:	20016348 	andcs	r6, r1, r8, asr #6
   51a88:	00054e99 	muleq	r5, r9, lr
   51a8c:	2001637c 	andcs	r6, r1, ip, ror r3
   51a90:	20016378 	andcs	r6, r1, r8, ror r3
   51a94:	20016374 	andcs	r6, r1, r4, ror r3

Prior to calling Register_FULLY_CONNECTED(), we set aside a region on the stack (add r0, sp, #20) for storing the returned TFLMRegistration; per the AAPCS, a struct this large is returned via a caller-allocated buffer whose address is passed in r0. Register_FULLY_CONNECTED() subsequently calls the helper RegisterOp() function.

tensorflow/lite/micro/kernels/fully_connected.cc

TFLMRegistration Register_FULLY_CONNECTED() {
  return tflite::micro::RegisterOp(FullyConnectedInit, FullyConnectedPrepare,
                                   FullyConnectedEval);
}

The last three addresses in the Register_FULLY_CONNECTED() disassembly are not instructions, but rather pointers to the FullyConnectedPrepare (0x00054f81), FullyConnectedEval (0x0005514d), and FullyConnectedInit (0x0005a395) functions. They are passed in registers to the RegisterOp() helper.

Note that the least significant bit in all three function addresses is 1, indicating the use of the Thumb compressed instruction set.

00055124 <_ZN6tflite24Register_FULLY_CONNECTEDEv>:
   55124:	2300      	movs	r3, #0
   55126:	b513      	push	{r0, r1, r4, lr}
   55128:	4604      	mov	r4, r0
   5512a:	e9cd 3300 	strd	r3, r3, [sp]
   5512e:	4a04      	ldr	r2, [pc, #16]	; (55140 <_ZN6tflite24Register_FULLY_CONNECTEDEv+0x1c>)
   55130:	4b04      	ldr	r3, [pc, #16]	; (55144 <_ZN6tflite24Register_FULLY_CONNECTEDEv+0x20>)
   55132:	4905      	ldr	r1, [pc, #20]	; (55148 <_ZN6tflite24Register_FULLY_CONNECTEDEv+0x24>)
   55134:	f005 f9df 	bl	5a4f6 <_ZN6tflite5micro10RegisterOpEPFPvP13TfLiteContextPKcjEPF12TfLiteStatusS3_P10TfLiteNodeESC_PFvS3_S1_ESE_>
   55138:	4620      	mov	r0, r4
   5513a:	b002      	add	sp, #8
   5513c:	bd10      	pop	{r4, pc}
   5513e:	bf00      	nop
   55140:	00054f81 	andeq	r4, r5, r1, lsl #31
   55144:	0005514d 	andeq	r5, r5, sp, asr #2
   55148:	0005a395 	muleq	r5, r5, r3

The RegisterOp() helper simply constructs and returns the TFLMRegistration.

tensorflow/lite/micro/kernels/kernel_util.cc

TFLMRegistration RegisterOp(
    void* (*init)(TfLiteContext* context, const char* buffer, size_t length),
    TfLiteStatus (*prepare)(TfLiteContext* context, TfLiteNode* node),
    TfLiteStatus (*invoke)(TfLiteContext* context, TfLiteNode* node),
    void (*free)(TfLiteContext* context, void* buffer),
    void (*reset)(TfLiteContext* context, void* buffer)) {
  return {/*init=*/init,
          /*free=*/free,
          /*prepare=*/prepare,
          /*invoke=*/invoke,
          /*reset*/ reset,
          /*builtin_code=*/0,
          /*custom_name=*/nullptr};
}

r0 still contains the address of the region on the stack that we made for the TFLMRegistration prior to calling Register_FULLY_CONNECTED(), so the passed function pointers and other data are stored at offsets relative to that address.

0005a4f6 <_ZN6tflite5micro10RegisterOpEPFPvP13TfLiteContextPKcjEPF12TfLiteStatusS3_P10TfLiteNodeESC_PFvS3_S1_ESE_>:
   5a4f6:	b510      	push	{r4, lr}
   5a4f8:	60c3      	str	r3, [r0, #12]
   5a4fa:	9b03      	ldr	r3, [sp, #12]
   5a4fc:	6001      	str	r1, [r0, #0]
   5a4fe:	6103      	str	r3, [r0, #16]
   5a500:	2300      	movs	r3, #0
   5a502:	9902      	ldr	r1, [sp, #8]
   5a504:	e9c0 3305 	strd	r3, r3, [r0, #20]
   5a508:	e9c0 1201 	strd	r1, r2, [r0, #4]
   5a50c:	bd10      	pop	{r4, pc}

When returning back out to setup(), the TFLMRegistration on the stack needs to be copied into the resolver's registrations_ array. The resolver was instantiated with static storage duration, meaning it is not stored on the stack. This is necessary because the OpResolver used with a tflite-micro interpreter must live at least as long as the interpreter itself.

The resolver exists in memory next to the static_interpreter.

20016278 <_ZGVZ5setupE18static_interpreter>:
20016278:	00000000 	andeq	r0, r0, r0

2001627c <_ZZ5setupE18static_interpreter>:
	...

20016340 <_ZGVZ5setupE8resolver>:
20016340:	00000000 	andeq	r0, r0, r0

20016344 <_ZZ5setupE8resolver>:
	...

The bottom section of the setup() dump includes addresses of these objects. After constructing the TFLMRegistration on the stack, we subsequently copy it to the registrations_ member array of the resolver.

   519ec:	4e25      	ldr	r6, [pc, #148]	; (51a84 <setup+0x168>)
   519ee:	ac05      	add	r4, sp, #20
   519f0:	cc0f      	ldmia	r4!, {r0, r1, r2, r3}
   519f2:	c60f      	stmia	r6!, {r0, r1, r2, r3}
   519f4:	e894 0007 	ldmia.w	r4, {r0, r1, r2}
   519f8:	2301      	movs	r3, #1
   519fa:	e886 0007 	stmia.w	r6, {r0, r1, r2}

Now that we have an idea of how each operator is registered, we can see why registering only those that are strictly necessary can have a significant impact on whether performing inference on a microcontroller is feasible. For example, additionally registering the five operators (AVERAGE_POOL_2D, CONV_2D, DEPTHWISE_CONV_2D, RESHAPE, and SOFTMAX) used by the person_detection example from our last post increases ROM usage by more than 20 KB.

Memory region         Used Size  Region Size  %age Used
           FLASH:       76708 B       192 KB     39.02%
             RAM:        7792 B       168 KB      4.53%
        IDT_LIST:          0 GB        32 KB      0.00%

We’ve focused on optimizing the footprint of the tflite-micro framework in this post, while happily ignoring the impact of the size and structure of the model itself. Typically, models will be much larger than the 2488 bytes occupied by the simple one used in the hello_world example. In upcoming posts, we’ll start to look at model optimization techniques and see how they impact both size and runtime performance.