Benchmarking

Every model bundle on AI2GO has been run on a reference device to determine its speed and memory usage. These benchmark numbers are used to populate the AI2GO model bundle tuner.

Procedure

Model bundle performance is measured as follows:

  1. The model_benchmark sample (C version) is compiled for the target device using the Makefile distributed in the SDK.

  2. The libxnornet.so (the C bindings shared object) for the model bundle being measured is placed alongside the compiled model_benchmark sample, so that the benchmark loads that bundle.

  3. The model_benchmark sample is invoked with the following arguments:

    • --warm-up-iterations=20: Evaluate the model 20 times before measuring. This helps control for non-deterministic timing caused by features of modern CPUs such as memory and instruction caching and branch prediction.

    • --max-benchmark-iterations=100: After warming up, evaluate the model 100 times and measure each evaluation.

    • --max-benchmark-duration=999999999: Set the time limit high enough that it never takes effect, so that every model is evaluated for the full 100 iterations.
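The combined effect of these three arguments can be sketched as a simple loop. The following is a minimal Python illustration, not the source of the model_benchmark sample: the function run_benchmark, its parameter names, and the evaluate callable are hypothetical stand-ins, and the sketch assumes the duration limit is expressed in seconds.

```python
import time


def run_benchmark(evaluate, warm_up_iterations=20,
                  max_benchmark_iterations=100,
                  max_benchmark_duration=999999999.0):
    """Return the latency in seconds of each measured call to evaluate().

    Hypothetical sketch: `evaluate` stands in for one model evaluation,
    and the duration limit is assumed to be in seconds.
    """
    # Warm-up evaluations are executed but never timed.
    for _ in range(warm_up_iterations):
        evaluate()

    latencies = []
    started = time.perf_counter()
    for _ in range(max_benchmark_iterations):
        # Stop early only if the overall time budget has been used up.
        if time.perf_counter() - started >= max_benchmark_duration:
            break
        t0 = time.perf_counter()
        evaluate()
        latencies.append(time.perf_counter() - t0)
    return latencies


if __name__ == "__main__":
    calls = []
    results = run_benchmark(lambda: calls.append(1))
    # 20 untimed warm-up calls plus 100 timed calls.
    print(len(calls), len(results))
```

With the very large duration limit, the loop is bounded only by the iteration count, which is why every model bundle receives the same number of measured evaluations.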

Measured values

The model_benchmark sample reports the lowest latency (that is, the fastest model evaluation) observed across the measured iterations. Many factors, such as CPU load, thermal throttling, and memory pressure, can increase execution time, so the minimum is the value least affected by them and gives the most reliable benchmark result. It provides an estimate of the best-case evaluation speed of the model bundle.
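To illustrate why the minimum is the more stable statistic, the short Python sketch below compares the minimum and the mean of two sets of latencies. The numbers are invented for illustration only and are not benchmark results; the helper summarize is likewise hypothetical.

```python
import statistics


def summarize(latencies_ms):
    """Return the minimum and mean of a list of latencies (milliseconds)."""
    return {"min": min(latencies_ms), "mean": statistics.mean(latencies_ms)}


# Made-up values: the same workload on a quiet system and on a system
# where two evaluations were slowed down (for example by throttling).
quiet = [10.2, 10.1, 10.3, 10.1, 10.2]
noisy = [10.2, 10.1, 31.7, 10.1, 24.9]

if __name__ == "__main__":
    print("quiet:", summarize(quiet))
    print("noisy:", summarize(noisy))
```

In this example the two slow evaluations raise the mean considerably, while the minimum is identical for both runs.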

The sample also measures the peak resident set size (RSS) of the model_benchmark process. Resident set size is the amount of memory allocated by the program that is currently held in RAM, i.e. not held in swap space or on a persistent storage device. For many of the hardware targets this is equivalent to the total memory usage of the program. On other targets (such as macOS or Linux), it nevertheless provides a good approximation of how much RAM the program is actively using at any one time. Because the benchmark program itself has little overhead, the maximum RSS observed during its lifetime is a good estimate of the memory requirements of the model bundle.
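On Unix-like systems, a process can read its own peak RSS through getrusage. The Python sketch below shows one way to do so; it is an illustration of the metric, not necessarily how the model_benchmark sample obtains it, and the helper name peak_rss_bytes is hypothetical.

```python
import resource
import sys


def peak_rss_bytes():
    """Peak resident set size of the current process, in bytes."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux but in bytes on macOS.
    if sys.platform == "darwin":
        return peak
    return peak * 1024


if __name__ == "__main__":
    before = peak_rss_bytes()
    # Fill 64 MiB with non-zero bytes so the pages are actually touched
    # and therefore become resident.
    buffer = b"\x01" * (64 * 1024 * 1024)
    after = peak_rss_bytes()
    print(f"peak RSS before: {before / 2**20:.1f} MiB")
    print(f"peak RSS after:  {after / 2**20:.1f} MiB")
```

Because the value is a high-water mark, it never decreases during the lifetime of the process, even after the buffer is released.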

Note

Performance metrics are measured using the C bindings as a baseline. Other language bindings are expected to have some overhead, so (for example) the Python version of the model_benchmark sample may measure a higher RSS value and slightly higher latency than the C version.

Reference devices

For each hardware target, the benchmark is run on a specific reference device to obtain the numbers displayed in the model selector on AI2GO:

Raspberry Pi Zero

Off-the-shelf Raspberry Pi Zero

  • CPU: Single-core ARM1176 (ARMv6) @ 700MHz

  • RAM: 512MB

  • OS: Raspbian GNU/Linux 9 (Stretch)

Raspberry Pi 3

Off-the-shelf Raspberry Pi 3 B+

  • CPU: Four-core ARM Cortex-A53 (ARMv8) @ 1.4GHz

  • RAM: 1GB

  • OS: Raspbian GNU/Linux 9 (Stretch)

Toradex Apalis iMX6

Apalis iMX6Q 1GB

  • CPU: Four-core ARM Cortex-A9 (ARMv7-A) @ 996MHz

  • RAM: 1GB DDR3

  • OS: Toradex Embedded Linux (Yocto-based)

Linux x86-64

ASUS PRO P5440UF-XB74 Laptop

  • CPU: Intel Core i7-8550U @ 1.80GHz

  • RAM: 15.55GB DDR4

  • OS: Ubuntu 16.04.5 LTS