armnn Odroid N2+ @ 2.2 GHz slower than Raspberry Pi 4 @ 1.5 GHz

I've noticed a curious thing when benchmarking ResNet50 via ArmNN v21.11 with the Neon backend on Odroid N2+ @ 2208 MHz and Raspberry Pi 4 @ 1500 MHz. Despite the 47% higher clock frequency, N2+ is actually about 9% slower than RPi4: 342 ms vs 315 ms. During execution, N2+ only uses its 4 big cores at ~50% and does not use its 2 LITTLE cores at all, while RPi4 uses its 4 big cores at ~100%.
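To make the mismatch concrete, the latencies and clock frequencies quoted above can be turned into relative differences and an approximate cycle count per inference. This is only a sketch based on the numbers in this post: the helper names `pct_diff` and `cycles_per_inference` are hypothetical, and the cycle estimate assumes each board sustained its nominal frequency for the whole run, which has not been verified.

```python
def pct_diff(new, base):
    """Relative difference of `new` versus `base`, in percent."""
    return (new - base) / base * 100.0


def cycles_per_inference(latency_ms, freq_mhz):
    """Approximate clock cycles per inference, in millions of cycles.

    Assumption: the nominal frequency was sustained during the run
    (no throttling or DVFS), which was not checked.
    """
    # milliseconds -> seconds, multiplied by MHz, gives millions of cycles
    return latency_ms * 1e-3 * freq_mhz


# (latency in ms, frequency in MHz), as quoted in the post
boards = {
    "Odroid N2+": (342.0, 2208.0),
    "Raspberry Pi 4": (315.0, 1500.0),
}

for name, (latency_ms, freq_mhz) in boards.items():
    mcycles = cycles_per_inference(latency_ms, freq_mhz)
    print(f"{name}: {latency_ms:.0f} ms @ {freq_mhz:.0f} MHz ~ {mcycles:.0f} Mcycles")

print(f"clock difference:   {pct_diff(2208.0, 1500.0):+.1f}%")
print(f"latency difference: {pct_diff(342.0, 315.0):+.1f}%")
```

On the quoted figures this works out to roughly 755 million cycles per inference on the N2+ against roughly 473 million on the RPi4, so the N2+ appears to spend about 60% more cycles on the same model.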

You can follow this Jupyter notebook to reproduce with the following updates to the Performance measurement commands:

ck run cmdgen:benchmark.image-classification.tflite-loadgen --verbose \
--model=resnet50 --scenario=singlestream --mode=performance \
--library=armnn-v21.11-neon --sut=odroid --target_latency=340

ck run cmdgen:benchmark.image-classification.tflite-loadgen --verbose \
--model=resnet50 --scenario=singlestream --mode=performance \
--library=armnn-v21.11-neon --sut=rpi4coral --target_latency=310

Dec 16 '21 10:12 psyhtest

I've got another interesting observation comparing two Raspberry Pi 4s: 322 ms on a new Model B Rev 1.4 8GB @ 1800 MHz vs 315 ms on a Model B Rev 1.1 4GB @ 1500 MHz. It seems the higher the frequency, the worse the latency gets. On the new RPi4, utilization is about 93%.
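For anyone wanting to repeat the utilization measurement, per-core figures can be derived from two snapshots of `/proc/stat` on Linux. The following is a minimal sketch, not the method used for the numbers above: the function names are hypothetical, and iowait is counted as idle time, which is an assumption.

```python
import time


def parse_proc_stat(text):
    """Return {core_name: (busy_jiffies, total_jiffies)} for each cpuN line."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        # Keep per-core lines ("cpu0", "cpu1", ...); skip the aggregate "cpu" line.
        if not fields or not fields[0].startswith("cpu") or fields[0] == "cpu":
            continue
        # user, nice, system, idle, iowait, irq, softirq, steal.
        # Guest fields are skipped because they are already included in user/nice.
        values = [int(v) for v in fields[1:9]]
        idle = values[3] + (values[4] if len(values) > 4 else 0)
        total = sum(values)
        stats[fields[0]] = (total - idle, total)
    return stats


def utilization(before, after):
    """Percent of busy time per core between two parsed snapshots."""
    result = {}
    for core, (busy1, total1) in after.items():
        busy0, total0 = before.get(core, (0, 0))
        elapsed = total1 - total0
        result[core] = 100.0 * (busy1 - busy0) / elapsed if elapsed > 0 else 0.0
    return result


def sample(interval_s=1.0, path="/proc/stat"):
    """Measure per-core utilization over `interval_s` seconds (Linux only)."""
    with open(path) as f:
        before = parse_proc_stat(f.read())
    time.sleep(interval_s)
    with open(path) as f:
        after = parse_proc_stat(f.read())
    return utilization(before, after)
```

Calling `sample(5.0)` in a second terminal while the benchmark is running should show which cores are busy, for example whether the two LITTLE cores on the N2+ stay idle.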

Jan 03 '22 22:01 psyhtest

Thanks Anton - would you be able to attach the Arm NN event profiles from the two runs? Perhaps for some reason sub-optimal kernels are being selected, and we should be able to see that in the profiles.

Jan 04 '22 10:01 MatthewARM

@MatthewARM, how do I dump event profiles? Can I do it from a release build?

Jan 23 '22 22:01 psyhtest

Sorry Anton, somehow I missed this message.

If you're still curious:

I think it's enabled with the "-e" option to ExecuteNetwork, if that's the tool being used for the benchmark? It's absolutely available in a release build.

If CK is using the Arm NN API then there are good instructions in the answer to #464

Hope that helps, Matthew

Jul 04 '22 09:07 MatthewARM

Thank you @MatthewARM. For the official MLPerf Inference v2.0 submission, we measured 339 ms for Odroid N2+ and 349 ms for Raspberry Pi 4. Seems like an RPi4 regression to me (314/349 => -10%).

Jul 11 '22 15:07 psyhtest

Hi @psyhtest,

Thank you for getting in touch. As this was a while ago, it's quite possible that the regression could have been fixed or even improved upon as there are always optimizations being made to the Neon backend.

Would it be possible to run the tests using the latest version of Arm NN and Arm Compute Library with profiling enabled? If the issue is still occurring, we would be more than happy to take a look at your profiling results, as Matthew mentioned above, to see if it's related to sub-optimal kernels being selected.

Kind regards,

Matthew

Feb 03 '23 15:02 matthewsloyanARM

I'm wondering whether different memory bandwidth across these chips could explain these differences.
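One rough way to probe this hypothesis would be to time a large in-memory copy on each board and compare the results. The sketch below is an assumption-laden illustration rather than a proper bandwidth benchmark: `copy_bandwidth_gbps` is a hypothetical helper, it runs on a single thread, and it measures the throughput of one buffer-to-buffer copy as seen from Python, not the peak DRAM bandwidth of the SoC.

```python
import time


def copy_bandwidth_gbps(size_mb=64, repeats=5):
    """Best-of-N throughput of copying a `size_mb` MiB buffer, in GB/s.

    Counts bytes read plus bytes written. Single-threaded and approximate.
    """
    n = size_mb * 1024 * 1024
    src = bytearray(b"\xa5" * n)  # non-zero fill so the source pages are really allocated
    dst = bytearray(n)            # preallocated destination, reused on every repeat
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst[:] = src              # same-size slice assignment: one bulk copy
        best = min(best, time.perf_counter() - start)
    best = max(best, 1e-9)        # guard against a zero reading from a coarse timer
    return 2 * n / best / 1e9


if __name__ == "__main__":
    print(f"copy throughput: {copy_bandwidth_gbps():.2f} GB/s")
```

The buffer should be much larger than the last-level cache for the number to say anything about main memory. If the N2+ turned out clearly lower than the RPi4 here, that would be consistent with the bandwidth explanation, though it would not prove it.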

Feb 03 '23 15:02 MatthewARM

Closed due to inactivity. If this is still an issue for you, please reopen the issue or create a new one. Best regards, Mike.

Aug 14 '23 12:08 MikeJKelly