Using multi-cores in parallel for ARMNN v21.05

Open supratimc239 opened this issue 2 years ago • 9 comments

Hi,

I am using taskset with bitmap value 7 (selecting cores 0, 1 and 2) while running my executable VdaNet on an octa-core device. But when I look at the perfetto trace, I can see that the executable was scheduled mainly on core 1. For information, the network was created with the 'CpuAcc' option, and I can see from the prints I added to the code that Neon optimization has been applied on the back-end.

My question is: without explicitly introducing multi-threading through source code changes, is there any way to configure ARMNN/ACL so that my executable runs on more than one core in parallel?
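For reference, a taskset bitmap can be decoded mechanically: bit i of the mask selects CPU i, so 7 (binary 111) selects cores 0, 1 and 2. The following is a minimal C++ sketch of that decoding and its inverse. The function names cpus_in_mask and mask_for_cpus are hypothetical helpers written for illustration; they are not part of taskset, Arm NN or ACL.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Return the zero-based CPU indices selected by a taskset-style bitmask.
// Bit i set means CPU i is allowed. (Hypothetical helper.)
std::vector<int> cpus_in_mask(std::uint64_t mask)
{
    std::vector<int> cpus;
    for (int cpu = 0; mask != 0; ++cpu, mask >>= 1)
    {
        if (mask & 1u)
        {
            cpus.push_back(cpu);
        }
    }
    return cpus;
}

// Build a bitmask from a list of CPU indices. (Hypothetical helper.)
std::uint64_t mask_for_cpus(const std::vector<int>& cpus)
{
    std::uint64_t mask = 0;
    for (int cpu : cpus)
    {
        assert(cpu >= 0 && cpu < 64);  // a 64-bit mask can only describe CPUs 0..63
        mask |= (std::uint64_t{1} << cpu);
    }
    return mask;
}
```

For example, cpus_in_mask(0x7) yields cores 0, 1 and 2, and mask_for_cpus({4, 5, 6}) yields 0x70.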

image

supratimc239 avatar Mar 29 '22 06:03 supratimc239

@morgolock can you help at all here? Might the Compute Library scheduler have conflicts with taskset?

MatthewARM avatar Mar 29 '22 13:03 MatthewARM

Hi @supratimc239

Would you please try running your binary without taskset and share the results with us?

At a lower level, ACL will try to detect the number of cores and run the kernels on multiple cores. If you have 8 cores, ACL will schedule work on all of them and try to keep the CPU as busy as possible.

If this is not happening there may be a problem in the CPU detection code in ACL. What OS are you running on? It looks like Android.
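As a quick independent check of what the OS reports to a process, the sketch below compares the CPU count seen by the C++ runtime with the number of CPUs the current affinity mask allows. It assumes a Linux or Android target (it relies on sched_getaffinity from <sched.h>), and it is not ACL's own detection code; reported_cpus and allowed_cpus are hypothetical helper names.

```cpp
#include <sched.h>   // sched_getaffinity, cpu_set_t, CPU_ZERO, CPU_COUNT (Linux/Android)
#include <thread>

// Logical CPUs reported by the C++ runtime; may be 0 if the value is unknown.
unsigned int reported_cpus()
{
    return std::thread::hardware_concurrency();
}

// Number of CPUs the calling process is currently allowed to run on,
// or -1 if the affinity mask could not be read.
int allowed_cpus()
{
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
    {
        return -1;
    }
    return CPU_COUNT(&set);
}
```

Printing both values once with and once without taskset shows whether the affinity mask changes what the process sees; how ACL reacts to that is a separate question that this sketch does not answer.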

morgolock avatar Mar 29 '22 14:03 morgolock

Hi @morgolock ,

Thank you for your email. I have executed my model without enabling CPU affinity, and the new perfetto trace follows. As you can see, the application mostly executes on CPU7; only for a brief duration was it scheduled on CPU4. But it was never executed on more than one CPU at a time.

image

Also, could you please tell me why the model runs on CPU1 and CPU2 at the very beginning? I have also seen this happen when using a core affinity bitmap.

Yes, I am running my application on Android. Thanks

supratimc239 avatar Mar 29 '22 23:03 supratimc239

Hi @supratimc239

I've just noticed you are using 21.05, could you please update to 22.02 and try again?

Can you provide the following information:

  • How do you run the model? With ExecuteNetwork? Can you share the command to run the model?
  • Have you tried the arm_compute_validation test? This program prints the CPU cores detected by ACL before running, so we could see whether ACL is actually detecting 8 cores.
  • Can you provide more details about your Android version?

For more details about arm_compute_validation please see https://arm-software.github.io/ComputeLibrary/latest/tests.xhtml#tests_running_tests

Hope this helps.

morgolock avatar Mar 30 '22 09:03 morgolock

Hi @morgolock,

Thank you for your suggestion.

I upgraded my ACL and ARM-NN to v22.02, but the ARM-NN build is failing. Could you please have a look at my latest comment in https://github.com/ARM-software/armnn/issues/607 and see why the ARM-NN build is failing?

Once I can execute my model on v22.02, I will start checking whether parallel multi-core execution is working.

Thanks

supratimc239 avatar Apr 05 '22 00:04 supratimc239

Hi @morgolock ,

I have upgraded my ACL and ARM-NN to v22.02, and I still don't see parallel execution of my network on multiple cores in the perfetto trace. When I don't set affinity, the network does move between several cores, but it is always executing on only one core at a time.

no_affinity

I saw the same when I set the taskset bitmap to 0x70 (all middle cores):

medium_core

And when I set the bitmap to 0x0F (all small cores):

small_core

I am yet to run the arm_compute_validation test, but even without that, any idea why the network is not executed on multiple cores in parallel? I run my model using the ExecuteNetwork framework. The device I am testing on runs Android 12, and I am building armnn with Android NDK r20b.

Thanks

supratimc239 avatar Apr 06 '22 07:04 supratimc239

I'm thinking that the core-count detection code in Compute Library might be going wrong on this device. There is a way for Arm NN to override that detection, but it's not well documented. It's used here:

https://github.com/ARM-software/android-nn-driver/blob/558a1d4ed904f5f7d04781bc3405ee77669563d0/1.3/ArmnnDriverImpl.cpp#L200

@supratimc239 I would suggest either trying arm_compute_validation, as suggested by @morgolock, to confirm this, or using the ModelOptions API as above to set the number of threads explicitly.

From your other issue, it sounds like you have three core clusters, and one of them has 1 core. I'm vaguely suspicious that maybe the Compute Library is just detecting the 1-core cluster. I may be completely wrong.

MatthewARM avatar Apr 06 '22 08:04 MatthewARM

Hi @MatthewARM ,

Result of overriding core detection:

As per your suggestion, I added "NumberOfThreads" while configuring the backend to override the default detection:

    auto backendOptions = armnn::BackendOptions{"CpuAcc", { {"FastMathEnabled", true}, {"NumberOfThreads", 20} } }

I tested with three separate values (0 for the default, 10 and 20). Unfortunately, in none of the cases could I see parallel execution on multiple cores. Another strange thing is that in none of the cases does the model execute on core #7.

This is the perfetto trace for the default case: Thread#default

This is the perfetto trace for 10 threads: Thread#10

This is the perfetto trace for 20 threads: Thread#20

I am wondering whether I have to use any special option to enable threading in the ACL or ARMNN build. The following is my ComputeLibrary build command:

    scons arch=arm64-v8a neon=1 opencl=1 embed_kernels=1 extra_cxx_flags="-fPIC" benchmark_tests=0 validation_tests=0 os=android

arm_compute_validation tests:

Could you please provide more information on arm_compute_validation: how to build, load and execute these validation tests, and how one can find out the CPU cores detected by ACL?

Also, if possible, please have a look at https://github.com/ARM-software/armnn/issues/635. I am puzzled why the v22.02 library is taking more time than v21.05. Am I missing some options that would make the model execution faster?

Thanks

supratimc239 avatar Apr 07 '22 08:04 supratimc239

Hi @MatthewARM and @morgolock ,

I have done some more experiments with threads in v21.05 (I reverted since the v22.02 processing time is much higher than that of v21.05, as mentioned in https://github.com/ARM-software/armnn/issues/635).

This is what I see:

  1. From the perfetto traces (pictures of the traces for the default, 20-thread and 64-thread cases are attached below), it is clear that we are using multi-threaded execution. You can see parallel execution on different cores in the cases where I explicitly provide the number of threads, but no parallel execution in the default case.

     Perfetto trace (default): Thread#default

     Perfetto trace (number of threads = 20): Thread#20

     Perfetto trace (number of threads = 64): Thread#64

  2. Processing time, however, increases as we increase the number of threads:

                               Default    Thread#20    Thread#64
                               -------    ---------    ---------
     Processing time (in ns)   258.801    9,809.855    13,079.96

Could you please tell me why, despite multi-core execution, the processing time increases instead of decreasing? Ideally, you would expect the processing time to come down.
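To put the reported numbers in proportion, the slowdown relative to the default run can be computed directly from the three processing times given above. The sketch below is plain arithmetic on those figures; the helper name slowdown is hypothetical, and the units are taken as reported.

```cpp
#include <cassert>

// Ratio of a measured time to the baseline time; values above 1 mean slower.
// (Hypothetical helper for illustration.)
double slowdown(double baseline, double measured)
{
    assert(baseline > 0.0);  // the ratio is undefined for a non-positive baseline
    return measured / baseline;
}
```

With the values from this post, slowdown(258.801, 9809.855) is about 37.9 and slowdown(258.801, 13079.96) is about 50.5, so the 20-thread and 64-thread runs are roughly 38 and 50 times slower than the default run.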

Thanks

supratimc239 avatar Apr 13 '22 01:04 supratimc239

Hi @supratimc239

Is this still a problem for you?

"Could you please provide me more information on arm_compute_validation - how to build, load and execute these validations tests and how one can find out CPU cores detected by ACL."

To build the validation suite, you just need to pass the option validation_tests=1 when building ACL. For example:

    scons os=linux opencl=0 asserts=1 examples=0 neon=1 arch=armv8a validation_tests=1 -j8

Then, when you run the tests, ACL will detect and print the number of CPU cores present and the CPU features (bf16, dotprod, sve, sve2, etc.) at the beginning of the execution:

LD_LIBRARY_PATH=.:$LD_LIBRARY_PATH ./arm_compute_validation --stop-on-error
Version = arm_compute_version=v0.0-unreleased Build options: {'os': 'linux', 'opencl': '1', 'asserts': '1', 'examples': '1', 'neon': '1', 'arch': 'armv8a', 'benchmark_examples': '0', 'multi_isa': '0', 'debug': '0', 'validation_tests': '1', 'test_filter': 'FFT.cpp'} Git hash=b'e37c31809294b62ff729e7dc49b21189b3e3262f'
CommandLine = ./arm_compute_validation --stop-on-error 
Seed = 2554607086
CL_DEVICE_VERSION = OpenCL 2.0 not_released.51d50be.1502459415db9c37cbcbea279386cb09
cpu_has_sve = false
cpu_has_sve2 = false
cpu_has_svef32mm = false
cpu_has_svei8mm = false
cpu_has_svebf16 = false
cpu_has_sme = false
cpu_has_sme2 = false
cpu_has_fp16 = false
cpu_has_bf16 = false
cpu_has_dotprod = false
cpu_has_i8mm = false
CPU0 = A73
CPU1 = A73
CPU2 = A73
CPU3 = A73
Iterations = 1
Threads = 1
Dataset mode = PRECOMMIT
Running [0] 'UNIT/CPPScheduler/RethrowException'
  Wall clock/Wall clock time:    AVG=3892.0000 us
Running [1] 'UNIT/GPUTarget/GetGPUTargetFromName'
  Wall clock/Wall clock time:    AVG=284.0000 us
Running [2] 'UNIT/GPUTarget/GPUTargetIsIn'
  Wall clock/Wall clock time:    AVG=0.0000 us
Running [3] 'UNIT/LifetimeManager/MemoryGroupRegister
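The number of cores ACL detected can be read off the "CPUn = name" lines in the output above (four A73 cores in this sample). If the log is captured to a file or string, a small C++ helper along the following lines could count them. It is a sketch that assumes the line format shown in this sample output; count_detected_cpus is a hypothetical name, not an ACL function.

```cpp
#include <cctype>
#include <cstddef>
#include <sstream>
#include <string>

// Count lines of the form "CPU<digits> = <name>" in a captured validation log.
// (Hypothetical helper; the format is assumed from the sample output.)
int count_detected_cpus(const std::string& log)
{
    std::istringstream in(log);
    std::string line;
    int count = 0;
    while (std::getline(in, line))
    {
        if (line.rfind("CPU", 0) != 0)
        {
            continue;  // line does not start with "CPU"
        }
        std::size_t i = 3;
        while (i < line.size() && std::isdigit(static_cast<unsigned char>(line[i])))
        {
            ++i;  // skip the core index digits
        }
        if (i > 3 && line.compare(i, 3, " = ") == 0)
        {
            ++count;
        }
    }
    return count;
}
```

Lowercase feature lines such as "cpu_has_sve = false" and lines such as "CommandLine = ..." are not counted, because they do not start with "CPU" followed by a digit.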

For more information about how to build ACL, please see the documentation: https://arm-software.github.io/ComputeLibrary/latest/how_to_build.xhtml

morgolock avatar Dec 18 '23 17:12 morgolock