Query regarding support of Executorch for ARM Ethos-U65 backend
Hi,
I have started working with Executorch, and in this section on launching Executorch on ARM Ethos-U, I have observed that only Ethos-U55 and Ethos-U85 are mentioned. The model conversion script and setup utilities likewise list only Ethos-U55 and Ethos-U85 variants as supported targets.
But I am trying to work with an Ethos-U65-based system, so does Executorch only support the variants mentioned above, or does it also support Ethos-U65?
cc @digantdesai @freddan80 @per @zingo @oscarandersson8218
@zingo @Erik-Lundell @digantdesai can any of you answer?
Hi @vikasbalaga, thx for your interest in Executorch and Ethos-U 🥇 Ethos-U65 is supported with Executorch as well, but we haven't given it too much love yet, for a couple of reasons: 1) there's no FVP to test it on, and 2) the AOT flow is very similar to Ethos-U55's. Ethos-U65 is supported conceptually; it just needs some plumbing, for example in the list you mention (model conversion script) and in ArmCompileSpecBuilder. There are some more places as well. If you are happy to give it a go, we can support you. Just push a PR and tag us (@digantdesai @freddan80 @per @zingo @oscarandersson8218). We'll give Ethos-U65 more attention in the medium term.
The runtime flow is slightly different. Ethos-U65 sits on an "ML island" (an embedded Cortex-M + Ethos-U subsystem) as part of a larger system (Cortex-A, rich OS). That means the Executorch runtime should run on the ML island, and your application calling into the Executorch runtime needs to communicate somehow with the Cortex-A system. That means of communication could build on e.g. the ethos-u-linux-driver-stack. Some (not too big) modifications will probably be needed for Executorch workloads.
Hope this helps 👍
@freddan80 and others, thanks for your quick response.
> The runtime flow is slightly different. Ethos-U65 sits on an "ML island" (an embedded Cortex-M + Ethos-U subsystem) as part of a larger system (Cortex-A, rich OS)...
Yes, in my case it is (Cortex-A55, OS) plus a (Cortex-M33 + Ethos-U65) ML island, and I also have a hardware setup available, so I don't need an FVP.
> There are some more places as well. If you are happy to give it a go, we can support you
Yes, I am interested in trying it. I could find the following places that require modification:
- model conversion script
- ArmCompileSpecBuilder
- run.sh (it looks like some config is being done there; not sure what it should be for Ethos-U65)
So, could you help me in finding other modifications that are required?
Also, (I think this is a naive question), will this Executorch implementation work for my CPU Cortex-M33?
> So, could you help me in finding other modifications that are required?
I'd start with those and debug from there. The important thing is that the arguments in the call to vela look right (we can help check that).
> run.sh (it looks like some config is being done there; not sure what it should be for Ethos-U65)
The run.sh script will AOT compile, build, and run an inference using the FVP. In your case you'd probably be happy just generating a .pte AOT (with arm_aot_compile.py), then using build_executorch_runner.sh, modified under the hood to use your application code, linker script and startup code (rather than the Corstone FVP's), to produce an .elf. Or perhaps it's even better to modify the cmake build to adapt to your setup.
Note that you'd want to use a 'vela.ini' file that fits your system config, and provide that to build_executorch_runner.sh. You probably already have such a file with your dev board SDK?
@freddan80,
I have modified arm_aot_compile.py and I think I am able to generate a *.pte model for the Ethos-U65 backend. I picked the configuration details based on my hardware type. (I have forked the repo and committed my changes to a private branch for your reference.)
> then using build_executorch_runner.sh, modified under the hood to use your application code...
I tried modifying build_executorch_runner.sh for my system (Cortex-M33 CPU and Ethos-U65 NPU), but I am observing cmake errors:
```
CMake Error at CMakeLists.txt:408 (message):
  Unsupported SYSTEM_CONFIG: Ethos_U65_High_End
```
I tried to debug it, but I couldn't understand how to update the "NPU timing adapters" per the Ethos-U65 requirements. Also, it looks like we need to specify a TARGET_BOARD corresponding to Corstone-300/320, but since in this case there is no simulator, how can I set that macro?
> I have forked the repo and committed my changes to a private branch for your reference
Would it be possible to share the changes?
I assigned this to @AdrianLundell. He'll help you.
Hi, nice work so far!
The examples/arm/executor_runner code and the related CMake scripts used to build it should be viewed as an example to get you started when building your own application. The build_executorch_runner.sh script and all the flags containing target-specific info in the runtime flow are there to make this example convenient to run and to help our testing, rather than being an official API.
For example, the timing adapters and the related macros TARGET_BOARD, SYSTEM_CONFIG and MEMORY_MODE which you mention are only relevant for the simulators, so to answer your question, you can ignore those completely. The relevant parts of this CMake script are the linking of the libraries and the conversion of the .pte to a header file; with that done, you can approach this like writing any other application for the U65.
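To make that concrete, here is a minimal sketch of what such an application's core could look like, assuming a reasonably recent Executorch runtime API; the model_pte/model_pte_len symbols stand in for whatever your .pte-to-header conversion actually generates, and the pool sizes are placeholders to adapt to your memory budget:

```cpp
#include <cstddef>
#include <cstdint>

#include <executorch/extension/data_loader/buffer_data_loader.h>
#include <executorch/runtime/executor/method.h>
#include <executorch/runtime/executor/program.h>
#include <executorch/runtime/platform/runtime.h>

using executorch::extension::BufferDataLoader;
using executorch::runtime::Error;
using executorch::runtime::HierarchicalAllocator;
using executorch::runtime::MemoryAllocator;
using executorch::runtime::MemoryManager;
using executorch::runtime::Method;
using executorch::runtime::Program;
using executorch::runtime::Result;
using executorch::runtime::Span;

// Placeholder symbols: the .pte-to-header conversion defines the real names.
extern const uint8_t model_pte[];
extern const size_t model_pte_len;

// Placeholder pool sizes: size these to your model's needs.
static uint8_t method_pool[16 * 1024];
static uint8_t planned_pool[8 * 1024];

int run_model() {
  executorch::runtime::runtime_init();

  // Load the program straight from the linked-in buffer.
  BufferDataLoader loader(model_pte, model_pte_len);
  Result<Program> program = Program::load(&loader);
  if (!program.ok()) {
    return -1;
  }

  // Method allocator: runtime bookkeeping. Planned memory: the
  // memory-planned activation buffers described in the .pte.
  MemoryAllocator method_allocator(sizeof(method_pool), method_pool);
  Span<uint8_t> planned_spans[] = {{planned_pool, sizeof(planned_pool)}};
  HierarchicalAllocator planned_memory({planned_spans, 1});
  MemoryManager memory_manager(&method_allocator, &planned_memory);

  Result<Method> method = program->load_method("forward", &memory_manager);
  if (!method.ok()) {
    return -1;
  }

  // Inputs would be set here via method->set_input(...) before executing.
  return method->execute() == Error::Ok ? 0 : -1;
}
```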
The simulator is of course very useful when developing, so if you have not done so, I would suggest starting by testing your model and executor_runner on the U55 using the Corstone-300 target, and moving to the U65 once that works.
> I would suggest starting by testing your model and executor_runner on the U55 using the Corstone-300 target
I have tried performing inference on the Corstone-300 FVP by following the steps mentioned here. With this I am able to perform inference on the FVP (I tried a simple model with an "ADD" operation):
```
I [executorch:arm_perf_monitor.cpp:133] NPU Inferences : 1
I [executorch:arm_perf_monitor.cpp:134] Profiler report, CPU cycles per operator:
I [executorch:arm_perf_monitor.cpp:138] ethos-u : cycle_cnt : 0 cycles
I [executorch:arm_perf_monitor.cpp:145] Operator(s) total: 0 CPU cycles
I [executorch:arm_perf_monitor.cpp:151] Inference runtime: 2585 CPU cycles total
I [executorch:arm_perf_monitor.cpp:153] NOTE: CPU cycle values and ratio calculations require FPGA and identical CPU/NPU frequency
I [executorch:arm_perf_monitor.cpp:162] Inference CPU ratio: 90.10 %
I [executorch:arm_perf_monitor.cpp:166] Inference NPU ratio: 9.90 %
I [executorch:arm_perf_monitor.cpp:175] cpu_wait_for_npu_cntr : 256 CPU cycles
I [executorch:arm_perf_monitor.cpp:180] Ethos-U PMU report:
I [executorch:arm_perf_monitor.cpp:181] ethosu_pmu_cycle_cntr : 411
I [executorch:arm_perf_monitor.cpp:184] ethosu_pmu_cntr0 : 6
I [executorch:arm_perf_monitor.cpp:184] ethosu_pmu_cntr1 : 43
I [executorch:arm_perf_monitor.cpp:184] ethosu_pmu_cntr2 : 3
I [executorch:arm_perf_monitor.cpp:184] ethosu_pmu_cntr3 : 634
I [executorch:arm_perf_monitor.cpp:187] Ethos-U PMU Events:[ETHOSU_PMU_AXI0_RD_DATA_BEAT_RECEIVED, ETHOSU_PMU_AXI1_RD_DATA_BEAT_RECEIVED, ETHOSU_PMU_AXI0_WR_DATA_BEAT_WRITTEN, ETHOSU_PMU_NPU_IDLE]
I [executorch:arm_executor_runner.cpp:630] model_pte_program_size: 2032 bytes.
I [executorch:arm_executor_runner.cpp:631] model_pte_loaded_size: 2032 bytes.
I [executorch:arm_executor_runner.cpp:645] method_allocator_used: 308 / 62914560 free: 62914252 ( used: 0 % )
I [executorch:arm_executor_runner.cpp:652] method_allocator_planned: 64 bytes
I [executorch:arm_executor_runner.cpp:654] method_allocator_loaded: 220 bytes
I [executorch:arm_executor_runner.cpp:655] method_allocator_input: 24 bytes
I [executorch:arm_executor_runner.cpp:656] method_allocator_executor: 0 bytes
I [executorch:arm_executor_runner.cpp:659] temp_allocator_used: 0 / 1048576 free: 1048576 ( used: 0 % )
I [executorch:arm_executor_runner.cpp:675] Model executed successfully.
I [executorch:arm_executor_runner.cpp:679] 1 outputs:
Output[0][0]: (int) 2
Output[0][1]: (int) 2
Output[0][2]: (int) 2
Output[0][3]: (int) 2
Output[0][4]: (int) 2
```
However, I have observed that the arm_executor_runner application is huge (~62 MB). Am I missing something here?
Nope, that is correct; the example just allocates a 60 MB buffer so we can test/use large models out of the box, as the FVP has quite a lot of memory.
See
```cpp
#if !defined(ET_ARM_BAREMETAL_METHOD_ALLOCATOR_POOL_SIZE)
#define ET_ARM_BAREMETAL_METHOD_ALLOCATOR_POOL_SIZE (60 * 1024 * 1024)
#endif
const size_t method_allocation_pool_size =
    ET_ARM_BAREMETAL_METHOD_ALLOCATOR_POOL_SIZE;
unsigned char __attribute__((
    section("input_data_sec"),
    aligned(16))) method_allocation_pool[method_allocation_pool_size];
```
You can either just change the code or set ET_ARM_BAREMETAL_METHOD_ALLOCATOR_POOL_SIZE from cmake as a workaround.
We hope/plan to look into handling this area better, so the pool doesn't end up in the elf and such, but right now it works like this. It's just a bad leftover from when we "forked" from examples/devtools/example_runner/example_runner.cpp :)
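If you go the change-the-code route, it boils down to redefining the macro ahead of the guard shown above; a sketch (the 64 KiB value is an assumption, size it to your model; the add example above used only 308 bytes of the pool):

```cpp
// Example value only: shrink the pool for a memory-constrained target.
#define ET_ARM_BAREMETAL_METHOD_ALLOCATOR_POOL_SIZE (64 * 1024)
```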
> The examples/arm/executor_runner code and the related CMake scripts used to build it should be viewed as an example to get you started when building your own application...
@AdrianLundell, I tried to build my own application by adding CMake scripts, but I have hit a roadblock in providing a custom linker script. I spent almost 2 weeks with not much progress :(
So I gave that up and tried a second approach, where I integrate the Executorch libraries into a "working firmware application" that is available for my board (which comes with its own linker script). With this approach, at least I can see the application launching, but I received the following error message from the Executorch runtime:
```
E [executorch:method.cpp:748] Missing operator: [0] aten::add.out
E [executorch:method.cpp:989] There are 1 instructions don't have corresponding operator registered.See logs for details
```
(I took arm_executor_runner as a reference for my application, and I am using a sample .pte model with only an "add" operation.)
It looks like there is some operator registry in which all the ops need to be registered, but I am not sure how it works. So can I please get some insights into this "operator registration"?
Also, when I tried the examples, it looks like "add" is mapped to the ethos-u delegate, and when I ran without the delegate option I observed similar errors even on the simulator.
Thanks!
Hi, are you still using the example scripts? If yes, you can pass --portable_kernels=aten::add.out to run.sh and it will do the job. The full command is: ./run.sh --model_name=add --aot_arm_compiler_flags="" --portable_kernels=aten::add.out. If not, you can take a look at the backends/arm/scripts/build_portable_kernels.sh script to build the custom operators library and link it to your executable.
Hi, with that option, I can see the issue is fixed on the simulator! But I tried to do the same for the application I built, by invoking the following scripts while building the Executorch libs:
```
./executorch/backends/arm/scripts/build_executorch.sh
./executorch/backends/arm/scripts/build_portable_kernels.sh --portable_kernels="aten::add.out"
```
And here are the libs I am linking to my application (I am linking all the *.a files generated in the arm_test/cmake-out folder):
```
libexecutorch.a
libexecutorch_core.a
libarm_portable_ops_lib.a
libexecutorch_delegate_ethos_u.a
libextension_runner_util.a
libextension_tensor.a
liboptimized_portable_kernels.a
libportable_kernels.a
libportable_ops_lib.a
libquantized_kernels.a
libquantized_ops_lib.a
```
But I am still facing the issue in my application :(
I can confirm that the build logs generated by CMake are identical between the simulator and my application.
Sounds like a good approach to me to start with a known working application and build from there!
For the operator registration: the executorch runtime has no operator implementations by default, so you need to cross-compile them for your Cortex-M target using the executorch/backends/arm/scripts/build_portable_kernels.sh script and link them as described by Juanfi8. You could try debugging with the toolchain tools such as arm-none-eabi-readelf or arm-none-eabi-objdump (under executorch/examples/arm/ethos-u-scratch/arm-gnu-toolchain-13.3.rel1-x86_64-arm-none-eabi/bin/) to inspect your binaries and see what is included. Conceptually there should not be a difference between the simulator and a real application, so I suspect there is an issue in how you build. Are there any more logs you could share, since the error message mentions "See logs for details"?
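For some intuition about what the registration library does, here is a rough conceptual sketch. The exact generated code and signatures differ between Executorch versions, so treat names like add_out_stub as illustrations rather than the real implementation:

```cpp
#include <executorch/runtime/kernel/operator_registry.h>

namespace {

// Stub standing in for the real kernel from libportable_kernels.a
// (kernels/portable/cpu/op_add.cpp in the real build).
void add_out_stub(
    executorch::runtime::KernelRuntimeContext& ctx,
    executorch::runtime::EValue** stack) {
  (void)ctx;
  (void)stack;
}

// Global registration object: its initializer runs at startup and maps the
// schema name to the kernel. Method::load() later resolves "aten::add.out"
// from the .pte against this registry; the "Missing operator" error means
// that lookup failed.
executorch::runtime::Kernel kernels[] = {
    executorch::runtime::Kernel("aten::add.out", add_out_stub)};
executorch::runtime::Error err =
    executorch::runtime::register_kernels({kernels, 1});

} // namespace
```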
However, this error suggests to me that the network has probably not been lowered, since the add operator should be delegated to the Ethos-U rather than run on the CPU. But maybe you have not gotten to that part yet?
> However, this error suggests to me that the network has probably not been lowered, since the add operator should be delegated to the Ethos-U rather than run on the CPU. But maybe you have not gotten to that part yet?...
Yes, I wanted to start slow, by first executing directly on my CPU (Cortex-M33) and then delegating to the Ethos-U65.
I will try to compare both builds to see what I am missing, but one thing I am not sure about: does "kernel not found" indicate that the above-mentioned libs I linked somehow don't contain the ADD operation, or is it related to some configuration (similar to the --portable_kernels=aten::add.out flag) that I missed earlier?
Can you confirm that the above-mentioned libs are sufficient, or am I missing something?
"See logs for details"?...
I just haven't implemented that part yet ;)
I see, it could be a problem with the bindings as well. From examples/arm/CMakeLists.txt:
```cmake
# Generate C++ bindings to register kernels into both PyTorch (for AOT) and
# Executorch (for runtime). Here select all ops in functions.yaml
gen_selected_ops(
  LIB_NAME
  "arm_portable_ops_lib"
  OPS_SCHEMA_YAML
  ""
  ROOT_OPS
  "${EXECUTORCH_SELECT_OPS_LIST}"
  INCLUDE_ALL_OPS
  ""
)
generate_bindings_for_kernels(
  LIB_NAME "arm_portable_ops_lib" FUNCTIONS_YAML
  ${EXECUTORCH_ROOT}/kernels/portable/functions.yaml
)
gen_operators_lib(
  LIB_NAME "arm_portable_ops_lib" KERNEL_LIBS portable_kernels DEPS executorch
)
```
Are you doing this?
I compared the .map file of my application with the simulator's and observed that in my app, the symbols from some of the libs (libportable_kernels.a, libportable_ops_lib.a, ...) are not being included. (I think that might be the reason for this issue.)
So I tried linking the libs with the --whole-archive option and observed overflows in the .text and .data sections.
I am not sure I can afford such a large memory footprint on my board (as the Cortex-M33 is on an ML island).
So, is there a way I can target only specific libs (for the kernels) among the list, so that I only need to fit those into the available memory?
This might be a long shot, but there is something called selective build; I have not checked it out at all, but it might bring your size down a bit. There is a comment about it here:
https://github.com/pytorch/executorch/issues/7177
For the portable libs there is a yml file to select ops; there might be something similar for the other kernels as well.
Hi again, I have managed to overcome the "kernel not found" issue by linking the libs with --whole-archive and also following the selective build mentioned by @zingo.
However, I have observed that even though the library size is reduced, the map file of my ELF still shows a whole bunch of symbols for operations I am not using, like softmax, cos, sin, etc. I can see that the generated selected_operators.yaml only contains the following lines related to my "add" operation:
```yaml
build_features: []
custom_classes: []
et_kernel_metadata:
  aten::add.out:
  - default
include_all_non_op_selectives: false
include_all_operators: false
kernel_metadata: {}
operators:
  aten::add.out:
    debug_info:
    - None
    include_all_overloads: false
    is_root_operator: true
    is_used_for_training: true
```
So I am not sure why I am still seeing the symbols for unwanted operations in my map file. Can someone please share their insights into this?
I believe the unwanted symbols come from the kernel library libportable_kernels.a being linked with the --whole-archive flag. Note that this is different from the kernel registration library libarm_portable_ops_lib.a, which contains bindings from the operator names selected in the yaml file to symbols in the kernel library. I have not seen anything about how to build only select parts of the kernel library; however, you should be able to link only the wanted parts from it.
Can you check and see if my suspicion is correct?
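For background on why --whole-archive matters for the registration library at all, a generic sketch (not Executorch-specific code):

```cpp
// The registration library consists of global objects whose constructors
// populate a registry at startup. No code references these objects directly,
// so without --whole-archive the linker discards their object files and the
// registry stays empty ("Missing operator" at runtime).
struct Registrar {
  Registrar() { /* insert an entry into a global kernel registry */ }
};
static Registrar register_add_kernel; // kept only if its object file is kept

// The kernel library is different: the registration objects reference its
// functions directly, so the linker pulls in exactly the referenced kernels
// from libportable_kernels.a without needing --whole-archive.
```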
@AdrianLundell, you are right! The unwanted symbols were coming from libportable_kernels.a, as it was linked with --whole-archive.
So it looks like --whole-archive is required only for libarm_portable_ops_lib.a; the linker then pulls the required op symbols from libportable_kernels.a without including the unwanted ones.
Thanks!
Hi everyone, I am able to launch some basic examples on my Cortex-M33, like softmax and gelu!
But as I try other fundamental OPs like add, mul, sigmoid, etc., I am facing hard faults, as mentioned in this query.
It looks like the issue occurs only on the Cortex-M33 (I tried the Corstone simulator's Cortex-M55, where things work fine).
For sigmoid, I was able to overcome this issue by moving back to the older implementation, which uses apply_unary_map_fn.
So, can someone please share their insights into this issue, or suggest a workaround that allows us to try the OPs until a solution is available?
(I tried to use apply_unary_map_fn in the "add" operation, but without much luck, as it involves two input tensors, in contrast to sigmoid which only needs one.)
Great progress! If you do not want to debug or wait for a fix, I think your best bet would be to hack apply_bitensor_elementwise_fn to be less complex (for example, only handle int8) and hopefully get it working. If that does not take you anywhere, you could try rewriting the add op to just use a loop. Proper Cortex-M support is being discussed, as mentioned by Zingo, but it is not something we are actively working on right now, since we are focusing on the NPU.
@AdrianLundell, I am planning to debug utils::apply_bitensor_elementwise_fn.
> If that does not take you anywhere, you could try rewriting the add op to just use a loop
By this statement, I assume that utils::apply_bitensor_elementwise_fn internally uses something like ::executorch::extension::parallel_for(); do you mean I need to replace it with something simple like a for loop?
Can you please elaborate on this "loop" idea?
Sorry, I should have been more clear: I literally mean a simple for loop, just as a way to get something working and to start ruling out what is going wrong. Much appreciated that you are helping with debugging this!
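For what it's worth, a sketch of that idea for add.out could look like the following. It assumes both inputs are contiguous float tensors of identical shape (no broadcasting, no dtype promotion, alpha == 1), i.e. it bypasses everything apply_bitensor_elementwise_fn normally handles, purely to narrow down where things go wrong:

```cpp
#include <executorch/runtime/kernel/kernel_includes.h>

using executorch::aten::Scalar;
using executorch::aten::Tensor;
using executorch::runtime::KernelRuntimeContext;

Tensor& add_out(
    KernelRuntimeContext& ctx,
    const Tensor& a,
    const Tensor& b,
    const Scalar& alpha,
    Tensor& out) {
  (void)ctx;
  (void)alpha; // assumed to be 1 for this experiment
  const float* a_data = a.const_data_ptr<float>();
  const float* b_data = b.const_data_ptr<float>();
  float* out_data = out.mutable_data_ptr<float>();
  // Plain element-wise loop; no parallel_for, no elementwise_util machinery.
  for (size_t i = 0; i < static_cast<size_t>(out.numel()); ++i) {
    out_data[i] = a_data[i] + b_data[i];
  }
  return out;
}
```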
Hi, I'm trying to accomplish the same as @vikasbalaga and I came to the same conclusion. So I tried debugging utils::apply_bitensor_elementwise_fn and reached a dead end: I used ET_LOG (not the best way, but I wanted to get an idea) to find a possible culprit, and I realized that the hard fault seems to come from https://github.com/pytorch/executorch/blob/a073668637944da87100fe66154a9e0c95909318/kernels/portable/cpu/util/elementwise_util.h#L112 during the first iteration. @vikasbalaga, can you test this yourself? Do you have any more clues?
@Juanfi8, I was also debugging that instruction. In fact, I replaced the input_info.load_to_compute() call with a dummy instruction (shown below) and then observed that the hard fault occurs at another point, store_compute_to_out():
```cpp
    /*loaded_inputs[idx] = input_info.load_to_compute( // First point where the hard fault occurs
        &input_info
             .data_ptr[indexes[idx + 1] * input_info.element_size]);*/
    loaded_inputs[idx] = 0;
  }
  auto result = std::apply(compute_fun, loaded_inputs);
  /*store_compute_to_out( // Second point where the hard fault occurs
      result, &data_out[indexes[0] * out_element_size]);*/
  data_out[indexes[0] * out_element_size] = 0;
```
It looks like these input/output functions have issues that show up on the Cortex-M33 but not on the Cortex-M55.
I will explore these further to find more clues.
> It looks like these input/output functions have issues that show up on the Cortex-M33 but not on the Cortex-M55
This is interesting. Are you building the portable library with MCU-specific flags like -mcpu?
> It looks like these input/output functions have issues
Yeah, especially since you hit a fault in the store even after commenting out the load, I suspect something to do with memory access. Are there any differences in memory setup between the two that could cause this?
> This is interesting. Are you building the portable library with MCU-specific flags like -mcpu?
Yes, I have modified the -mcpu flags to point to cortex-m33 instead of cortex-m55 here and here, and I can confirm that in the cmake-out folder, the generated cmake output files only contain references to -mcpu=cortex-m33.
> I suspect something to do with memory access. Are there any differences in memory setup between the two that could cause this?
The Cortex-M55 I am referring to is from the Corstone-300 simulator, and I am not sure of its memory setup.
But I am sure that I have provided sufficient memory to the memory allocator (along with heap and stack) used in my application. In fact, when I tested the sigmoid operation, it crashed at those instructions, whereas replacing them with apply_unary_map_fn worked fine with the same memory setup in the application.
> It looks like these input/output functions have issues that show up on the Cortex-M33 but not on the Cortex-M55
I debugged this issue further (taking the "sigmoid" operation as an example) and observed that:
- get_load_to_compute_fn() internally invokes get_load_to_compute_fn_realhbbf16(), which returns a NULL ptr
- get_store_compute_to_tensor_fn() internally invokes get_store_compute_to_tensor_fn_floathbf16(), which returns a NULL ptr
I suspect these NULL function pointers are responsible for the hard faults (@Juanfi8, can you please confirm?).
So, I am not sure why they are being returned as null pointers (does anyone have any insights into this?).
I haven't tried the other OPs (add, mul), but with these observations I suspect the issue has more to do with our setup than with the CPU architecture.
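If this suspicion is right, the hard fault itself follows: on Cortex-M, calling through a null function pointer branches to address 0 (the vector table), which raises a fault. One hedged debugging aid, based on the call sites quoted earlier from kernels/portable/cpu/util/elementwise_util.h (the exact member names may differ in your version), is to guard the pointers before calling them:

```cpp
// Sketch: turn the silent hard fault into a diagnosable check. ET_CHECK_MSG
// aborts with a log message instead of jumping to address 0.
auto load_fn = input_info.load_to_compute;
ET_CHECK_MSG(
    load_fn != nullptr,
    "load_to_compute is null - dtype not covered by the selected build?");
loaded_inputs[idx] = load_fn(
    &input_info.data_ptr[indexes[idx + 1] * input_info.element_size]);
```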