Running aten::add.out instead of aten::_softmax.out leads to FLASH overflow
🐛 Describe the bug
Hi, I am running the following commands to build the bare-metal libraries, as described in the documentation (https://pytorch.org/executorch/stable/executorch-arm-delegate-tutorial.html). This time, I want to try add.out instead of softmax.out. The softmax operation works perfectly, but when I compile with the add operation, I run into a FLASH overflow.
cmake \
-DBUCK2=/tmp/buck2 \
-DCMAKE_INSTALL_PREFIX=$(pwd)/cmake-out \
-DEXECUTORCH_BUILD_EXECUTOR_RUNNER=OFF \
-DCMAKE_BUILD_TYPE=Release \
-DEXECUTORCH_ENABLE_LOGGING=ON \
-DEXECUTORCH_BUILD_ARM_BAREMETAL=ON \
-DEXECUTORCH_BUILD_EXTENSION_RUNNER_UTIL=ON \
-DFLATC_EXECUTABLE="$(which flatc)" \
-DCMAKE_TOOLCHAIN_FILE=$(pwd)/examples/arm/ethos-u-setup/arm-none-eabi-gcc.cmake \
-B$(pwd)/cmake-out \
$(pwd)
cmake --build $(pwd)/cmake-out -j4 --target install --config Release
cmake \
-DCMAKE_INSTALL_PREFIX=$(pwd)/cmake-out \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_TOOLCHAIN_FILE=$(pwd)/examples/arm/ethos-u-setup/arm-none-eabi-gcc.cmake \
-DEXECUTORCH_SELECT_OPS_LIST="aten::add.out" \
-B$(pwd)/cmake-out/examples/arm \
$(pwd)/examples/arm
cmake --build $(pwd)/cmake-out/examples/arm --config Release
I also adjusted the AddModule in aot_arm_compiler.py since I am targeting a Cortex-M4 architecture:
class AddModule(torch.nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        return x + x

    example_input = (torch.ones(2, 2),)
    can_delegate = False
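For context, this module gets lowered to the .pte that the runner loads, roughly like the sketch below (aot_arm_compiler.py does the real work; the API names here are from a newer ExecuTorch and vary across versions):

import torch
from executorch.exir import to_edge

module = AddModule()
exported = torch.export.export(module, AddModule.example_input)
edge = to_edge(exported)        # lower to the edge dialect
program = edge.to_executorch()  # emit an ExecuTorch program
with open("add.pte", "wb") as f:
    f.write(program.buffer)     # flatbuffer loaded by arm_executor_runner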
Why does the add.out operation cause a FLASH overflow, while the arm_executor_runner built with softmax.out is five times smaller than the maximum executable size on my hardware?
Versions
PyTorch version: 2.4.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.28.3
Libc version: glibc-2.35

Python version: 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-122-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce GTX 950M
Nvidia driver version: 535.183.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 39 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Vendor ID: GenuineIntel
Model name: Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
CPU family: 6
Model: 94
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
Stepping: 3
CPU max MHz: 3500,0000
CPU min MHz: 800,0000
BogoMIPS: 5199.98
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d arch_capabilities
Virtualization: VT-x
L1d cache: 128 KiB (4 instances)
L1i cache: 128 KiB (4 instances)
L2 cache: 1 MiB (4 instances)
L3 cache: 6 MiB (1 instance)
NUMA node(s): 1
NUMA node0 CPU(s): 0-7
Vulnerability Gather data sampling: Vulnerable: No microcode
Vulnerability Itlb multihit: KVM: Mitigation: VMX disabled
Vulnerability L1tf: Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Vulnerability Mds: Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Meltdown: Mitigation; PTI
Vulnerability Mmio stale data: Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Mitigation; IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; IBRS; IBPB conditional; STIBP conditional; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Mitigation; Microcode
Vulnerability Tsx async abort: Mitigation; TSX disabled

Versions of relevant libraries:
[pip3] executorch==0.1.0
[pip3] numpy==2.1.1
[pip3] torch==2.4.1
[pip3] torchsr==1.0.4
[pip3] torchvision==0.19.1
[pip3] triton==3.0.0
[conda] Could not collect
@digantdesai Can you help take a look?
Hi @ChristophKarlHeck, I'm happy to see you're trying this out on a Cortex-M4! There can be multiple reasons for your issue, but a good starting point is to check:
https://github.com/pytorch/executorch/blob/341545c0c11164d0381ac11871762cf0481ab60c/examples/arm/executor_runner/arm_executor_runner.cpp#L66 https://github.com/pytorch/executorch/blob/341545c0c11164d0381ac11871762cf0481ab60c/examples/arm/executor_runner/arm_executor_runner.cpp#L71
As you can see, those are pretty large. We use such large numbers to be able to run "any" model with the same code; a catch-all... Try reducing those numbers to fit your use case. (Heads up, #5580 is in-flight and will alter the code in this area, but the same idea applies.)
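For illustration, shrinking means something like this (a sketch with made-up names and sizes, not the exact code at the lines linked above):

#include <cstddef>

// The catch-all pools in the example runner are deliberately oversized;
// cut them down to what your model, inputs and outputs actually need.
constexpr size_t kMethodAllocatorPoolSize = 512 * 1024U;  // instead of tens of MB
unsigned char __attribute__((aligned(16))) method_allocator_pool[kMethodAllocatorPoolSize];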
I assume you have ported https://github.com/pytorch/executorch/blob/main/examples/arm/ethos-u-setup/arm-none-eabi-gcc.cmake and the linker scripts etc. to your target platform?
At some point we want to give this code some more love, but we haven't gotten to it yet.
If you've already tried this, some more info would be helpful. By how much is the flash overflowed? Have you tried building this for the Corstone-300 FVP target, and if so, did it work? A build log would also be nice.
Let us know if this helps. Cheers!
Except for the METHOD_ALLOCATOR_POOL_SIZE, the text size (measured with the size command on the generated ELF) seems to grow when going from -DEXECUTORCH_SELECT_OPS_LIST="aten::_softmax.out" to -DEXECUTORCH_SELECT_OPS_LIST="aten::add.out".
The non-delegated softmax example:
executable_text: 534524 bytes
executable_data: 74454388 bytes <--- this is mostly the mentioned METHOD_ALLOCATOR_POOL_SIZE above
executable_bss: 39632 bytes
pte_data_size: 960 bytes
Your non-delegated add:
executable_text: 816540 bytes <----- What happened here?????
executable_data: 74455676 bytes
executable_bss: 39632 bytes
pte_data_size: 2240 bytes
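(For reference, a quick way to reproduce numbers like these yourself is binutils' size tool on the generated ELF; the path below is illustrative:)

arm-none-eabi-size -B cmake-out/examples/arm/arm_executor_runner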
A hint that might give us more information: generate a map file in the two cases and compare them. To get the oversized add build to link at all, you may need to fake a larger memory region in the linker script.
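With the GNU toolchain the linker can emit the map file directly; a sketch (these are standard binutils flags, but where you inject them depends on how your build wires linker flags):

cmake ... -DCMAKE_EXE_LINKER_FLAGS="-Wl,-Map=arm_executor_runner.map,--cref" ...
# or when invoking the link step directly:
arm-none-eabi-g++ ... -Wl,-Map=output.map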
@digantdesai Have you experienced this before? Could there be some portable-lib dependency kicking in wrongly when generating one with only an add?
Hi @freddan80,
Thank you for the response. In general, I am working on the following project: https://github.com/ChristophKarlHeck/mbed-torch-fusion-os using the ceb1f1d05ceab420644a3633264f13547dc02411 commit of ExecuTorch.
__attribute__((section(".sram.data"), aligned(16)))
uint8_t method_allocator_pool[4 * 1024U];
I am not sure if I can go smaller than that... How can I figure it out? From the size of the model?
Yes, I am using https://github.com/pytorch/executorch/blob/main/examples/arm/ethos-u-setup/arm-none-eabi-gcc.cmake to target the Cortex-M4 platform (NUCLEO_WB55RG) and refined it with this patch: https://github.com/ChristophKarlHeck/mbed-torch-fusion-os/blob/main/utils/patches/cmake_file_m4.patch
Here are the build logs for
libextension_runner_util.a
libexecutorch.a
libportable_kernels.a
build_logs_basic_libs.txt
I'd like to know why all the operations are compiled into the basic libs. I tried commenting out the unnecessary ones in functions.yaml, but it didn't make a difference.
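(For reference, and assuming I was editing the right file, i.e. the portable kernels' functions.yaml, the entries I commented out look roughly like this:)

- op: add.out
  kernels:
    - arg_meta: null
      kernel_name: torch::executor::add_out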
Here are the softmax build logs for
libarm_portable_ops_lib.a
Here are the build logs of the arm_executor_runner.cpp with aten::_softmax.out: build_arm_runner_softmax_logs.txt
Here are the add build logs for
libarm_portable_ops_lib.a
Here are the build logs of the arm runner with aten::add.out: build_arm_runner_add_logs.txt
I will try to run it on the Corstone-300 FVP target tomorrow and keep you posted! Is there anything else I can do to find the error?
Hi @zingo, I am new to analyzing .map files. I used diff to compare them, but I couldn't find the culprit, i.e. the file that causes the size issue. What tool do you use to compare .map files? I had to zip the .map files since each is 37.5 MB. map_files.zip
I don't have a good tool beyond some sort of diff tool (I'm using meld); I just try to see if there is a lot of something in one file that is not in the other, or if some sizes have changed a lot.
Looking at your map files, libportable_kernels.a seems to be the thing to check. Looking at the .text sections, in the softmax map file I only see:
.text._ZZZZN5torch8executor6native11softmax_outERNS0_20KernelRuntimeContextERKNS0_6TensorExbRS4_ENKUlvE_clEvENKUlvE1_clEvENKUljjjE_clEjjj.isra.0
0x080102e4 0x690 /home/chris/mbed-torch-fusion-os/executorch/cmake-out/kernels/portable/libportable_kernels.a(op_softmax.cpp.obj)
.text._ZN5torch8executor14getLeadingDimsERKNS0_6TensorEx.part.0
0x08010974 0x60 /home/chris/mbed-torch-fusion-os/executorch/cmake-out/kernels/portable/libportable_kernels.a(op_softmax.cpp.obj)
.text._ZN5torch8executor6native11softmax_outERNS0_20KernelRuntimeContextERKNS0_6TensorExbRS4_
0x080109d4 0xa88 /home/chris/mbed-torch-fusion-os/executorch/cmake-out/kernels/portable/libportable_kernels.a(op_softmax.cpp.obj)
0x080109d4 torch::executor::native::softmax_out(torch::executor::KernelRuntimeContext&, torch::executor::Tensor const&, long long, bool, torch::executor::Tensor&)
.text._ZN5torch8executor22check_log_softmax_argsERKNS0_6TensorExbRS1_
0x0801145c 0x4d4 /home/chris/mbed-torch-fusion-os/executorch/cmake-out/kernels/portable/libportable_kernels.a(activation_ops_util.cpp.obj)
0x0801145c torch::executor::check_log_softmax_args(torch::executor::Tensor const&, long long, bool, torch::executor::Tensor&)
.text._ZN5torch8executor18check_softmax_argsERKNS0_6TensorExbRS1_
0x08011930 0xc /home/chris/mbed-torch-fusion-os/executorch/cmake-out/kernels/portable/libportable_kernels.a(activation_ops_util.cpp.obj)
0x08011930 torch::executor::check_softmax_args(torch::executor::Tensor const&, long long, bool, torch::executor::Tensor&)
whereas the add map has a really long list of entries (I cut out a lot and replaced it with ... below):
.text._ZN5torch8executor27apply_binary_elementwise_fnIbbaZZZZZZZZZNS0_6native7add_outERNS0_20KernelRuntimeContextERKNS0_6TensorES7_RKNS0_6ScalarERS5_ENKUlvE_clEvENKUlvE7_clEvENKUlvE_clEvENKUlvE7_clEvENKUlvE_clEvENKUlvE2_clEvENKUlvE_clEvENKUlvE0_clEvEUlbbE_EEvRKT2_S7_S7_S7_
0x080101d4 0x1a8 /home/chris/mbed-torch-fusion-os/executorch/cmake-out/kernels/portable/libportable_kernels.a(op_add.cpp.obj)
.text._ZN5torch8executor27apply_binary_elementwise_fnIaahZZZZZZZZZNS0_6native7add_outERNS0_20KernelRuntimeContextERKNS0_6TensorES7_RKNS0_6ScalarERS5_ENKUlvE_clEvENKUlvE0_clEvENKUlvE_clEvENKUlvE0_clEvENKUlvE_clEvENKUlvE1_clEvENKUlvE_clEvENKUlvE_clEvEUlaaE_EEvRKT2_S7_S7_S7_
0x0801037c 0x1a8 /home/chris/mbed-torch-fusion-os/executorch/cmake-out/kernels/portable/libportable_kernels.a(op_add.cpp.obj)
.text._ZN5torch8executor27apply_binary_elementwise_fnIhbaZZZZZZZZZNS0_6native7add_outERNS0_20KernelRuntimeContextERKNS0_6TensorES7_RKNS0_6ScalarERS5_ENKUlvE_clEvENKUlvE_clEvENKUlvE_clEvENKUlvE7_clEvENKUlvE_clEvENKUlvE1_clEvENKUlvE_clEvENKUlvE0_clEvEUlhbE_EEvRKT2_S7_S7_S7_
0x08010524 0x1ac /home/chris/mbed-torch-fusion-os/executorch/cmake-out/kernels/portable/libportable_kernels.a(op_add.cpp.obj)
.text._ZN5torch8executor27apply_binary_elementwise_fnIaaaZZZZZZZZZNS0_6native7add_outERNS0_20KernelRuntimeContextERKNS0_6TensorES7_RKNS0_6ScalarERS5_ENKUlvE_clEvENKUlvE0_clEvENKUlvE_clEvENKUlvE0_clEvENKUlvE_clEvENKUlvE1_clEvENKUlvE_clEvENKUlvE0_clEvEUlaaE_EEvRKT2_S7_S7_S7_
0x080106d0 0x1a8 /home/chris/mbed-torch-fusion-os/executorch/cmake-out/kernels/portable/libportable_kernels.a(op_add.cpp.obj)
.text._ZN5torch8executor27apply_binary_elementwise_fnIbbaZZZZZZZZZNS0_6native7add_outERNS0_20KernelRuntimeContextERKNS0_6TensorES7_RKNS0_6ScalarERS5_ENKUlvE_clEvENKUlvE7_clEvENKUlvE_clEvENKUlvE7_clEvENKUlvE_clEvENKUlvE1_clEvENKUlvE_clEvENKUlvE0_clEvEUlbbE_EEvRKT2_S7_S7_S7_
0x08010878 0x1ac /home/chris/mbed-torch-fusion-os/executorch/cmake-out/kernels/portable/libportable_kernels.a(op_add.cpp.obj)
...
.text._ZZZZZZN5torch8executor6native7add_outERNS0_20KernelRuntimeContextERKNS0_6TensorES6_RKNS0_6ScalarERS4_ENKUlvE_clEvENKUlvE_clEvENKUlvE_clEvENKUlvE2_clEvENKUlvE_clEv.isra.0
0x08314f38 0x1ad8 /home/chris/mbed-torch-fusion-os/executorch/cmake-out/kernels/portable/libportable_kernels.a(op_add.cpp.obj)
.text._ZZZZZZN5torch8executor6native7add_outERNS0_20KernelRuntimeContextERKNS0_6TensorES6_RKNS0_6ScalarERS4_ENKUlvE_clEvENKUlvE_clEvENKUlvE_clEvENKUlvE1_clEvENKUlvE_clEv.isra.0
0x08316a10 0x1ad8 /home/chris/mbed-torch-fusion-os/executorch/cmake-out/kernels/portable/libportable_kernels.a(op_add.cpp.obj)
.text._ZZZZZZN5torch8executor6native7add_outERNS0_20KernelRuntimeContextERKNS0_6TensorES6_RKNS0_6ScalarERS4_ENKUlvE_clEvENKUlvE_clEvENKUlvE_clEvENKUlvE0_clEvENKUlvE_clEv.isra.0
0x083184e8 0x1ad8 /home/chris/mbed-torch-fusion-os/executorch/cmake-out/kernels/portable/libportable_kernels.a(op_add.cpp.obj)
.text._ZZN5torch8executor6native7add_outERNS0_20KernelRuntimeContextERKNS0_6TensorES6_RKNS0_6ScalarERS4_ENKUlvE_clEv.isra.0
0x08319fc0 0x15c4 /home/chris/mbed-torch-fusion-os/executorch/cmake-out/kernels/portable/libportable_kernels.a(op_add.cpp.obj)
.text._ZN5torch8executor6native7add_outERNS0_20KernelRuntimeContextERKNS0_6TensorES6_RKNS0_6ScalarERS4_
0x0831b584 0x2f0 /home/chris/mbed-torch-fusion-os/executorch/cmake-out/kernels/portable/libportable_kernels.a(op_add.cpp.obj)
0x0831b584 torch::executor::native::add_out(torch::executor::KernelRuntimeContext&, torch::executor::Tensor const&, torch::executor::Tensor const&, torch::executor::Scalar const&, torch::executor::Tensor&)
.text._ZN5torch8executor33tensors_are_broadcastable_betweenENS0_8ArrayRefIlEES2_
0x0831b874 0x5c /home/chris/mbed-torch-fusion-os/executorch/cmake-out/kernels/portable/libportable_kernels.a(broadcast_util.cpp.obj)
0x0831b874 torch::executor::tensors_are_broadcastable_between(torch::executor::ArrayRef<long>, torch::executor::ArrayRef<long>)
.text._ZN5torch8executor25get_broadcast_target_sizeENS0_8ArrayRefIlEES2_PljPj
0x0831b8d0 0x110 /home/chris/mbed-torch-fusion-os/executorch/cmake-out/kernels/portable/libportable_kernels.a(broadcast_util.cpp.obj)
0x0831b8d0 torch::executor::get_broadcast_target_size(torch::executor::ArrayRef<long>, torch::executor::ArrayRef<long>, long*, unsigned int, unsigned int*)
.text._ZN5torch8executor25get_broadcast_target_sizeERKNS0_6TensorES3_PljPj
0x0831b9e0 0x38 /home/chris/mbed-torch-fusion-os/executorch/cmake-out/kernels/portable/libportable_kernels.a(broadcast_util.cpp.obj)
0x0831b9e0 torch::executor::get_broadcast_target_size(torch::executor::Tensor const&, torch::executor::Tensor const&, long*, unsigned int, unsigned int*)
.text._ZN5torch8executor17delinearize_indexEjNS0_8ArrayRefIlEEPjj
0x0831ba18 0x80 /home/chris/mbed-torch-fusion-os/executorch/cmake-out/kernels/portable/libportable_kernels.a(broadcast_util.cpp.obj)
0x0831ba18 torch::executor::delinearize_index(unsigned int, torch::executor::ArrayRef<long>, unsigned int*, unsigned int)
.text._ZN5torch8executor17delinearize_indexEjRKNS0_6TensorEPjj
0x0831ba98 0x28 /home/chris/mbed-torch-fusion-os/executorch/cmake-out/kernels/portable/libportable_kernels.a(broadcast_util.cpp.obj)
0x0831ba98 torch::executor::delinearize_index(unsigned int, torch::executor::Tensor const&, unsigned int*, unsigned int)
.text._ZN5torch8executor24linearize_access_indexesENS0_8ArrayRefIjEEiNS1_IlEES3_
0x0831bac0 0xf0 /home/chris/mbed-torch-fusion-os/executorch/cmake-out/kernels/portable/libportable_kernels.a(broadcast_util.cpp.obj)
0x0831bac0 torch::executor::linearize_access_indexes(torch::executor::ArrayRef<unsigned int>, int, torch::executor::ArrayRef<long>, torch::executor::ArrayRef<long>)
.text._ZN5torch8executor24linearize_access_indexesENS0_8ArrayRefIjEEiRKNS0_6TensorE
0x0831bbb0 0x40 /home/chris/mbed-torch-fusion-os/executorch/cmake-out/kernels/portable/libportable_kernels.a(broadcast_util.cpp.obj)
0x0831bbb0 torch::executor::linearize_access_indexes(torch::executor::ArrayRef<unsigned int>, int, torch::executor::Tensor const&)
.text._ZN5torch8executor16check_alpha_typeENS0_10ScalarTypeES1_
0x0831bbf0 0x94 /home/chris/mbed-torch-fusion-os/executorch/cmake-out/kernels/portable/libportable_kernels.a(kernel_ops_util.cpp.obj)
0x0831bbf0 torch::executor::check_alpha_type(torch::executor::ScalarType, torch::executor::ScalarType)
Most of it seems to come from op_add.cpp.obj; it seems to "cost" almost 3 MB, compared to op_softmax.cpp.obj, which only seems to cost about 47 kB.
It seems the template expansions in op_add.cpp could be the problem.
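For intuition, here is a hedged illustration (not the actual ExecuTorch source) of why a dtype-generic binary op multiplies template instantiations:

#include <cstddef>

// One instantiation is emitted per (lhs dtype, rhs dtype, out dtype) combination,
// so with N supported dtypes a binary op can approach N*N*N copies of this loop,
// while a float-only op like softmax needs just a handful.
template <typename A, typename B, typename Out>
void apply_binary_elementwise(const A* a, const B* b, Out* out, size_t n) {
  for (size_t i = 0; i < n; ++i) {
    out[i] = static_cast<Out>(a[i] + b[i]);
  }
}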
Even if this doesn't help you directly, I hope it gives some clues about where to look next.
Hi @zingo, Thank you for the help. It gives me some hints about where to look. I'll update you when I find the cause of the huge size increase.
Most of it seems to come from op_add.cpp.obj; it seems to "cost" almost 3 MB, compared to op_softmax.cpp.obj, which only seems to cost about 47 kB.
If you compile a portable op it can be pretty large, since it generates many dtype combinations. One way around that is a selective build, i.e. linking only the ops you want, for the dtypes you want.
@larryliu0820 can you show how to specify dtype in a yaml? Also @Olivia-liu can help? :)
Short update:
I updated my ExecuTorch version to v0.3.0 and linked libexecutorch_no_prim_ops.a into my target. Hence, I can now use aten::add.out without a FLASH overflow.
Besides that, it would be beneficial to know how to do the selective build since future work will use it.
I appreciate any help you can provide.
Here's the README for selective builds. Hope it helps! https://github.com/pytorch/executorch/blob/main/examples/selective_build/README.md
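If I remember correctly, the example project there exposes roughly these CMake switches (double-check against the README for your version):

# Link every portable op (large):
cmake ... -DEXECUTORCH_SELECT_ALL_OPS=ON ...
# Link only named ops (what you are doing already):
cmake ... -DEXECUTORCH_SELECT_OPS_LIST="aten::add.out" ...
# Select ops from a yaml file; this is where finer-grained selection hooks in:
cmake ... -DEXECUTORCH_SELECT_OPS_YAML=ON ...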
Hi, I just spotted this recent merge: https://github.com/pytorch/executorch/pull/6006 I'm on my phone and can't check too much right now, so I'm not sure if this is the code you end up using for add, but it could be worth checking whether it improves your case.
Hi @zingo, thank you very much! I will be on vacation the next two weeks then I'll check the improvements and let you know!
Hi @ChristophKarlHeck Is this still a problem? I wonder if we can close this issue? :)
Hi @zingo, we can close the issue :) sorry for not replying in time! Thank you
No problem. I'm a bit curious whether you figured out something extra to make it smaller; that would be good to know 😊
I didn't dig further, since the change to executorch==0.3.0 solved the problem. So, no further analysis was necessary :)
Thanks for the update.