Running aten::add.out instead of aten::_softmax.out leads to FLASH overflow
🐛 Describe the bug
Hi, I am running the following commands to build the bare-metal libraries, as described in the documentation (https://pytorch.org/executorch/stable/executorch-arm-delegate-tutorial.html). This time, I want to try add.out instead of softmax.out. The softmax operation works perfectly, but when I compile with the add operation, I run into a FLASH overflow.
cmake \
-DBUCK2=/tmp/buck2 \
-DCMAKE_INSTALL_PREFIX=$(pwd)/cmake-out \
-DEXECUTORCH_BUILD_EXECUTOR_RUNNER=OFF \
-DCMAKE_BUILD_TYPE=Release \
-DEXECUTORCH_ENABLE_LOGGING=ON \
-DEXECUTORCH_BUILD_ARM_BAREMETAL=ON \
-DEXECUTORCH_BUILD_EXTENSION_RUNNER_UTIL=ON \
-DFLATC_EXECUTABLE="$(which flatc)" \
-DCMAKE_TOOLCHAIN_FILE=$(pwd)/examples/arm/ethos-u-setup/arm-none-eabi-gcc.cmake \
-B$(pwd)/cmake-out \
$(pwd)
cmake --build $(pwd)/cmake-out -j4 --target install --config Release
cmake \
-DCMAKE_INSTALL_PREFIX=$(pwd)/cmake-out \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_TOOLCHAIN_FILE=$(pwd)/examples/arm/ethos-u-setup/arm-none-eabi-gcc.cmake \
-DEXECUTORCH_SELECT_OPS_LIST="aten::add.out" \
-B$(pwd)/cmake-out/examples/arm \
$(pwd)/examples/arm
cmake --build $(pwd)/cmake-out/examples/arm --config Release
I also adjusted the AddModule in aot_arm_compiler.py since I am targeting a Cortex-M4 architecture:
class AddModule(torch.nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        return x + x

    example_input = (torch.ones(2, 2),)
    can_delegate = False
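For context, this module gets lowered to the .pte that the runner loads, roughly like the sketch below (aot_arm_compiler.py does the real work; the API names here are from a newer ExecuTorch and vary across versions):

import torch
from executorch.exir import to_edge

module = AddModule()
exported = torch.export.export(module, AddModule.example_input)
edge = to_edge(exported)        # lower to the edge dialect
program = edge.to_executorch()  # emit an ExecuTorch program
with open("add.pte", "wb") as f:
    f.write(program.buffer)     # flatbuffer loaded by arm_executor_runner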
Why does the add.out operation cause a FLASH overflow, while the arm_executor_runner built with softmax.out is five times smaller than the maximum executable size on my hardware?
Versions
PyTorch version: 2.4.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.28.3
Libc version: glibc-2.35

Python version: 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-122-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce GTX 950M
Nvidia driver version: 535.183.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 39 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Vendor ID: GenuineIntel
Model name: Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
CPU family: 6
Model: 94
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
Stepping: 3
CPU max MHz: 3500,0000
CPU min MHz: 800,0000
BogoMIPS: 5199.98
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d arch_capabilities
Virtualization: VT-x
L1d cache: 128 KiB (4 instances)
L1i cache: 128 KiB (4 instances)
L2 cache: 1 MiB (4 instances)
L3 cache: 6 MiB (1 instance)
NUMA node(s): 1
NUMA node0 CPU(s): 0-7
Vulnerability Gather data sampling: Vulnerable: No microcode
Vulnerability Itlb multihit: KVM: Mitigation: VMX disabled
Vulnerability L1tf: Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Vulnerability Mds: Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Meltdown: Mitigation; PTI
Vulnerability Mmio stale data: Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Mitigation; IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; IBRS; IBPB conditional; STIBP conditional; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Mitigation; Microcode
Vulnerability Tsx async abort: Mitigation; TSX disabled

Versions of relevant libraries:
[pip3] executorch==0.1.0
[pip3] numpy==2.1.1
[pip3] torch==2.4.1
[pip3] torchsr==1.0.4
[pip3] torchvision==0.19.1
[pip3] triton==3.0.0
[conda] Could not collect
@digantdesai Can you help take a look?
Hi @ChristophKarlHeck, I'm happy to see you're trying this out on a Cortex-M4! There can be multiple reasons for your issue, but a good starting point is to check:
https://github.com/pytorch/executorch/blob/341545c0c11164d0381ac11871762cf0481ab60c/examples/arm/executor_runner/arm_executor_runner.cpp#L66 https://github.com/pytorch/executorch/blob/341545c0c11164d0381ac11871762cf0481ab60c/examples/arm/executor_runner/arm_executor_runner.cpp#L71
As you can see, those are pretty large. We use such large numbers to be able to run "any" model with the same code; a catch-all... Try reducing those numbers to fit your use case. (Heads up, #5580 is in-flight and will alter the code in this area, but the same idea applies.)
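For illustration, shrinking means something like this (a sketch with made-up names and sizes, not the exact code at the lines linked above):

#include <cstddef>

// The catch-all pools in the example runner are deliberately oversized;
// cut them down to what your model, inputs and outputs actually need.
constexpr size_t kMethodAllocatorPoolSize = 512 * 1024U;  // instead of tens of MB
unsigned char __attribute__((aligned(16))) method_allocator_pool[kMethodAllocatorPoolSize];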
I assume you have ported https://github.com/pytorch/executorch/blob/main/examples/arm/ethos-u-setup/arm-none-eabi-gcc.cmake and the linker scripts etc. to your target platform?
At some point we want to give this code some more love, but we haven't gotten to it yet.
If you've already tried this, some more info would be helpful. By how much is the flash overflowed? Have you tried building this for the Corstone-300 FVP target, and if so, did it work? A build log would also be nice.
Let us know if this helps. Cheers!
Except for the METHOD_ALLOCATOR_POOL_SIZE, the text size (measured with the size command on the generated ELF) seems to grow when going from -DEXECUTORCH_SELECT_OPS_LIST="aten::_softmax.out" to -DEXECUTORCH_SELECT_OPS_LIST="aten::add.out".
The non-delegated softmax example:
executable_text: 534524 bytes
executable_data: 74454388 bytes <--- this is mostly the mentioned METHOD_ALLOCATOR_POOL_SIZE above
executable_bss: 39632 bytes
pte_data_size: 960 bytes
Your non-delegated add:
executable_text: 816540 bytes <----- What happened here?????
executable_data: 74455676 bytes
executable_bss: 39632 bytes
pte_data_size: 2240 bytes
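(For reference, a quick way to reproduce numbers like these yourself is binutils' size tool on the generated ELF; the path below is illustrative:)

arm-none-eabi-size -B cmake-out/examples/arm/arm_executor_runner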
A hint that might give us more information: generate a map file in the two cases and compare them. To get the oversized add build to link at all, you may need to fake a larger memory region in the linker script.
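With the GNU toolchain the linker can emit the map file directly; a sketch (these are standard binutils flags, but where you inject them depends on how your build wires linker flags):

cmake ... -DCMAKE_EXE_LINKER_FLAGS="-Wl,-Map=arm_executor_runner.map,--cref" ...
# or when invoking the link step directly:
arm-none-eabi-g++ ... -Wl,-Map=output.map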
@digantdesai Have you experienced this before? Could there be some portable-lib dependency kicking in wrongly when generating one with only an add?
Hi @freddan80,
Thank you for the response. In general, I am working on the following project: https://github.com/ChristophKarlHeck/mbed-torch-fusion-os using the ceb1f1d05ceab420644a3633264f13547dc02411 commit of ExecuTorch.
__attribute__((section(".sram.data"), aligned(16)))
uint8_t method_allocator_pool[4 * 1024U];
I am not sure if I can go smaller than that... How can I figure it out? From the size of the model?
Yes, I am using https://github.com/pytorch/executorch/blob/main/examples/arm/ethos-u-setup/arm-none-eabi-gcc.cmake to target the Cortex-M4 platform (NUCLEO_WB55RG) and refined it with this patch: https://github.com/ChristophKarlHeck/mbed-torch-fusion-os/blob/main/utils/patches/cmake_file_m4.patch
Here are the build logs for
libextension_runner_util.a
libexecutorch.a
libportable_kernels.a
build_logs_basic_libs.txt
I'd like to know why all the operations are compiled into the basic libs. I tried commenting out the unnecessary ones in functions.yaml, but it didn't make a difference.
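(For reference, and assuming I was editing the right file, i.e. the portable kernels' functions.yaml, the entries I commented out look roughly like this:)

- op: add.out
  kernels:
    - arg_meta: null
      kernel_name: torch::executor::add_out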
Here are the softmax build logs for
libarm_portable_ops_lib.a
Here are the build logs of the arm_executor_runner.cpp with aten::_softmax.out: build_arm_runner_softmax_logs.txt
Here are the add build logs for
libarm_portable_ops_lib.a
Here are the build logs of the arm runner with aten::add.out: build_arm_runner_add_logs.txt
I will try to run it on the Corstone-300 FVP target tomorrow and keep you posted! Is there anything else I can do to find the error?
Hi @zingo, I am new to analyzing .map files. I used diff to compare them, but I couldn't find the culprit, i.e. the file that causes the size issue. What tool do you use to compare .map files? I had to zip the .map files since each is 37.5 MB. map_files.zip
I don't have a good tool beyond some sort of diff tool (I'm using meld); I just try to see if there is a lot of something in one file that is not in the other, or if some sizes have changed a lot.
Looking at your map files, libportable_kernels.a seems to be the thing to check. Looking at the .text sections, in the softmax map file I only see:
.text._ZZZZN5torch8executor6native11softmax_outERNS0_20KernelRuntimeContextERKNS0_6TensorExbRS4_ENKUlvE_clEvENKUlvE1_clEvENKUljjjE_clEjjj.isra.0
0x080102e4 0x690 /home/chris/mbed-torch-fusion-os/executorch/cmake-out/kernels/portable/libportable_kernels.a(op_softmax.cpp.obj)
.text._ZN5torch8executor14getLeadingDimsERKNS0_6TensorEx.part.0
0x08010974 0x60 /home/chris/mbed-torch-fusion-os/executorch/cmake-out/kernels/portable/libportable_kernels.a(op_softmax.cpp.obj)
.text._ZN5torch8executor6native11softmax_outERNS0_20KernelRuntimeContextERKNS0_6TensorExbRS4_
0x080109d4 0xa88 /home/chris/mbed-torch-fusion-os/executorch/cmake-out/kernels/portable/libportable_kernels.a(op_softmax.cpp.obj)
0x080109d4 torch::executor::native::softmax_out(torch::executor::KernelRuntimeContext&, torch::executor::Tensor const&, long long, bool, torch::executor::Tensor&)
.text._ZN5torch8executor22check_log_softmax_argsERKNS0_6TensorExbRS1_
0x0801145c 0x4d4 /home/chris/mbed-torch-fusion-os/executorch/cmake-out/kernels/portable/libportable_kernels.a(activation_ops_util.cpp.obj)
0x0801145c torch::executor::check_log_softmax_args(torch::executor::Tensor const&, long long, bool, torch::executor::Tensor&)
.text._ZN5torch8executor18check_softmax_argsERKNS0_6TensorExbRS1_
0x08011930 0xc /home/chris/mbed-torch-fusion-os/executorch/cmake-out/kernels/portable/libportable_kernels.a(activation_ops_util.cpp.obj)
0x08011930 torch::executor::check_softmax_args(torch::executor::Tensor const&, long long, bool, torch::executor::Tensor&)
whereas the add map has a really long list of entries (I cut out a lot and replaced it with ... below):
.text._ZN5torch8executor27apply_binary_elementwise_fnIbbaZZZZZZZZZNS0_6native7add_outERNS0_20KernelRuntimeContextERKNS0_6TensorES7_RKNS0_6ScalarERS5_ENKUlvE_clEvENKUlvE7_clEvENKUlvE_clEvENKUlvE7_clEvENKUlvE_clEvENKUlvE2_clEvENKUlvE_clEvENKUlvE0_clEvEUlbbE_EEvRKT2_S7_S7_S7_
0x080101d4 0x1a8 /home/chris/mbed-torch-fusion-os/executorch/cmake-out/kernels/portable/libportable_kernels.a(op_add.cpp.obj)
.text._ZN5torch8executor27apply_binary_elementwise_fnIaahZZZZZZZZZNS0_6native7add_outERNS0_20KernelRuntimeContextERKNS0_6TensorES7_RKNS0_6ScalarERS5_ENKUlvE_clEvENKUlvE0_clEvENKUlvE_clEvENKUlvE0_clEvENKUlvE_clEvENKUlvE1_clEvENKUlvE_clEvENKUlvE_clEvEUlaaE_EEvRKT2_S7_S7_S7_
0x0801037c 0x1a8 /home/chris/mbed-torch-fusion-os/executorch/cmake-out/kernels/portable/libportable_kernels.a(op_add.cpp.obj)
.text._ZN5torch8executor27apply_binary_elementwise_fnIhbaZZZZZZZZZNS0_6native7add_outERNS0_20KernelRuntimeContextERKNS0_6TensorES7_RKNS0_6ScalarERS5_ENKUlvE_clEvENKUlvE_clEvENKUlvE_clEvENKUlvE7_clEvENKUlvE_clEvENKUlvE1_clEvENKUlvE_clEvENKUlvE0_clEvEUlhbE_EEvRKT2_S7_S7_S7_
0x08010524 0x1ac /home/chris/mbed-torch-fusion-os/executorch/cmake-out/kernels/portable/libportable_kernels.a(op_add.cpp.obj)
.text._ZN5torch8executor27apply_binary_elementwise_fnIaaaZZZZZZZZZNS0_6native7add_outERNS0_20KernelRuntimeContextERKNS0_6TensorES7_RKNS0_6ScalarERS5_ENKUlvE_clEvENKUlvE0_clEvENKUlvE_clEvENKUlvE0_clEvENKUlvE_clEvENKUlvE1_clEvENKUlvE_clEvENKUlvE0_clEvEUlaaE_EEvRKT2_S7_S7_S7_
0x080106d0 0x1a8 /home/chris/mbed-torch-fusion-os/executorch/cmake-out/kernels/portable/libportable_kernels.a(op_add.cpp.obj)
.text._ZN5torch8executor27apply_binary_elementwise_fnIbbaZZZZZZZZZNS0_6native7add_outERNS0_20KernelRuntimeContextERKNS0_6TensorES7_RKNS0_6ScalarERS5_ENKUlvE_clEvENKUlvE7_clEvENKUlvE_clEvENKUlvE7_clEvENKUlvE_clEvENKUlvE1_clEvENKUlvE_clEvENKUlvE0_clEvEUlbbE_EEvRKT2_S7_S7_S7_
0x08010878 0x1ac /home/chris/mbed-torch-fusion-os/executorch/cmake-out/kernels/portable/libportable_kernels.a(op_add.cpp.obj)
...
.text._ZZZZZZN5torch8executor6native7add_outERNS0_20KernelRuntimeContextERKNS0_6TensorES6_RKNS0_6ScalarERS4_ENKUlvE_clEvENKUlvE_clEvENKUlvE_clEvENKUlvE2_clEvENKUlvE_clEv.isra.0
0x08314f38 0x1ad8 /home/chris/mbed-torch-fusion-os/executorch/cmake-out/kernels/portable/libportable_kernels.a(op_add.cpp.obj)
.text._ZZZZZZN5torch8executor6native7add_outERNS0_20KernelRuntimeContextERKNS0_6TensorES6_RKNS0_6ScalarERS4_ENKUlvE_clEvENKUlvE_clEvENKUlvE_clEvENKUlvE1_clEvENKUlvE_clEv.isra.0
0x08316a10 0x1ad8 /home/chris/mbed-torch-fusion-os/executorch/cmake-out/kernels/portable/libportable_kernels.a(op_add.cpp.obj)
.text._ZZZZZZN5torch8executor6native7add_outERNS0_20KernelRuntimeContextERKNS0_6TensorES6_RKNS0_6ScalarERS4_ENKUlvE_clEvENKUlvE_clEvENKUlvE_clEvENKUlvE0_clEvENKUlvE_clEv.isra.0
0x083184e8 0x1ad8 /home/chris/mbed-torch-fusion-os/executorch/cmake-out/kernels/portable/libportable_kernels.a(op_add.cpp.obj)
.text._ZZN5torch8executor6native7add_outERNS0_20KernelRuntimeContextERKNS0_6TensorES6_RKNS0_6ScalarERS4_ENKUlvE_clEv.isra.0
0x08319fc0 0x15c4 /home/chris/mbed-torch-fusion-os/executorch/cmake-out/kernels/portable/libportable_kernels.a(op_add.cpp.obj)
.text._ZN5torch8executor6native7add_outERNS0_20KernelRuntimeContextERKNS0_6TensorES6_RKNS0_6ScalarERS4_
0x0831b584 0x2f0 /home/chris/mbed-torch-fusion-os/executorch/cmake-out/kernels/portable/libportable_kernels.a(op_add.cpp.obj)
0x0831b584 torch::executor::native::add_out(torch::executor::KernelRuntimeContext&, torch::executor::Tensor const&, torch::executor::Tensor const&, torch::executor::Scalar const&, torch::executor::Tensor&)
.text._ZN5torch8executor33tensors_are_broadcastable_betweenENS0_8ArrayRefIlEES2_
0x0831b874 0x5c /home/chris/mbed-torch-fusion-os/executorch/cmake-out/kernels/portable/libportable_kernels.a(broadcast_util.cpp.obj)
0x0831b874 torch::executor::tensors_are_broadcastable_between(torch::executor::ArrayRef<long>, torch::executor::ArrayRef<long>)
.text._ZN5torch8executor25get_broadcast_target_sizeENS0_8ArrayRefIlEES2_PljPj
0x0831b8d0 0x110 /home/chris/mbed-torch-fusion-os/executorch/cmake-out/kernels/portable/libportable_kernels.a(broadcast_util.cpp.obj)
0x0831b8d0 torch::executor::get_broadcast_target_size(torch::executor::ArrayRef<long>, torch::executor::ArrayRef<long>, long*, unsigned int, unsigned int*)
.text._ZN5torch8executor25get_broadcast_target_sizeERKNS0_6TensorES3_PljPj
0x0831b9e0 0x38 /home/chris/mbed-torch-fusion-os/executorch/cmake-out/kernels/portable/libportable_kernels.a(broadcast_util.cpp.obj)
0x0831b9e0 torch::executor::get_broadcast_target_size(torch::executor::Tensor const&, torch::executor::Tensor const&, long*, unsigned int, unsigned int*)
.text._ZN5torch8executor17delinearize_indexEjNS0_8ArrayRefIlEEPjj
0x0831ba18 0x80 /home/chris/mbed-torch-fusion-os/executorch/cmake-out/kernels/portable/libportable_kernels.a(broadcast_util.cpp.obj)
0x0831ba18 torch::executor::delinearize_index(unsigned int, torch::executor::ArrayRef<long>, unsigned int*, unsigned int)
.text._ZN5torch8executor17delinearize_indexEjRKNS0_6TensorEPjj
0x0831ba98 0x28 /home/chris/mbed-torch-fusion-os/executorch/cmake-out/kernels/portable/libportable_kernels.a(broadcast_util.cpp.obj)
0x0831ba98 torch::executor::delinearize_index(unsigned int, torch::executor::Tensor const&, unsigned int*, unsigned int)
.text._ZN5torch8executor24linearize_access_indexesENS0_8ArrayRefIjEEiNS1_IlEES3_
0x0831bac0 0xf0 /home/chris/mbed-torch-fusion-os/executorch/cmake-out/kernels/portable/libportable_kernels.a(broadcast_util.cpp.obj)
0x0831bac0 torch::executor::linearize_access_indexes(torch::executor::ArrayRef<unsigned int>, int, torch::executor::ArrayRef<long>, torch::executor::ArrayRef<long>)
.text._ZN5torch8executor24linearize_access_indexesENS0_8ArrayRefIjEEiRKNS0_6TensorE
0x0831bbb0 0x40 /home/chris/mbed-torch-fusion-os/executorch/cmake-out/kernels/portable/libportable_kernels.a(broadcast_util.cpp.obj)
0x0831bbb0 torch::executor::linearize_access_indexes(torch::executor::ArrayRef<unsigned int>, int, torch::executor::Tensor const&)
.text._ZN5torch8executor16check_alpha_typeENS0_10ScalarTypeES1_
0x0831bbf0 0x94 /home/chris/mbed-torch-fusion-os/executorch/cmake-out/kernels/portable/libportable_kernels.a(kernel_ops_util.cpp.obj)
0x0831bbf0 torch::executor::check_alpha_type(torch::executor::ScalarType, torch::executor::ScalarType)
Most of it seems to come from op_add.cpp.obj; it seems to "cost" almost 3 MB, compared to op_softmax.cpp.obj, which only seems to cost about 47 kB.
It seems the template expansions in op_add.cpp could be the problem.
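For intuition, here is a hedged illustration (not the actual ExecuTorch source) of why a dtype-generic binary op multiplies template instantiations:

#include <cstddef>

// One instantiation is emitted per (lhs dtype, rhs dtype, out dtype) combination,
// so with N supported dtypes a binary op can approach N*N*N copies of this loop,
// while a float-only op like softmax needs just a handful.
template <typename A, typename B, typename Out>
void apply_binary_elementwise(const A* a, const B* b, Out* out, size_t n) {
  for (size_t i = 0; i < n; ++i) {
    out[i] = static_cast<Out>(a[i] + b[i]);
  }
}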
Even if this doesn't help you directly, I hope it gives some clues about where to look next.
Hi @zingo, Thank you for the help. It gives me some hints about where to look. I'll update you when I find the cause of the huge size increase.
Most of it seems to come from op_add.cpp.obj; it seems to "cost" almost 3 MB, compared to op_softmax.cpp.obj, which only seems to cost about 47 kB.
If you compile a portable op it can be pretty large, since it generates many dtype combinations. One way around that is a selective build, i.e. linking only the ops you want, for the dtypes you want.
@larryliu0820 can you show how to specify dtype in a yaml? Also @Olivia-liu can help? :)
Short update:
I updated my ExecuTorch version to v0.3.0 and linked libexecutorch_no_prim_ops.a into my target. Hence, I can now use aten::add.out without a FLASH overflow.
Besides that, it would be beneficial to know how to do the selective build since future work will use it.
I appreciate any help you can provide.
Here's the README for selective builds. Hope it helps! https://github.com/pytorch/executorch/blob/main/examples/selective_build/README.md
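If I remember correctly, the example project there exposes roughly these CMake switches (double-check against the README for your version):

# Link every portable op (large):
cmake ... -DEXECUTORCH_SELECT_ALL_OPS=ON ...
# Link only named ops (what you are doing already):
cmake ... -DEXECUTORCH_SELECT_OPS_LIST="aten::add.out" ...
# Select ops from a yaml file; this is where finer-grained selection hooks in:
cmake ... -DEXECUTORCH_SELECT_OPS_YAML=ON ...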
Hi, I just spotted this recent merge: https://github.com/pytorch/executorch/pull/6006 I'm on my phone and can't check too much right now, so I'm not sure if this is the code you end up using for add, but it could be worth checking whether it improves your case.
Hi @zingo, thank you very much! I will be on vacation the next two weeks then I'll check the improvements and let you know!
Hi @ChristophKarlHeck Is this still a problem? I wonder if we can close this issue? :)
Hi @zingo, we can close the issue :) sorry for not replying in time! Thank you
No problem. I'm a bit curious whether you figured out something extra to make it smaller; that would be good to know 😊
I didn't dig further, since the change to executorch==0.3.0 solved the problem. So, no further analysis was necessary :)
Thanks for the update.