`llvm-foreach` takes 100% cpu usage

Describe the bug

While building SYCL code with Intel oneAPI, I noticed that llvm-foreach is almost always sitting at 100% cpu usage.

top:

%Cpu(s):  8.5 us,  5.0 sy,  0.0 ni, 85.8 id,  0.1 wa,  0.0 hi,  0.6 si,  0.0 st
MiB Mem :  64023.7 total,  27107.4 free,   6165.8 used,  30750.5 buff/cache
MiB Swap:  32958.0 total,  32958.0 free,      0.0 used.  53965.6 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                                                                                                                           
  99440 fwyzard   20   0    4540   2176   2048 R  99.7   0.0   5:42.07 llvm-foreach                                                                                                                                                                                                      
 100326 fwyzard   20   0  309368 274324  48756 R  99.3   0.4   0:05.92 ocloc

ps -xf:

  98325 pts/2    S+     0:00  |   |   |                               \_ /opt/intel/oneapi/compiler/2024.1/bin/compiler/clang++ @/tmp/icpx0294253703WMgiHH/icpxargD9hFos
  99440 pts/2    R+     5:42  |   |   |                                   \_ /opt/intel/oneapi/compiler/2024.1/bin/compiler/llvm-foreach --out-ext=out --in-file-list=/tmp/icpx-ff969312fd/Activemask-tgllp-63b648.txt --in-replace=/tmp/icpx-ff969312fd/Activemask-tgllp-63b648.txt --ou
 100326 pts/2    R+     0:06  |   |   |                                       \_ /usr/bin/ocloc -output /tmp/Activemask-tgllp-e57dbd-65fea9.out -file /tmp/icpx-ff969312fd/Activemask-tgllp-63b648-0e09e1.spv -output_no_suffix -spirv_input -device tgllp -options -g -cl-opt-disable

This seems to happen for any backend. I've observed this consistently with oneAPI 2024.0 (based on LLVM 17) and 2024.2 (based on LLVM 19), running on Ubuntu Linux 22.04.

To reproduce

Build any complex program with ahead-of-time compilation for multiple backends, e.g. multiple Intel GPUs.

Environment

OS: Ubuntu Linux 22.04
Target device and vendor: any backend.
DPC++ version: Intel(R) oneAPI DPC++/C++ Compiler 2024.2.1 (2024.2.1.20240711)
Dependencies version: sycl-ls --verbose

[opencl:cpu][opencl:0] Intel(R) OpenCL, 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz OpenCL 3.0 (Build 0) [2024.18.7.0.11_160000]
[opencl:gpu][opencl:1] Intel(R) OpenCL Graphics, Intel(R) UHD Graphics OpenCL 3.0 NEO  [24.22.29735.27]
[level_zero:gpu][level_zero:0] Intel(R) Level-Zero, Intel(R) UHD Graphics 1.3 [1.3.29735]
[cuda:gpu][cuda:0] NVIDIA CUDA BACKEND, NVIDIA GeForce RTX 3050 Ti Laptop GPU 8.6 [CUDA 12.6]

Platforms: 4
Platform [#1]:
    Version  : OpenCL 3.0 LINUX
    Name     : Intel(R) OpenCL
    Vendor   : Intel(R) Corporation
    Devices  : 1
        Device [#0]:
        Type       : cpu
        Version    : OpenCL 3.0 (Build 0)
        Name       : 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz
        Vendor     : Intel(R) Corporation
        Driver     : 2024.18.7.0.11_160000
        Aspects    : cpu fp16 fp64 online_compiler online_linker queue_profiling usm_device_allocations usm_host_allocations usm_shared_allocations usm_system_allocations usm_atomic_host_allocations usm_atomic_shared_allocations atomic64 ext_oneapi_srgb ext_oneapi_native_assert ext_intel_legacy_image ext_oneapi_ballot_group ext_oneapi_fixed_size_group ext_oneapi_opportunistic_group ext_oneapi_tangle_group
        info::device::sub_group_sizes: 4 8 16 32 64
Platform [#2]:
    Version  : OpenCL 3.0 
    Name     : Intel(R) OpenCL Graphics
    Vendor   : Intel(R) Corporation
    Devices  : 1
        Device [#1]:
        Type       : gpu
        Version    : OpenCL 3.0 NEO 
        Name       : Intel(R) UHD Graphics
        Vendor     : Intel(R) Corporation
        Driver     : 24.22.29735.27
        Aspects    : gpu fp16 online_compiler online_linker queue_profiling usm_device_allocations usm_host_allocations usm_shared_allocations atomic64 ext_oneapi_srgb ext_intel_device_id ext_intel_legacy_image ext_intel_esimd ext_oneapi_ballot_group ext_oneapi_fixed_size_group ext_oneapi_opportunistic_group ext_oneapi_tangle_group
        info::device::sub_group_sizes: 8 16 32
Platform [#3]:
    Version  : 1.3
    Name     : Intel(R) Level-Zero
    Vendor   : Intel(R) Corporation
    Devices  : 1
        Device [#0]:
        Type       : gpu
        Version    : 1.3
        Name       : Intel(R) UHD Graphics
        Vendor     : Intel(R) Corporation
        Driver     : 1.3.29735
        Aspects    : gpu fp16 online_compiler online_linker queue_profiling usm_device_allocations usm_host_allocations usm_shared_allocations ext_intel_pci_address ext_intel_gpu_eu_count ext_intel_gpu_eu_simd_width ext_intel_gpu_slices ext_intel_gpu_subslices_per_slice ext_intel_gpu_eu_count_per_subslice atomic64 ext_intel_device_info_uuid ext_intel_gpu_hw_threads_per_eu ext_intel_device_id ext_intel_memory_clock_rate ext_intel_memory_bus_width ext_intel_legacy_image ext_oneapi_bindless_images ext_oneapi_bindless_images_shared_usm ext_oneapi_bindless_images_2d_usm ext_oneapi_mipmap ext_oneapi_mipmap_anisotropy ext_intel_esimd ext_oneapi_ballot_group ext_oneapi_fixed_size_group ext_oneapi_opportunistic_group ext_oneapi_tangle_group ext_oneapi_graph
        info::device::sub_group_sizes: 8 16 32
Platform [#4]:
    Version  : CUDA 12.6
    Name     : NVIDIA CUDA BACKEND
    Vendor   : NVIDIA Corporation
    Devices  : 1
        Device [#0]:
        Type       : gpu
        Version    : 8.6
        Name       : NVIDIA GeForce RTX 3050 Ti Laptop GPU
        Vendor     : NVIDIA Corporation
        Driver     : CUDA 12.6
ur_print: Images are not fully supported by the CUDA BE, their support is disabled by default. Their partial support can be activated by setting SYCL_PI_CUDA_ENABLE_IMAGE_SUPPORT environment variable at runtime.
        Aspects    : gpu fp16 fp64 online_compiler online_linker queue_profiling usm_device_allocations usm_host_allocations usm_shared_allocations usm_system_allocations ext_intel_pci_address usm_atomic_host_allocations usm_atomic_shared_allocations atomic64 ext_intel_device_info_uuid ext_oneapi_native_assert ext_oneapi_bfloat16_math_functions ext_intel_free_memory ext_intel_device_id ext_intel_memory_clock_rate ext_intel_memory_bus_width ext_oneapi_bindless_images ext_oneapi_bindless_images_shared_usm ext_oneapi_bindless_images_2d_usm ext_oneapi_interop_memory_import ext_oneapi_interop_semaphore_import ext_oneapi_mipmap ext_oneapi_mipmap_anisotropy ext_oneapi_mipmap_level_reference ext_oneapi_ballot_group ext_oneapi_fixed_size_group ext_oneapi_opportunistic_group ext_oneapi_graph ext_oneapi_cubemap ext_oneapi_cubemap_seamless_filtering
        info::device::sub_group_sizes: 32
default_selector()      : gpu, Intel(R) Level-Zero, Intel(R) UHD Graphics 1.3 [1.3.29735]
accelerator_selector()  : No device of requested type available. Please chec...
cpu_selector()          : cpu, Intel(R) OpenCL, 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz OpenCL 3.0 (Build 0) [2024.18.7.0.11_160000]
gpu_selector()          : gpu, Intel(R) Level-Zero, Intel(R) UHD Graphics 1.3 [1.3.29735]
custom_selector(gpu)    : gpu, Intel(R) Level-Zero, Intel(R) UHD Graphics 1.3 [1.3.29735]
custom_selector(cpu)    : cpu, Intel(R) OpenCL, 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz OpenCL 3.0 (Build 0) [2024.18.7.0.11_160000]
custom_selector(acc)    : No device of requested type available. Please chec...

Additional context

No response

Aug 22 '24 23:08 fwyzard

@fwyzard, the problem is the ocloc tool. llvm-foreach is just a simple launcher that runs commands from a file and waits for them to complete. You can check the logic here - it's ~200 lines of code.
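
To make the "simple launcher" description concrete, here is a minimal Python sketch of the general pattern: run one command per input, substituting the input into a template, and wait for each child to finish. It is not the llvm-foreach source; the function name `run_foreach`, the `{}` placeholder, and the sample inputs are all hypothetical and chosen only for illustration.

```python
import subprocess
import sys


def run_foreach(command_template, inputs, placeholder="{}"):
    """Run command_template once per input and return the exit codes.

    Hypothetical helper: every argument containing `placeholder` has it
    replaced by the current input before the command is launched.
    """
    exit_codes = []
    for item in inputs:
        argv = [arg.replace(placeholder, item) for arg in command_template]
        # subprocess.run() blocks until the child exits, so the launcher
        # itself uses essentially no CPU while the child is working.
        completed = subprocess.run(argv)
        exit_codes.append(completed.returncode)
    return exit_codes


if __name__ == "__main__":
    # Stand-in for a real per-file tool: a child Python process that
    # just reports which (made-up) input it was given.
    template = [sys.executable, "-c",
                "import sys; print('processing', sys.argv[1])", "{}"]
    print(run_foreach(template, ["first.spv", "second.spv"]))
```

With a launcher of this shape, any sustained CPU load should come from the child processes rather than from the launcher.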

NOTE: the ocloc tool is developed in https://github.com/intel/intel-graphics-compiler/, so I would transfer this issue there.

Aug 22 '24 23:08 bader

@bader while it would definitely be nice if ocloc were faster, the issue is that llvm-foreach itself takes 100% cpu, in addition to ocloc taking up 100% cpu (on another core).

Instead of tightly looping, would it be possible to make llvm-foreach sleep until a subprocess completes? Or, at least, sleep for something like 100ms between each check?
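
The difference between the three waiting strategies mentioned here can be sketched in Python. This does not reproduce llvm-foreach's actual code; the child command, the function names, and the 0.1 s interval are assumptions made for the example. It compares the CPU time the parent process spends while waiting for a child that merely sleeps.

```python
import subprocess
import sys
import time

# Hypothetical stand-in for a long-running tool such as ocloc.
CHILD = [sys.executable, "-c", "import time; time.sleep(0.5)"]


def wait_busy(proc):
    # Tight loop: poll() returns immediately, so the parent keeps a
    # core busy for as long as the child is alive.
    while proc.poll() is None:
        pass
    return proc.returncode


def wait_with_sleep(proc, interval=0.1):
    # Polling with a pause: the parent wakes up only every `interval`
    # seconds to check on the child.
    while proc.poll() is None:
        time.sleep(interval)
    return proc.returncode


def wait_blocking(proc):
    # Blocking wait: the OS suspends the parent until the child exits.
    return proc.wait()


def cpu_seconds_spent(waiter):
    """CPU time used by this process (children excluded) while waiting."""
    proc = subprocess.Popen(CHILD)
    start = time.process_time()
    waiter(proc)
    return time.process_time() - start


if __name__ == "__main__":
    for waiter in (wait_busy, wait_with_sleep, wait_blocking):
        print(waiter.__name__, round(cpu_seconds_spent(waiter), 3))
```

The busy loop should report CPU time close to the child's lifetime, while the other two should stay near zero; the blocking wait also reacts as soon as the child exits instead of up to one interval later.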

Aug 22 '24 23:08 fwyzard

I think we are going to remove this tool soon. We are refactoring the compilation process for offload code, and the new approach won't use this tool or a similar approach to detect task completion. @asudarsa, @maksimsab, @sarnex, FYI.

Aug 22 '24 23:08 bader

@ivorobts FYI

Aug 23 '24 07:08 fwyzard

> I think we are going to remove this tool soon. We are refactoring the compilation process for offload code, and the new approach won't use this tool or a similar approach to detect task completion. @asudarsa, @maksimsab, @sarnex, FYI.

Yes. We are in the process of adding support for the '--offload-new-driver' flag, which can be used for SYCL offloading apps. This will trigger a compilation flow that does not use the 'llvm-foreach' tool. For the 'ocloc' issue, https://github.com/intel/intel-graphics-compiler/ is a better place to report it. However, nearly 100% CPU utilization during the AOT stage is not something we expect. I will try to confirm this on my end.

Thanks for the report.

Sep 25 '24 15:09 asudarsa

Hi! There have been no updates for at least the last 60 days, though the issue has assignee(s).

@asudarsa, could you please take one of the following actions:

- provide an update if you have any
- unassign yourself if you're not looking / going to look into this issue
- mark this issue with the 'confirmed' label if you have confirmed the problem/request and our team should work on it
- close the issue if it has been resolved
- take any other suitable action.

Thanks!

Nov 25 '24 00:11 github-actions[bot]

I am looking into this now. Thanks

Feb 27 '25 16:02 asudarsa
