SIGBUS during memcpy when trying to use level_zero:gpu while opencl:gpu works
I am trying to use llama.cpp with the SYCL backend. With default settings, I get a "Bus error" (SIGBUS) while the model is loading:
$ ./bin/llama-bench -m models/phi-4-Q3_K_M.gguf
WARNING: Small BAR detected for device 0000:03:00.0
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
Bus error (core dumped) ./bin/llama-bench -m models/phi-4-Q3_K_M.gguf
That run uses the level_zero device, which is the default. When I select the OpenCL device through ONEAPI_DEVICE_SELECTOR, the same command works fine:
$ ONEAPI_DEVICE_SELECTOR=opencl:gpu ./bin/llama-bench -m models/phi-4-Q3_K_M.gguf
WARNING: Small BAR detected for device 0000:03:00.0
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama 13B Q3_K - Medium | 6.69 GiB | 14.66 B | SYCL | 99 | pp512 | 333.50 ± 20.98 |
...
I'm aware of the "Small BAR" warning; I'm running on older hardware (Asus Z170-A + Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz) with an Arc A750:
$ sycl-ls
WARNING: Small BAR detected for device 0000:03:00.0
WARNING: Small BAR detected for device 0000:03:00.0
[level_zero:gpu][level_zero:0] Intel(R) oneAPI Unified Runtime over Level-Zero, Intel(R) Arc(TM) A750 Graphics 12.55.8 [1.6.34666]
[opencl:fpga][opencl:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2024.18.12.0.05_160000]
[opencl:cpu][opencl:1] Intel(R) OpenCL, Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz OpenCL 3.0 (Build 0) [2024.18.12.0.05_160000]
[opencl:gpu][opencl:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A750 Graphics OpenCL 3.0 NEO [25.31.34666]
I'm using Arch Linux and the latest version of intel-compute-runtime I could find:
$ uname -a
Linux hostname 6.16.1-arch1-1 #1 SMP PREEMPT_DYNAMIC Fri, 15 Aug 2025 16:04:43 +0000 x86_64 GNU/Linux
$ pacman -Q intel-compute-runtime
intel-compute-runtime 25.31.34666.3-1
Some more crash details with GDB:
$ gdb --args ./bin/llama-bench -m models/phi-4-Q3_K_M.gguf
...
Thread 1 "llama-bench" received signal SIGBUS, Bus error.
(gdb) bt full
#0 0x00007ffff636e087 in ?? () from /usr/lib/libc.so.6
No symbol table info available.
#1 0x00007fffdcab128c in memcpy_s (dst=0x7ffcfaa60000, destSize=<optimized out>, src=0x5d556b0, count=<optimized out>)
at /src/arch/intel-compute-runtime/src/compute-runtime-25.31.34666.3/shared/source/helpers/string.h:71
No locals.
#2 L0::CommandListCoreFamilyImmediate<(GFXCORE_FAMILY)3079>::performCpuMemcpy (this=this@entry=0x5d5a6c0, cpuMemCopyInfo=...,
hSignalEvent=hSignalEvent@entry=0x2cd9e18, numWaitEvents=numWaitEvents@entry=0, phWaitEvents=phWaitEvents@entry=0x0)
at /src/arch/intel-compute-runtime/src/compute-runtime-25.31.34666.3/level_zero/core/source/cmdlist/cmdlist_hw_immediate.inl:1444
lockingFailed = false
srcLockPointer = <optimized out>
dstLockPointer = <optimized out>
signalEvent = 0x2cd9e10
cpuMemcpySrcPtr = 0x5d556b0
cpuMemcpyDstPtr = 0x7ffcfaa60000
#3 0x00007fffdcabf240 in L0::CommandListCoreFamilyImmediate<(GFXCORE_FAMILY)3079>::appendMemoryCopy (this=0x5d5a6c0, dstptr=0xffffd556aaa00000,
srcptr=0x5d556b0, size=20480, hSignalEvent=0x2cd9e18, numWaitEvents=0, phWaitEvents=0x0, memoryCopyParams=...)
at /src/arch/intel-compute-runtime/src/compute-runtime-25.31.34666.3/level_zero/core/source/cmdlist/cmdlist_hw_immediate.inl:683
estimatedSize = <optimized out>
hasStallindCmds = false
ret = <optimized out>
cpuMemCopyInfo = {dstPtr = 0xffffd556aaa00000, srcPtr = 0x5d556b0, size = 20480, dstAllocData = 0x5c78bc0, srcAllocData = 0x0,
dstIsImportedHostPtr = false, srcIsImportedHostPtr = false}
direction = 32767
isSplitNeeded = <optimized out>
#4 0x00007fffdc8e35f2 in L0::zeCommandListAppendMemoryCopy (hCommandList=<optimized out>, dstptr=<optimized out>, srcptr=<optimized out>, size=20480,
hSignalEvent=0x2cd9e18, numWaitEvents=<optimized out>, phWaitEvents=0x0)
at /src/arch/intel-compute-runtime/src/compute-runtime-25.31.34666.3/level_zero/api/core/ze_copy_api_entrypoints.h:32
cmdList = 0x5d5a6c0
ret = ZE_RESULT_ERROR_NOT_AVAILABLE
memoryCopyParams = {relaxedOrderingDispatch = false, forceDisableCopyOnlyInOrderSignaling = false, copyOffloadAllowed = false}
#5 0x00007fffee78237b in enqueueMemCopyHelper(ur_command_t, ur_queue_handle_legacy_t_*, void*, unsigned char, unsigned long, void const*, unsigned int, ur_event_handle_t_* const*, ur_event_handle_t_**, bool) () from /opt/intel/oneapi/compiler/2025.0/lib/libur_adapter_level_zero.so.0
No symbol table info available.
#6 0x00007fffee78bf87 in ur_queue_handle_legacy_t_::enqueueUSMMemcpy(bool, void*, void const*, unsigned long, unsigned int, ur_event_handle_t_* const*, ur_event_handle_t_**) () from /opt/intel/oneapi/compiler/2025.0/lib/libur_adapter_level_zero.so.0
No symbol table info available.
#7 0x00007fffe0cf4db7 in ur_loader::urEnqueueUSMMemcpy(ur_queue_handle_t_*, bool, void*, void const*, unsigned long, unsigned int, ur_event_handle_t_* const*, ur_event_handle_t_**) () from /opt/intel/oneapi/compiler/2025.0/lib/libur_loader.so.0
No symbol table info available.
#8 0x00007fffe0d07eff in urEnqueueUSMMemcpy () from /opt/intel/oneapi/compiler/2025.0/lib/libur_loader.so.0
No symbol table info available.
#9 0x00007fffe1c45aa9 in sycl::_V1::detail::MemoryManager::copy_usm(void const*, std::shared_ptr<sycl::_V1::detail::queue_impl>, unsigned long, void*, std::vector<ur_event_handle_t_*, std::allocator<ur_event_handle_t_*> >, ur_event_handle_t_**, std::shared_ptr<sycl::_V1::detail::event_impl> const&) ()
from /opt/intel/oneapi/compiler/2025.0/lib/libsycl.so.8
No symbol table info available.
#10 0x00007fffe1c8b83d in sycl::_V1::detail::queue_impl::memcpy(std::shared_ptr<sycl::_V1::detail::queue_impl> const&, void*, void const*, unsigned long, std::vector<sycl::_V1::event, std::allocator<sycl::_V1::event> > const&, bool, sycl::_V1::detail::code_location const&) ()
from /opt/intel/oneapi/compiler/2025.0/lib/libsycl.so.8
No symbol table info available.
#11 0x00007fffe1d36421 in sycl::_V1::queue::memcpy(void*, void const*, unsigned long, sycl::_V1::detail::code_location const&) ()
from /opt/intel/oneapi/compiler/2025.0/lib/libsycl.so.8
No symbol table info available.
#12 0x00007ffff6a37f9d in ggml_backend_sycl_buffer_set_tensor(ggml_backend_buffer*, ggml_tensor*, void const*, unsigned long, unsigned long)::{lambda()#2}::operator()() const (this=<optimized out>) at /src/llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp:399
e = <optimized out>
(gdb) list
...
1444 memcpy_s(cpuMemcpyDstPtr, cpuMemCopyInfo.size, cpuMemcpySrcPtr, cpuMemCopyInfo.size);
...
(gdb) p cpuMemCopyInfo
$7 = (const L0::CpuMemCopyInfo &) @0x7fffffffaea0: {dstPtr = 0xffffd556aaa00000, srcPtr = 0x5d556b0, size = 20480, dstAllocData = 0x5c78bc0,
srcAllocData = 0x0, dstIsImportedHostPtr = false, srcIsImportedHostPtr = false}
(gdb) p cpuMemcpyDstPtr
$8 = (void *) 0x7ffcfaa60000
(gdb) info proc mappings
Mapped address spaces:
Start Addr End Addr Size Offset Perms File
...
0x00000000004d5000 0x0000000005d6a000 0x5895000 0x0 rw-p [heap]
0x00007ffcfaa60000 0x00007ffdf9000000 0xfe5a0000 0x1e929a000 rw-s anon_inode:i915.gem
0x00007ffdf9000000 0x00007fffa59b7000 0x1ac9b7000 0x0 r--s /data/llama-models/phi-4-Q3_K_M.gguf
0x00007fffa5a00000 0x00007fffa5a3f000 0x3f000 0x0 r--p /usr/lib/libopencl-clang.so.15
...
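The GDB output above can be cross-checked with a little arithmetic. The sketch below uses only values copied from that session; the helper name in_mapping is made up for illustration. It shows that the CPU destination pointer passed to memcpy_s falls inside the anon_inode:i915.gem mapping (it is the very first byte of it), so the faulting write targets a CPU mapping of a GPU buffer rather than ordinary heap memory. That is consistent with the small BAR suspicion, but does not prove it.

```shell
# Values copied from the GDB session above.
dst=0x7ffcfaa60000        # cpuMemcpyDstPtr
size=20480                # cpuMemCopyInfo.size
map_start=0x7ffcfaa60000  # start of the anon_inode:i915.gem mapping
map_end=0x7ffdf9000000    # end of the same mapping

# in_mapping ADDR LEN START END (hypothetical helper):
# succeeds if the byte range [ADDR, ADDR+LEN) lies inside [START, END).
in_mapping() {
    [ $(( $1 >= $3 && $1 + $2 <= $4 )) -eq 1 ]
}

if in_mapping "$dst" "$size" "$map_start" "$map_end"; then
    echo "copy destination is inside the i915.gem mapping"
fi
```

The source pointer (0x5d556b0), by contrast, lies in the [heap] range shown in the same listing.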
While I understand the issue might be related to the "Small BAR" warning, I would appreciate a helpful error message rather than a SIGBUS that requires rebuilding intel-compute-runtime with debug symbols just to find out where the crash comes from. Even better would be to make level_zero work with small BAR, even at reduced performance.
What is the outcome if you run with these two environment variables set: ExperimentalCopyThroughLock=0 NEOReadDebugKeys=1?
Thank you, it works! I never came across ExperimentalCopyThroughLock=0 while looking for solutions.
Not only does level_zero now work, it is also faster than opencl:
$ ONEAPI_DEVICE_SELECTOR=opencl:gpu ./b/bin/llama-bench -m models/SmolLM3-Q4_K_M.gguf
WARNING: Small BAR detected for device 0000:03:00.0
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
| smollm3 3B Q4_K - Medium | 1.78 GiB | 3.08 B | SYCL | 99 | pp512 | 1025.75 ± 253.95 |
| smollm3 3B Q4_K - Medium | 1.78 GiB | 3.08 B | SYCL | 99 | tg128 | 8.60 ± 0.50 |
$ ExperimentalCopyThroughLock=0 NEOReadDebugKeys=1 ONEAPI_DEVICE_SELECTOR=level_zero:gpu ./b/bin/llama-bench -m models/SmolLM3-Q4_K_M.gguf
WARNING: Small BAR detected for device 0000:03:00.0
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| smollm3 3B Q4_K - Medium | 1.78 GiB | 3.08 B | SYCL | 99 | pp512 | 1429.51 ± 133.59 |
| smollm3 3B Q4_K - Medium | 1.78 GiB | 3.08 B | SYCL | 99 | tg128 | 19.06 ± 0.86 |
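For anyone hitting the same crash, a small wrapper keeps the workaround in one place. This is only a sketch: the function name run_l0 is hypothetical, and the variables are exactly the ones suggested in this thread, passed through env so they apply to a single command instead of the whole shell session.

```shell
# Hypothetical helper: run any command with the two NEO debug keys from this
# thread set, targeting the Level Zero GPU device.
run_l0() {
    env ExperimentalCopyThroughLock=0 \
        NEOReadDebugKeys=1 \
        ONEAPI_DEVICE_SELECTOR=level_zero:gpu \
        "$@"
}

# Example, using the paths from this thread:
# run_l0 ./bin/llama-bench -m models/phi-4-Q3_K_M.gguf
```

Keeping the variables out of the global environment makes it easy to drop the wrapper once a fixed runtime is installed.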
Thanks for checking. So the whole workload works while this is applied?
I can confirm it also works for the whole workload (evaluating prompts through llama-cli and llama-server), though there I have not observed speedups as large as the benchmark shows when going from opencl to level_zero.
Hi @mkottman,
Thank you for your contribution.
We investigated this issue internally and prepared a fix, which is included in the latest GitHub release. With this update, the debug flags should no longer be needed (https://github.com/intel/compute-runtime/releases/tag/25.44.36015.5).
Could you please confirm that everything is working correctly on your side? If there’s anything else you think we should cover in this issue, please let us know. Otherwise, if your workload is functioning as expected, we’d like to proceed and close the issue.
Thanks again for your help and feedback.
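To check whether an installed runtime is at least the release mentioned above, a version comparison along these lines may help. It is a sketch that assumes GNU sort with -V support; the helper name version_ge is hypothetical, and the installed version is the one reported by pacman earlier in this thread with the package release suffix (-1) stripped.

```shell
# version_ge A B (hypothetical helper): succeeds if version A >= version B.
# Relies on GNU `sort -V` for version-aware ordering.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
}

fixed=25.44.36015.5       # release named in the comment above
installed=25.31.34666.3   # from `pacman -Q intel-compute-runtime`, pkgrel stripped

if version_ge "$installed" "$fixed"; then
    echo "installed runtime should include the fix"
else
    echo "installed runtime is older than $fixed; the debug flags may still be needed"
fi
```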