compute-runtime SIGBUS during memcpy when trying to use level

I am trying to use llama.cpp with SYCL and when running with default settings I'm getting a "Bus error" (SIGBUS) when loading models:

$ ./bin/llama-bench -m models/phi-4-Q3_K_M.gguf
WARNING: Small BAR detected for device 0000:03:00.0
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
Bus error                  (core dumped) ./bin/llama-bench -m models/phi-4-Q3_K_M.gguf

That is using the level_zero device by default. When using the OpenCL version using ONEAPI_DEVICE_SELECTOR the code works fine:

$ ONEAPI_DEVICE_SELECTOR=opencl:gpu ./bin/llama-bench -m models/phi-4-Q3_K_M.gguf
WARNING: Small BAR detected for device 0000:03:00.0
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama 13B Q3_K - Medium        |   6.69 GiB |    14.66 B | SYCL       |  99 |           pp512 |       333.50 ± 20.98 |
...

I'm aware of the "small bar" warning as I'm running on older hardware (Asus Z170-A + Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz) with Arc A750:

$ sycl-ls
WARNING: Small BAR detected for device 0000:03:00.0
WARNING: Small BAR detected for device 0000:03:00.0
[level_zero:gpu][level_zero:0] Intel(R) oneAPI Unified Runtime over Level-Zero, Intel(R) Arc(TM) A750 Graphics 12.55.8 [1.6.34666]
[opencl:fpga][opencl:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2024.18.12.0.05_160000]
[opencl:cpu][opencl:1] Intel(R) OpenCL, Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz OpenCL 3.0 (Build 0) [2024.18.12.0.05_160000]
[opencl:gpu][opencl:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A750 Graphics OpenCL 3.0 NEO  [25.31.34666]

I'm using Arch Linux and the latest version of intel-compute-runtime I could find:

$ uname -a
Linux hostname 6.16.1-arch1-1 #1 SMP PREEMPT_DYNAMIC Fri, 15 Aug 2025 16:04:43 +0000 x86_64 GNU/Linux
$ pacman -Q intel-compute-runtime
intel-compute-runtime 25.31.34666.3-1

Some more crash details with GDB:

$ gdb --args ./bin/llama-bench -m models/phi-4-Q3_K_M.gguf
...
Thread 1 "llama-bench" received signal SIGBUS, Bus error.
(gdb) bt full
#0  0x00007ffff636e087 in ?? () from /usr/lib/libc.so.6
No symbol table info available.
#1  0x00007fffdcab128c in memcpy_s (dst=0x7ffcfaa60000, destSize=<optimized out>, src=0x5d556b0, count=<optimized out>)
    at /src/arch/intel-compute-runtime/src/compute-runtime-25.31.34666.3/shared/source/helpers/string.h:71
No locals.
#2  L0::CommandListCoreFamilyImmediate<(GFXCORE_FAMILY)3079>::performCpuMemcpy (this=this@entry=0x5d5a6c0, cpuMemCopyInfo=...,
    hSignalEvent=hSignalEvent@entry=0x2cd9e18, numWaitEvents=numWaitEvents@entry=0, phWaitEvents=phWaitEvents@entry=0x0)
    at /src/arch/intel-compute-runtime/src/compute-runtime-25.31.34666.3/level_zero/core/source/cmdlist/cmdlist_hw_immediate.inl:1444
        lockingFailed = false
        srcLockPointer = <optimized out>
        dstLockPointer = <optimized out>
        signalEvent = 0x2cd9e10
        cpuMemcpySrcPtr = 0x5d556b0
        cpuMemcpyDstPtr = 0x7ffcfaa60000
#3  0x00007fffdcabf240 in L0::CommandListCoreFamilyImmediate<(GFXCORE_FAMILY)3079>::appendMemoryCopy (this=0x5d5a6c0, dstptr=0xffffd556aaa00000,
    srcptr=0x5d556b0, size=20480, hSignalEvent=0x2cd9e18, numWaitEvents=0, phWaitEvents=0x0, memoryCopyParams=...)
    at /src/arch/intel-compute-runtime/src/compute-runtime-25.31.34666.3/level_zero/core/source/cmdlist/cmdlist_hw_immediate.inl:683
        estimatedSize = <optimized out>
        hasStallindCmds = false
        ret = <optimized out>
        cpuMemCopyInfo = {dstPtr = 0xffffd556aaa00000, srcPtr = 0x5d556b0, size = 20480, dstAllocData = 0x5c78bc0, srcAllocData = 0x0,
          dstIsImportedHostPtr = false, srcIsImportedHostPtr = false}
        direction = 32767
        isSplitNeeded = <optimized out>
#4  0x00007fffdc8e35f2 in L0::zeCommandListAppendMemoryCopy (hCommandList=<optimized out>, dstptr=<optimized out>, srcptr=<optimized out>, size=20480,
    hSignalEvent=0x2cd9e18, numWaitEvents=<optimized out>, phWaitEvents=0x0)
    at /src/arch/intel-compute-runtime/src/compute-runtime-25.31.34666.3/level_zero/api/core/ze_copy_api_entrypoints.h:32
        cmdList = 0x5d5a6c0
        ret = ZE_RESULT_ERROR_NOT_AVAILABLE
        memoryCopyParams = {relaxedOrderingDispatch = false, forceDisableCopyOnlyInOrderSignaling = false, copyOffloadAllowed = false}
#5  0x00007fffee78237b in enqueueMemCopyHelper(ur_command_t, ur_queue_handle_legacy_t_*, void*, unsigned char, unsigned long, void const*, unsigned int, ur_event_handle_t_* const*, ur_event_handle_t_**, bool) () from /opt/intel/oneapi/compiler/2025.0/lib/libur_adapter_level_zero.so.0
No symbol table info available.
#6  0x00007fffee78bf87 in ur_queue_handle_legacy_t_::enqueueUSMMemcpy(bool, void*, void const*, unsigned long, unsigned int, ur_event_handle_t_* const*, ur_event_handle_t_**) () from /opt/intel/oneapi/compiler/2025.0/lib/libur_adapter_level_zero.so.0
No symbol table info available.
#7  0x00007fffe0cf4db7 in ur_loader::urEnqueueUSMMemcpy(ur_queue_handle_t_*, bool, void*, void const*, unsigned long, unsigned int, ur_event_handle_t_* const*, ur_event_handle_t_**) () from /opt/intel/oneapi/compiler/2025.0/lib/libur_loader.so.0
No symbol table info available.
#8  0x00007fffe0d07eff in urEnqueueUSMMemcpy () from /opt/intel/oneapi/compiler/2025.0/lib/libur_loader.so.0
No symbol table info available.
#9  0x00007fffe1c45aa9 in sycl::_V1::detail::MemoryManager::copy_usm(void const*, std::shared_ptr<sycl::_V1::detail::queue_impl>, unsigned long, void*, std::vector<ur_event_handle_t_*, std::allocator<ur_event_handle_t_*> >, ur_event_handle_t_**, std::shared_ptr<sycl::_V1::detail::event_impl> const&) ()
   from /opt/intel/oneapi/compiler/2025.0/lib/libsycl.so.8
No symbol table info available.
#10 0x00007fffe1c8b83d in sycl::_V1::detail::queue_impl::memcpy(std::shared_ptr<sycl::_V1::detail::queue_impl> const&, void*, void const*, unsigned long, std::vector<sycl::_V1::event, std::allocator<sycl::_V1::event> > const&, bool, sycl::_V1::detail::code_location const&) ()
   from /opt/intel/oneapi/compiler/2025.0/lib/libsycl.so.8
No symbol table info available.
#11 0x00007fffe1d36421 in sycl::_V1::queue::memcpy(void*, void const*, unsigned long, sycl::_V1::detail::code_location const&) ()
   from /opt/intel/oneapi/compiler/2025.0/lib/libsycl.so.8
No symbol table info available.
#12 0x00007ffff6a37f9d in ggml_backend_sycl_buffer_set_tensor(ggml_backend_buffer*, ggml_tensor*, void const*, unsigned long, unsigned long)::{lambda()#2}::operator()() const (this=<optimized out>) at /src/llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp:399
        e = <optimized out>
(gdb) list
...
1444        memcpy_s(cpuMemcpyDstPtr, cpuMemCopyInfo.size, cpuMemcpySrcPtr, cpuMemCopyInfo.size);
...
(gdb) p cpuMemCopyInfo
$7 = (const L0::CpuMemCopyInfo &) @0x7fffffffaea0: {dstPtr = 0xffffd556aaa00000, srcPtr = 0x5d556b0, size = 20480, dstAllocData = 0x5c78bc0,
  srcAllocData = 0x0, dstIsImportedHostPtr = false, srcIsImportedHostPtr = false}
(gdb) p cpuMemcpyDstPtr
$8 = (void *) 0x7ffcfaa60000
(gdb) info proc mappings
Mapped address spaces:

Start Addr         End Addr           Size               Offset             Perms File
...
0x00000000004d5000 0x0000000005d6a000 0x5895000          0x0                rw-p  [heap]
0x00007ffcfaa60000 0x00007ffdf9000000 0xfe5a0000         0x1e929a000        rw-s  anon_inode:i915.gem
0x00007ffdf9000000 0x00007fffa59b7000 0x1ac9b7000        0x0                r--s  /data/llama-models/phi-4-Q3_K_M.gguf
0x00007fffa5a00000 0x00007fffa5a3f000 0x3f000            0x0                r--p  /usr/lib/libopencl-clang.so.15
...

While I understand the issue might be due to the "small BAR" error, I would appreciate a helpful error message rather than a SIGBUS that requires rebuilding the intel-compute-runtime with debug symbols to understand where the issue is coming from. Even better - make level_zero work with small BAR, even if with reduced performance.

Aug 22 '25 08:08 mkottman

What is the outcome if you run with those 2 env variables: ExperimentalCopyThroughLock=0 NEOReadDebugKeys=1

Aug 22 '25 08:08 MichalMrozek

Thank you, it works! I never came across ExperimentalCopyThroughLock=0 while looking for solutions.

Not only level_zero now works, it's also faster!

$ ONEAPI_DEVICE_SELECTOR=opencl:gpu ./b/bin/llama-bench -m models/SmolLM3-Q4_K_M.gguf
WARNING: Small BAR detected for device 0000:03:00.0
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
| smollm3 3B Q4_K - Medium       |   1.78 GiB |     3.08 B | SYCL       |  99 |           pp512 |     1025.75 ± 253.95 |
| smollm3 3B Q4_K - Medium       |   1.78 GiB |     3.08 B | SYCL       |  99 |           tg128 |          8.60 ± 0.50 |

$ ExperimentalCopyThroughLock=0 NEOReadDebugKeys=1 ONEAPI_DEVICE_SELECTOR=level_zero:gpu ./b/bin/llama-bench -m models/SmolLM3-Q4_K_M.gguf
WARNING: Small BAR detected for device 0000:03:00.0
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| smollm3 3B Q4_K - Medium       |   1.78 GiB |     3.08 B | SYCL       |  99 |           pp512 |     1429.51 ± 133.59 |
| smollm3 3B Q4_K - Medium       |   1.78 GiB |     3.08 B | SYCL       |  99 |           tg128 |         19.06 ± 0.86 |

Aug 22 '25 09:08 mkottman

Thanks for checking, so whole workload works while this is applied ?

Aug 22 '25 09:08 MichalMrozek

I can confirm it also works for whole workload (evaluating prompts through llama-cli and llama-server), though I have not been able to observe such optimistic speedups going from opencl to level_zero in the benchmark.

Aug 22 '25 12:08 mkottman

Hi @mkottman,

Thank you for your contribution.

We have further processed this issue internally and have prepared a fix that is already merged in the latest GitHub release. The debug flags should no longer be needed with this update (https://github.com/intel/compute-runtime/releases/tag/25.44.36015.5).

Could you please confirm that everything is working correctly on your side? If there’s anything else you think we should cover in this issue, please let us know. Otherwise, if your workload is functioning as expected, we’d like to proceed and close the issue.

Thanks again for your help and feedback.

Nov 27 '25 05:11 kgibala

SIGBUS during memcpy when trying to use level_zero:gpu while opencl:gpu works