
GPU Runner crash in Ollama when offloading multiple layers

Open · pauleseifert opened this issue 1 year ago · 6 comments

Hi,

I experience crashes of the GPU runner when offloading multiple layers to the GPU.

time=2024-12-09T00:58:03.646+08:00 level=INFO source=server.go:629 msg="waiting for server to become available" status="llm server not responding"
time=2024-12-09T00:58:04.348+08:00 level=ERROR source=sched.go:456 msg="error loading llama server" error="llama runner process has terminated: signal: bus error (core dumped)"
[GIN] 2024/12/09 - 00:58:04 | 500 |  1.520528721s |      172.16.6.3 | POST     "/api/chat"
time=2024-12-09T00:58:04.348+08:00 level=DEBUG source=sched.go:459 msg="triggering expiration for failed load" model=/root/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff

It seems to work with a single layer, and the error message is not very helpful. The GPU is small (4 GB A310), but so is the model (Llama 3.2 3B, 3.21 B params, 1.87 GiB model size), so VRAM shouldn't be the problem.
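
A back-of-envelope check against the memory figures reported in the log further down (a rough estimate, not a measurement):

# Full-offload VRAM estimate, values in MiB taken from the "offload to cpu" log line below:
# SYCL0 weight buffer (1918) + KV cache (224) + full compute graph (124)
echo $(( 1918 + 224 + 124 ))   # 2266 MiB, roughly the logged 2.3 GiB, well under the A310's 4096 MiB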

I use docker on Debian on kernel 6.6.44 with the following docker compose:

  ipex-llm:
    image: intelanalytics/ipex-llm-inference-cpp-xpu:latest
    container_name: ollama
    restart: unless-stopped
    networks:
       - backend
    command: >
      /bin/bash -c "
        sycl-ls &&
        source ipex-llm-init --gpu --device Arc &&

        bash ./scripts/start-ollama.sh && # run the scripts
        kill $(pgrep -f ollama) && # kill background ollama
        /llm/ollama/ollama serve # run foreground ollama
      "
    devices:
      - /dev/dri
    volumes:
      - /dev/dri:/dev/dri
      - /mnt/fast_storage/docker/ollama:/root/.ollama
    environment:
      DEVICE: Arc
      NEOReadDebugKeys: 1
      OverrideGpuAddressSpace: 48
      ZES_ENABLE_SYSMAN: 1
      OLLAMA_DEBUG: 1
      #OLLAMA_INTEL_GPU: 1
      OLLAMA_NUM_PARALLEL: 1
      OLLAMA_HOST: 0.0.0.0
      OLLAMA_NUM_GPU: 999 # layers to offload -> this is the problem 
      SYCL_CACHE_PERSISTENT: 1
      SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS: 1
      ONEAPI_DEVICE_SELECTOR: level_zero=gpu:0 
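
To narrow down how many offloaded layers actually trigger the crash without editing the compose file, the layer count can also be bisected per request, since Ollama accepts a num_gpu option in the request body (a sketch; the model tag is a placeholder for the one in the logs):

# Increase the offloaded layer count per request until the runner crashes.
for n in 1 2 4 8 16 28; do
  echo "--- num_gpu=$n ---"
  curl -s http://localhost:11434/api/generate \
    -d "{\"model\": \"llama3.2:3b\", \"prompt\": \"hi\", \"stream\": false, \"options\": {\"num_gpu\": $n}}" \
    | head -c 200
  echo
done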

Any ideas for further debugging? Full logs below.

Warning: ONEAPI_DEVICE_SELECTOR environment variable is set to level_zero=gpu:0.
To see the correct device id, please unset ONEAPI_DEVICE_SELECTOR.
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A310 LP Graphics 1.6 [1.3.31294]
found oneapi in /opt/intel/oneapi/setvars.sh
 
:: initializing oneAPI environment ...
   bash: BASH_VERSION = 5.1.16(1)-release
   args: Using "$@" for setvars.sh arguments: --force
:: advisor -- latest
:: ccl -- latest
:: compiler -- latest
:: dal -- latest
:: debugger -- latest
:: dev-utilities -- latest
:: dnnl -- latest
:: dpcpp-ct -- latest
:: dpl -- latest
:: ipp -- latest
:: ippcp -- latest
:: mkl -- latest
:: mpi -- latest
:: tbb -- latest
:: vtune -- latest
:: oneAPI environment initialized ::
 
/usr/local/lib/python3.11/dist-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
/usr/local/lib/python3.11/dist-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
/usr/local/lib/python3.11/dist-packages/transformers/utils/generic.py:309: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
/usr/local/lib/python3.11/dist-packages/transformers/utils/generic.py:309: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
+++++ Env Variables +++++
    ENABLE_IOMP     = 1
    ENABLE_GPU      = 1
    ENABLE_JEMALLOC = 0
    ENABLE_TCMALLOC = 0
    LIB_DIR    = /usr/local/lib
    BIN_DIR    = bin64
    LLM_DIR    = /usr/local/lib/python3.11/dist-packages/ipex_llm
    LD_PRELOAD             = 
    OMP_NUM_THREADS        = 
    MALLOC_CONF            = 
    USE_XETLA              = OFF
    ENABLE_SDP_FUSION      = 
    SYCL_CACHE_PERSISTENT  = 1
    BIGDL_LLM_XMX_DISABLED = 
    SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS = 1
+++++++++++++++++++++++++
2024/12/09 00:57:46 routes.go:1125: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:true OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]"
time=2024-12-09T00:57:46.256+08:00 level=INFO source=images.go:753 msg="total blobs: 42"
time=2024-12-09T00:57:46.257+08:00 level=INFO source=images.go:760 msg="total unused blobs removed: 0"
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.
[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
 - using env:	export GIN_MODE=release
 - using code:	gin.SetMode(gin.ReleaseMode)
[GIN-debug] POST   /api/pull                 --> github.com/ollama/ollama/server.(*Server).PullModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/generate             --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (5 handlers)
[GIN-debug] POST   /api/chat                 --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (5 handlers)
[GIN-debug] POST   /api/embed                --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (5 handlers)
[GIN-debug] POST   /api/embeddings           --> github.com/ollama/ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers)
[GIN-debug] POST   /api/create               --> github.com/ollama/ollama/server.(*Server).CreateModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/push                 --> github.com/ollama/ollama/server.(*Server).PushModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/copy                 --> github.com/ollama/ollama/server.(*Server).CopyModelHandler-fm (5 handlers)
[GIN-debug] DELETE /api/delete               --> github.com/ollama/ollama/server.(*Server).DeleteModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/show                 --> github.com/ollama/ollama/server.(*Server).ShowModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/blobs/:digest        --> github.com/ollama/ollama/server.(*Server).CreateBlobHandler-fm (5 handlers)
[GIN-debug] HEAD   /api/blobs/:digest        --> github.com/ollama/ollama/server.(*Server).HeadBlobHandler-fm (5 handlers)
[GIN-debug] GET    /api/ps                   --> github.com/ollama/ollama/server.(*Server).ProcessHandler-fm (5 handlers)
[GIN-debug] POST   /v1/chat/completions      --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers)
[GIN-debug] POST   /v1/completions           --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (6 handlers)
[GIN-debug] POST   /v1/embeddings            --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (6 handlers)
[GIN-debug] GET    /v1/models                --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (6 handlers)
time=2024-12-09T00:57:46.257+08:00 level=INFO source=routes.go:1172 msg="Listening on [::]:11434 (version 0.3.6-ipexllm-20241204)"
[GIN-debug] GET    /v1/models/:model         --> github.com/ollama/ollama/server.(*Server).ShowModelHandler-fm (6 handlers)
[GIN-debug] GET    /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] GET    /api/tags                 --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers)
[GIN-debug] GET    /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] HEAD   /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] HEAD   /api/tags                 --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers)
[GIN-debug] HEAD   /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
time=2024-12-09T00:57:46.257+08:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama2506652849/runners
time=2024-12-09T00:57:46.258+08:00 level=DEBUG source=payload.go:182 msg=extracting variant=cpu file=build/linux/x86_64/cpu/bin/libggml.so.gz
time=2024-12-09T00:57:46.258+08:00 level=DEBUG source=payload.go:182 msg=extracting variant=cpu file=build/linux/x86_64/cpu/bin/libllama.so.gz
time=2024-12-09T00:57:46.258+08:00 level=DEBUG source=payload.go:182 msg=extracting variant=cpu file=build/linux/x86_64/cpu/bin/ollama_llama_server.gz
time=2024-12-09T00:57:46.258+08:00 level=DEBUG source=payload.go:182 msg=extracting variant=cpu_avx file=build/linux/x86_64/cpu_avx/bin/libggml.so.gz
time=2024-12-09T00:57:46.258+08:00 level=DEBUG source=payload.go:182 msg=extracting variant=cpu_avx file=build/linux/x86_64/cpu_avx/bin/libllama.so.gz
time=2024-12-09T00:57:46.258+08:00 level=DEBUG source=payload.go:182 msg=extracting variant=cpu_avx file=build/linux/x86_64/cpu_avx/bin/ollama_llama_server.gz
time=2024-12-09T00:57:46.258+08:00 level=DEBUG source=payload.go:182 msg=extracting variant=cpu_avx2 file=build/linux/x86_64/cpu_avx2/bin/libggml.so.gz
time=2024-12-09T00:57:46.258+08:00 level=DEBUG source=payload.go:182 msg=extracting variant=cpu_avx2 file=build/linux/x86_64/cpu_avx2/bin/libllama.so.gz
time=2024-12-09T00:57:46.258+08:00 level=DEBUG source=payload.go:182 msg=extracting variant=cpu_avx2 file=build/linux/x86_64/cpu_avx2/bin/ollama_llama_server.gz
time=2024-12-09T00:57:46.395+08:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama2506652849/runners/cpu/ollama_llama_server
time=2024-12-09T00:57:46.395+08:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama2506652849/runners/cpu_avx/ollama_llama_server
time=2024-12-09T00:57:46.395+08:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama2506652849/runners/cpu_avx2/ollama_llama_server
time=2024-12-09T00:57:46.395+08:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2]"
time=2024-12-09T00:57:46.395+08:00 level=DEBUG source=payload.go:45 msg="Override detection logic by setting OLLAMA_LLM_LIBRARY"
time=2024-12-09T00:57:46.395+08:00 level=DEBUG source=sched.go:105 msg="starting llm scheduler"
[GIN] 2024/12/09 - 00:57:58 | 200 |    1.402222ms |      172.16.6.3 | GET      "/api/tags"
[GIN] 2024/12/09 - 00:57:58 | 200 |      64.402µs |      172.16.6.3 | GET      "/api/version"
[GIN] 2024/12/09 - 00:58:02 | 200 |    1.948256ms |      172.16.6.3 | GET      "/api/tags"
time=2024-12-09T00:58:02.875+08:00 level=INFO source=gpu.go:168 msg="looking for compatible GPUs"
time=2024-12-09T00:58:02.875+08:00 level=WARN source=gpu.go:560 msg="unable to locate gpu dependency libraries"
time=2024-12-09T00:58:02.875+08:00 level=DEBUG source=gpu.go:79 msg="searching for GPU discovery libraries for NVIDIA"
time=2024-12-09T00:58:02.875+08:00 level=DEBUG source=gpu.go:382 msg="Searching for GPU library" name=libcuda.so*
time=2024-12-09T00:58:02.876+08:00 level=WARN source=gpu.go:560 msg="unable to locate gpu dependency libraries"
time=2024-12-09T00:58:02.876+08:00 level=DEBUG source=gpu.go:405 msg="gpu library search" globs="[libcuda.so* /opt/intel/oneapi/tbb/2021.11/lib/intel64/gcc4.8/libcuda.so* /opt/intel/oneapi/mpi/2021.11/opt/mpi/libfabric/lib/libcuda.so* /opt/intel/oneapi/mpi/2021.11/lib/libcuda.so* /opt/intel/oneapi/mkl/2024.0/lib/libcuda.so* /opt/intel/oneapi/ippcp/2021.9/lib/libcuda.so* /opt/intel/oneapi/ipp/2021.10/lib/libcuda.so* /opt/intel/oneapi/dpl/2022.3/lib/libcuda.so* /opt/intel/oneapi/dnnl/2024.0/lib/libcuda.so* /opt/intel/oneapi/debugger/2024.0/opt/debugger/lib/libcuda.so* /opt/intel/oneapi/dal/2024.0/lib/libcuda.so* /opt/intel/oneapi/compiler/2024.0/opt/oclfpga/host/linux64/lib/libcuda.so* /opt/intel/oneapi/compiler/2024.0/opt/compiler/lib/libcuda.so* /opt/intel/oneapi/compiler/2024.0/lib/libcuda.so* /opt/intel/oneapi/ccl/2021.11/lib/libcuda.so* /opt/intel/oneapi/tbb/2021.11/lib/intel64/gcc4.8/libcuda.so* /opt/intel/oneapi/mpi/2021.11/opt/mpi/libfabric/lib/libcuda.so* /opt/intel/oneapi/mpi/2021.11/lib/libcuda.so* /opt/intel/oneapi/mkl/2024.0/lib/libcuda.so* /opt/intel/oneapi/ippcp/2021.9/lib/libcuda.so* /opt/intel/oneapi/ipp/2021.10/lib/libcuda.so* /opt/intel/oneapi/dpl/2022.3/lib/libcuda.so* /opt/intel/oneapi/dnnl/2024.0/lib/libcuda.so* /opt/intel/oneapi/debugger/2024.0/opt/debugger/lib/libcuda.so* /opt/intel/oneapi/dal/2024.0/lib/libcuda.so* /opt/intel/oneapi/compiler/2024.0/opt/oclfpga/host/linux64/lib/libcuda.so* /opt/intel/oneapi/compiler/2024.0/opt/compiler/lib/libcuda.so* /opt/intel/oneapi/compiler/2024.0/lib/libcuda.so* /opt/intel/oneapi/ccl/2021.11/lib/libcuda.so* /usr/local/cuda*/targets/*/lib/libcuda.so* /usr/lib/*-linux-gnu/nvidia/current/libcuda.so* /usr/lib/*-linux-gnu/libcuda.so* /usr/lib/wsl/lib/libcuda.so* /usr/lib/wsl/drivers/*/libcuda.so* /opt/cuda/lib*/libcuda.so* /usr/local/cuda/lib*/libcuda.so* /usr/lib*/libcuda.so* /usr/local/lib*/libcuda.so*]"
time=2024-12-09T00:58:02.880+08:00 level=DEBUG source=gpu.go:439 msg="discovered GPU libraries" paths=[]
time=2024-12-09T00:58:02.880+08:00 level=DEBUG source=gpu.go:382 msg="Searching for GPU library" name=libcudart.so*
time=2024-12-09T00:58:02.880+08:00 level=WARN source=gpu.go:560 msg="unable to locate gpu dependency libraries"
time=2024-12-09T00:58:02.880+08:00 level=DEBUG source=gpu.go:405 msg="gpu library search" globs="[libcudart.so* /opt/intel/oneapi/tbb/2021.11/lib/intel64/gcc4.8/libcudart.so* /opt/intel/oneapi/mpi/2021.11/opt/mpi/libfabric/lib/libcudart.so* /opt/intel/oneapi/mpi/2021.11/lib/libcudart.so* /opt/intel/oneapi/mkl/2024.0/lib/libcudart.so* /opt/intel/oneapi/ippcp/2021.9/lib/libcudart.so* /opt/intel/oneapi/ipp/2021.10/lib/libcudart.so* /opt/intel/oneapi/dpl/2022.3/lib/libcudart.so* /opt/intel/oneapi/dnnl/2024.0/lib/libcudart.so* /opt/intel/oneapi/debugger/2024.0/opt/debugger/lib/libcudart.so* /opt/intel/oneapi/dal/2024.0/lib/libcudart.so* /opt/intel/oneapi/compiler/2024.0/opt/oclfpga/host/linux64/lib/libcudart.so* /opt/intel/oneapi/compiler/2024.0/opt/compiler/lib/libcudart.so* /opt/intel/oneapi/compiler/2024.0/lib/libcudart.so* /opt/intel/oneapi/ccl/2021.11/lib/libcudart.so* /opt/intel/oneapi/tbb/2021.11/lib/intel64/gcc4.8/libcudart.so* /opt/intel/oneapi/mpi/2021.11/opt/mpi/libfabric/lib/libcudart.so* /opt/intel/oneapi/mpi/2021.11/lib/libcudart.so* /opt/intel/oneapi/mkl/2024.0/lib/libcudart.so* /opt/intel/oneapi/ippcp/2021.9/lib/libcudart.so* /opt/intel/oneapi/ipp/2021.10/lib/libcudart.so* /opt/intel/oneapi/dpl/2022.3/lib/libcudart.so* /opt/intel/oneapi/dnnl/2024.0/lib/libcudart.so* /opt/intel/oneapi/debugger/2024.0/opt/debugger/lib/libcudart.so* /opt/intel/oneapi/dal/2024.0/lib/libcudart.so* /opt/intel/oneapi/compiler/2024.0/opt/oclfpga/host/linux64/lib/libcudart.so* /opt/intel/oneapi/compiler/2024.0/opt/compiler/lib/libcudart.so* /opt/intel/oneapi/compiler/2024.0/lib/libcudart.so* /opt/intel/oneapi/ccl/2021.11/lib/libcudart.so* /tmp/ollama2506652849/runners/cuda*/libcudart.so* /usr/local/cuda/lib64/libcudart.so* /usr/lib/x86_64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/x86_64-linux-gnu/libcudart.so* /usr/lib/wsl/lib/libcudart.so* /usr/lib/wsl/drivers/*/libcudart.so* /opt/cuda/lib64/libcudart.so* /usr/local/cuda*/targets/aarch64-linux/lib/libcudart.so* /usr/lib/aarch64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/aarch64-linux-gnu/libcudart.so* /usr/local/cuda/lib*/libcudart.so* /usr/lib*/libcudart.so* /usr/local/lib*/libcudart.so*]"
time=2024-12-09T00:58:02.882+08:00 level=DEBUG source=gpu.go:439 msg="discovered GPU libraries" paths=[]
time=2024-12-09T00:58:02.882+08:00 level=DEBUG source=amd_linux.go:371 msg="amdgpu driver not detected /sys/module/amdgpu"
time=2024-12-09T00:58:02.882+08:00 level=INFO source=gpu.go:280 msg="no compatible GPUs were discovered"
time=2024-12-09T00:58:02.882+08:00 level=DEBUG source=sched.go:181 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=0x83a520 gpu_count=1
time=2024-12-09T00:58:02.941+08:00 level=DEBUG source=sched.go:211 msg="cpu mode with first model, loading"
time=2024-12-09T00:58:02.941+08:00 level=DEBUG source=server.go:101 msg="system memory" total="62.7 GiB" free="23.5 GiB" free_swap="0 B"
time=2024-12-09T00:58:02.941+08:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama2506652849/runners/cpu/ollama_llama_server
time=2024-12-09T00:58:02.941+08:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama2506652849/runners/cpu_avx/ollama_llama_server
time=2024-12-09T00:58:02.941+08:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama2506652849/runners/cpu_avx2/ollama_llama_server
time=2024-12-09T00:58:02.941+08:00 level=DEBUG source=memory.go:101 msg=evaluating library=cpu gpu_count=1 available="[23.5 GiB]"
time=2024-12-09T00:58:02.941+08:00 level=INFO source=memory.go:309 msg="offload to cpu" layers.requested=-1 layers.model=29 layers.offload=0 layers.split="" memory.available="[23.5 GiB]" memory.required.full="2.3 GiB" memory.required.partial="0 B" memory.required.kv="224.0 MiB" memory.required.allocations="[2.3 GiB]" memory.weights.total="1.8 GiB" memory.weights.repeating="1.5 GiB" memory.weights.nonrepeating="308.2 MiB" memory.graph.full="124.0 MiB" memory.graph.partial="570.7 MiB"
time=2024-12-09T00:58:02.942+08:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama2506652849/runners/cpu/ollama_llama_server
time=2024-12-09T00:58:02.942+08:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama2506652849/runners/cpu_avx/ollama_llama_server
time=2024-12-09T00:58:02.942+08:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama2506652849/runners/cpu_avx2/ollama_llama_server
time=2024-12-09T00:58:02.943+08:00 level=DEBUG source=gpu.go:531 msg="no filter required for library cpu"
time=2024-12-09T00:58:02.943+08:00 level=INFO source=server.go:395 msg="starting llama server" cmd="/tmp/ollama2506652849/runners/cpu_avx2/ollama_llama_server --model /root/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 999 --verbose --no-mmap --parallel 1 --port 41009"
time=2024-12-09T00:58:02.943+08:00 level=DEBUG source=server.go:412 msg=subprocess environment="[LD_LIBRARY_PATH=/tmp/ollama2506652849/runners/cpu_avx2:/opt/intel/oneapi/tbb/2021.11/env/../lib/intel64/gcc4.8:/opt/intel/oneapi/mpi/2021.11/opt/mpi/libfabric/lib:/opt/intel/oneapi/mpi/2021.11/lib:/opt/intel/oneapi/mkl/2024.0/lib:/opt/intel/oneapi/ippcp/2021.9/lib/:/opt/intel/oneapi/ipp/2021.10/lib:/opt/intel/oneapi/dpl/2022.3/lib:/opt/intel/oneapi/dnnl/2024.0/lib:/opt/intel/oneapi/debugger/2024.0/opt/debugger/lib:/opt/intel/oneapi/dal/2024.0/lib:/opt/intel/oneapi/compiler/2024.0/opt/oclfpga/host/linux64/lib:/opt/intel/oneapi/compiler/2024.0/opt/compiler/lib:/opt/intel/oneapi/compiler/2024.0/lib:/opt/intel/oneapi/ccl/2021.11/lib/:/opt/intel/oneapi/tbb/2021.11/env/../lib/intel64/gcc4.8:/opt/intel/oneapi/mpi/2021.11/opt/mpi/libfabric/lib:/opt/intel/oneapi/mpi/2021.11/lib:/opt/intel/oneapi/mkl/2024.0/lib:/opt/intel/oneapi/ippcp/2021.9/lib/:/opt/intel/oneapi/ipp/2021.10/lib:/opt/intel/oneapi/dpl/2022.3/lib:/opt/intel/oneapi/dnnl/2024.0/lib:/opt/intel/oneapi/debugger/2024.0/opt/debugger/lib:/opt/intel/oneapi/dal/2024.0/lib:/opt/intel/oneapi/compiler/2024.0/opt/oclfpga/host/linux64/lib:/opt/intel/oneapi/compiler/2024.0/opt/compiler/lib:/opt/intel/oneapi/compiler/2024.0/lib:/opt/intel/oneapi/ccl/2021.11/lib/ PATH=/opt/intel/oneapi/vtune/2024.0/bin64:/opt/intel/oneapi/mpi/2021.11/opt/mpi/libfabric/bin:/opt/intel/oneapi/mpi/2021.11/bin:/opt/intel/oneapi/mkl/2024.0/bin/:/opt/intel/oneapi/dpcpp-ct/2024.0/bin:/opt/intel/oneapi/dev-utilities/2024.0/bin:/opt/intel/oneapi/debugger/2024.0/opt/debugger/bin:/opt/intel/oneapi/compiler/2024.0/opt/oclfpga/bin:/opt/intel/oneapi/compiler/2024.0/bin:/opt/intel/oneapi/advisor/2024.0/bin64:/opt/intel/oneapi/vtune/2024.0/bin64:/opt/intel/oneapi/mpi/2021.11/opt/mpi/libfabric/bin:/opt/intel/oneapi/mpi/2021.11/bin:/opt/intel/oneapi/mkl/2024.0/bin/:/opt/intel/oneapi/dpcpp-ct/2024.0/bin:/opt/intel/oneapi/dev-utilities/2024.0/bin:/opt/intel/oneapi/debugger/2024.0/opt/debugger/bin:/opt/intel/oneapi/compiler/2024.0/opt/oclfpga/bin:/opt/intel/oneapi/compiler/2024.0/bin:/opt/intel/oneapi/advisor/2024.0/bin64:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin]"
time=2024-12-09T00:58:02.944+08:00 level=INFO source=sched.go:450 msg="loaded runners" count=1
time=2024-12-09T00:58:02.944+08:00 level=INFO source=server.go:595 msg="waiting for llama runner to start responding"
time=2024-12-09T00:58:02.944+08:00 level=INFO source=server.go:629 msg="waiting for server to become available" status="llm server error"
INFO [main] build info | build=1 commit="f711d1d" tid="140603310369792" timestamp=1733677082
INFO [main] system info | n_threads=6 n_threads_batch=-1 system_info="AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="140603310369792" timestamp=1733677082 total_threads=12
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="11" port="41009" tid="140603310369792" timestamp=1733677082
llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from /root/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 3B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Llama-3.2
llama_model_loader: - kv   5:                         general.size_label str              = 3B
llama_model_loader: - kv   6:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   7:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   8:                          llama.block_count u32              = 28
llama_model_loader: - kv   9:                       llama.context_length u32              = 131072
llama_model_loader: - kv  10:                     llama.embedding_length u32              = 3072
llama_model_loader: - kv  11:                  llama.feed_forward_length u32              = 8192
llama_model_loader: - kv  12:                 llama.attention.head_count u32              = 24
llama_model_loader: - kv  13:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  14:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  15:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  16:                 llama.attention.key_length u32              = 128
llama_model_loader: - kv  17:               llama.attention.value_length u32              = 128
llama_model_loader: - kv  18:                          general.file_type u32              = 15
llama_model_loader: - kv  19:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  20:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  29:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   58 tensors
llama_model_loader: - type q4_K:  168 tensors
llama_model_loader: - type q6_K:   29 tensors
time=2024-12-09T00:58:03.195+08:00 level=INFO source=server.go:629 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 3072
llm_load_print_meta: n_layer          = 28
llm_load_print_meta: n_head           = 24
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 3
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 3.21 B
llm_load_print_meta: model size       = 1.87 GiB (5.01 BPW) 
llm_load_print_meta: general.name     = Llama 3.2 3B Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token        = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token        = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_sycl_init: GGML_SYCL_FORCE_MMQ:   no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 1 SYCL devices:
llm_load_tensors: ggml ctx size =    0.24 MiB
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors:      SYCL0 buffer size =  1918.36 MiB
llm_load_tensors:  SYCL_Host buffer size =   308.23 MiB
time=2024-12-09T00:58:03.646+08:00 level=INFO source=server.go:629 msg="waiting for server to become available" status="llm server not responding"
time=2024-12-09T00:58:04.348+08:00 level=ERROR source=sched.go:456 msg="error loading llama server" error="llama runner process has terminated: signal: bus error (core dumped)"
[GIN] 2024/12/09 - 00:58:04 | 500 |  1.520528721s |      172.16.6.3 | POST     "/api/chat"
time=2024-12-09T00:58:04.348+08:00 level=DEBUG source=sched.go:459 msg="triggering expiration for failed load" model=/root/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff
time=2024-12-09T00:58:04.348+08:00 level=DEBUG source=sched.go:360 msg="runner expired event received" modelPath=/root/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff
time=2024-12-09T00:58:04.348+08:00 level=DEBUG source=sched.go:376 msg="got lock to unload" modelPath=/root/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff
time=2024-12-09T00:58:04.348+08:00 level=DEBUG source=server.go:1052 msg="stopping llama server"
time=2024-12-09T00:58:04.348+08:00 level=DEBUG source=sched.go:381 msg="runner released" modelPath=/root/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff
time=2024-12-09T00:58:04.348+08:00 level=DEBUG source=sched.go:385 msg="sending an unloaded event" modelPath=/root/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff
time=2024-12-09T00:58:04.348+08:00 level=DEBUG source=sched.go:308 msg="ignoring unload event with no pending requests"

pauleseifert (Dec 08 '24 17:12)

Hi @pauleseifert. I think this is likely an OOM issue; you could try setting OLLAMA_PARALLEL=1 before you start ollama serve to reduce memory usage.
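
In this docker setup that would mean exporting the variable in the container command before launching the foreground server, for example (a sketch based on the suggestion above and the paths from the compose file):

# inside the container command, before the foreground server starts
export OLLAMA_PARALLEL=1
/llm/ollama/ollama serve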

sgwhat (Dec 09 '24 02:12)

Hi @sgwhat. I agree, that's what it looks like. ENV OLLAMA_NUM_PARALLEL=1 is, however, already set in my docker compose file. Any other ideas?

pauleseifert (Dec 09 '24 09:12)

  1. Sorry for the typo; it should be OLLAMA_PARALLEL=1 instead of OLLAMA_NUM_PARALLEL.
  2. Could you please check and provide your GPU memory usage when running Ollama?

sgwhat (Dec 10 '24 02:12)

This doesn't help; the runner still crashes. intel_gpu_top showed normal behavior for the short moment the runner was visible. There are no other processes running, so all memory should be available.

pauleseifert (Dec 16 '24 19:12)

Can you provide the memory usage before and after running ollama run <model>? This can help us resolve the issue.
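
For example, something along these lines would capture it (a sketch; it assumes xpu-smi is available inside the container and uses a placeholder model tag):

docker exec ollama xpu-smi stats -d 0                          # GPU memory before loading the model
docker exec ollama /llm/ollama/ollama run llama3.2:3b "hi" &   # trigger a model load in the background
sleep 5
docker exec ollama xpu-smi stats -d 0                          # GPU memory while the runner is loading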

sgwhat (Dec 17 '24 02:12)

I have the same problem with an Intel Arc A310 4GB.

I logged the RAM usage while the ollama run process was running, and it's clear that not much RAM is being used.

Do you have any ideas, @sgwhat?

for n in {1..100} ; do sleep 0.1 ; date ; ps -ef | grep "ollama run deepseek-r1:1.5b" | grep -v grep ; free -m | awk 'NR==2{print $3}' ; done
Tue Apr 22 11:04:39 PM CEST 2025
21238
Tue Apr 22 11:04:39 PM CEST 2025
21239
Tue Apr 22 11:04:39 PM CEST 2025
21239
Tue Apr 22 11:04:39 PM CEST 2025
21239
Tue Apr 22 11:04:39 PM CEST 2025
21239
Tue Apr 22 11:04:39 PM CEST 2025
21239
Tue Apr 22 11:04:40 PM CEST 2025
root     105725 114872  0 23:04 pts/2    00:00:00 docker exec -it ollama-intel-gpu ./ollama run deepseek-r1:1.5b
21251
Tue Apr 22 11:04:40 PM CEST 2025
root     105725 114872  0 23:04 pts/2    00:00:00 docker exec -it ollama-intel-gpu ./ollama run deepseek-r1:1.5b
root     105746  71168  0 23:04 pts/0    00:00:00 ./ollama run deepseek-r1:1.5b
21279
Tue Apr 22 11:04:40 PM CEST 2025
root     105725 114872  0 23:04 pts/2    00:00:00 docker exec -it ollama-intel-gpu ./ollama run deepseek-r1:1.5b
root     105746  71168  0 23:04 pts/0    00:00:00 ./ollama run deepseek-r1:1.5b
21293
Tue Apr 22 11:04:40 PM CEST 2025
root     105725 114872  0 23:04 pts/2    00:00:00 docker exec -it ollama-intel-gpu ./ollama run deepseek-r1:1.5b
root     105746  71168  0 23:04 pts/0    00:00:00 ./ollama run deepseek-r1:1.5b
21318
Tue Apr 22 11:04:40 PM CEST 2025
root     105725 114872  0 23:04 pts/2    00:00:00 docker exec -it ollama-intel-gpu ./ollama run deepseek-r1:1.5b
root     105746  71168  0 23:04 pts/0    00:00:00 ./ollama run deepseek-r1:1.5b
21390
Tue Apr 22 11:04:40 PM CEST 2025
root     105725 114872  0 23:04 pts/2    00:00:00 docker exec -it ollama-intel-gpu ./ollama run deepseek-r1:1.5b
root     105746  71168  0 23:04 pts/0    00:00:00 ./ollama run deepseek-r1:1.5b
21421
Tue Apr 22 11:04:41 PM CEST 2025
root     105725 114872  0 23:04 pts/2    00:00:00 docker exec -it ollama-intel-gpu ./ollama run deepseek-r1:1.5b
root     105746  71168  0 23:04 pts/0    00:00:00 ./ollama run deepseek-r1:1.5b
21423
Tue Apr 22 11:04:41 PM CEST 2025
root     105725 114872  0 23:04 pts/2    00:00:00 docker exec -it ollama-intel-gpu ./ollama run deepseek-r1:1.5b
root     105746  71168  0 23:04 pts/0    00:00:00 ./ollama run deepseek-r1:1.5b
21318
Tue Apr 22 11:04:41 PM CEST 2025
21287
Tue Apr 22 11:04:41 PM CEST 2025
21286
Tue Apr 22 11:04:41 PM CEST 2025
21286

Floflobel (Apr 22 '25 21:04)