llamafile GPU offloading doesn't seem to be working

Hey everyone, awesome project :-) I'm having fun playing around with it, but I think my GPU isn't being utilised. I can see my CPU maxing out while my GPU usage barely changes. Just wondering what the issue is. Here's the terminal output:

/media/storage/Software/AI/Meta-Llama-3-70B-Instruct.Q4_0.llamafile -ngl 9999
import_cuda_impl: initializing gpu module...
get_rocm_bin_path: note: amdclang++ not found on $PATH
get_rocm_bin_path: note: $HIP_PATH/bin/amdclang++ does not exist
get_rocm_bin_path: note: /opt/rocm/bin/amdclang++ does not exist
get_rocm_bin_path: note: hipInfo not found on $PATH
get_rocm_bin_path: note: $HIP_PATH/bin/hipInfo does not exist
get_rocm_bin_path: note: /opt/rocm/bin/hipInfo does not exist
get_rocm_bin_path: note: rocminfo not found on $PATH
get_rocm_bin_path: note: $HIP_PATH/bin/rocminfo does not exist
get_rocm_bin_path: note: /opt/rocm/bin/rocminfo does not exist
get_amd_offload_arch_flag: warning: can't find hipInfo/rocminfo commands for AMD GPU detection
llamafile_log_command: hipcc -O3 -fPIC -shared -DNDEBUG --offload-arch=native -march=native -mtune=native -DGGML_BUILD=1 -DGGML_SHARED=1 -Wno-return-type -Wno-unused-result -DGGML_USE_HIPBLAS -DGGML_CUDA_MMV_Y=1 -DGGML_MULTIPLATFORM -DGGML_CUDA_DMMV_X=32 -DIGNORE4 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DIGNORE -o /home/v4u6h4n/.llamafile/ggml-rocm.so.dhsn3g /home/v4u6h4n/.llamafile/ggml-cuda.cu -lhipblas -lrocblas
hipcc: Permission denied
extract_cuda_dso: note: prebuilt binary /zip/ggml-rocm.so not found
get_nvcc_path: note: nvcc not found on $PATH
get_nvcc_path: note: $CUDA_PATH/bin/nvcc does not exist
get_nvcc_path: note: /opt/cuda/bin/nvcc does not exist
get_nvcc_path: note: /usr/local/cuda/bin/nvcc does not exist
extract_cuda_dso: note: prebuilt binary /zip/ggml-cuda.so not found
{"function":"server_params_parse","level":"WARN","line":2384,"msg":"Not compiled with GPU offload support, --n-gpu-layers option will be ignored. See main README.md for information on enabling GPU BLAS support","n_gpu_layers":-1,"tid":"8545344","timestamp":1714335027}
note: if you have an AMD or NVIDIA GPU then you need to pass -ngl 9999 to enable GPU offloading
{"build":1500,"commit":"a30b324","function":"server_cli","level":"INFO","line":2839,"msg":"build info","tid":"8545344","timestamp":1714335027}
{"function":"server_cli","level":"INFO","line":2842,"msg":"system info","n_threads":16,"n_threads_batch":-1,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | ","tid":"8545344","timestamp":1714335027,"total_threads":32}
llama_model_loader: loaded meta data with 22 key-value pairs and 723 tensors from Meta-Llama-3-70B-Instruct.Q4_0.gguf (version GGUF V3 (latest))

...and my system specs:

OS: Arch Linux x86_64
Kernel: 6.8.7-arch1-2
CPU: AMD Ryzen 9 7950X3D (32) @ 5.759GHz
GPU: AMD ATI Radeon RX 7900 XT/7900 XTX/7900M
GPU: AMD ATI 13:00.0 Raphael
Memory: 14430MiB / 63427MiB

Apr 28 '24 23:04 v4u6h4n

Same here, Radeon Pro W5700

llava-v1.5-7b-q4.llamafile --version
llamafile v0.8.0

Apr 29 '24 18:04 ahonnecke

relevant perhaps: https://rocm.docs.amd.com/projects/install-on-linux/en/latest/tutorial/quick-start.html

Apr 29 '24 19:04 ahonnecke

> relevant perhaps: https://rocm.docs.amd.com/projects/install-on-linux/en/latest/tutorial/quick-start.html

Hey :-)

Did it fix anything for you?

Apr 29 '24 20:04 v4u6h4n

Doesn't seem to have, but I'm not sure that it installed properly.

Apr 29 '24 20:04 ahonnecke

I was able to make it work by changing the base image of my container to FROM nvcr.io/nvidia/pytorch:24.03-py3

That base image is gigantic (~14.6 GB), so the best option would probably be to use a Docker multi-stage build to extract nvcc and its dependencies.

May 06 '24 19:05 fcrisciani

@fcrisciani Unfortunately I am enough of an amateur linux user that I don't know what that means lol but happy you got it working ;-)

May 07 '24 06:05 v4u6h4n

I was referring to creating a docker image (https://docs.docker.com/engine/install/)

My Dockerfile looks like:

FROM nvcr.io/nvidia/pytorch:24.03-py3

RUN apt update && apt install -y wget

COPY start.sh /
RUN chmod +x /start.sh

CMD /start.sh

The start.sh file looks like:

#!/bin/bash

echo "Download llamafile..."
# Quote the URL so the shell does not treat "?" as a glob character
wget "https://huggingface.co/jartine/llava-v1.5-7B-GGUF/resolve/main/llava-v1.5-7b-q4.llamafile?download=true" -O /tmp/llava-v1.5-7b-q4.llamafile

echo "Start serving the llamafile"
chmod +x /tmp/llava-v1.5-7b-q4.llamafile
/tmp/llava-v1.5-7b-q4.llamafile -ngl 999 --gpu nvidia --nobrowser --host 0.0.0.0

You can:

1. Install Docker.
2. Create a folder containing the two files above: Dockerfile and start.sh.
3. Build the container image: docker build -t my_gpu_test .
4. Run it: docker run --rm -it --gpus=all my_gpu_test
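
One way to confirm whether a run actually picked up the GPU is to scan the startup output for the warning quoted in the original post. The sketch below is a hypothetical helper: the function name gpu_offload_ignored and the log file name llamafile.log are assumptions, not part of llamafile. It only matches the literal warning text shown in that log, so it may miss other failure modes.

```shell
#!/bin/bash
# Hypothetical helper, not part of llamafile.
# gpu_offload_ignored LOGFILE
# Returns 0 if LOGFILE contains the warning from the original post saying
# that --n-gpu-layers will be ignored, and 1 if the warning is absent.
gpu_offload_ignored() {
  grep -q -F 'Not compiled with GPU offload support' -- "$1"
}

# Example use (assumes the startup output was saved first, e.g. with
#   docker run --rm --gpus=all my_gpu_test 2>&1 | tee llamafile.log ):
#
#   if gpu_offload_ignored llamafile.log; then
#     echo "GPU offload was ignored; the model is running on the CPU"
#   fi
```

If the function reports the warning, the -ngl flag had no effect for that run, which matches the CPU-only behaviour described at the top of the thread.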

May 07 '24 15:05 fcrisciani

@fcrisciani it looks like you may be suggesting a fix that works in your case with an NVIDIA GPU, but the OP's issue relates to an AMD GPU problem. Considering that the use case of llamafile is a single-file LLM that utilizes your GPU, wouldn't a Docker install be overkill for this problem, and would your fix even address the AMD side of things?

Jun 17 '24 18:06 s38b35M5

Conceptually the solution is the same. My understanding is that for an NVIDIA GPU the dependency is nvcc, while for AMD it is hipcc. If you properly install all the dependencies on your machine, it should work without Docker. I used Docker just to create an image with all the dependencies baked in, so that I can move it between machines without installing everything manually, but that's a user preference.
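
As a rough way to check the dependency point above, the sketch below looks for hipcc, rocminfo, and nvcc in the places the original log reports searching: $PATH first, then $HIP_PATH/bin or $CUDA_PATH/bin, then the fixed directories named in that log. find_tool and check_gpu_toolchains are hypothetical helper names, not llamafile commands, and this does not reproduce llamafile's actual detection logic. Because the log also shows "hipcc: Permission denied", the sketch reports files that exist but lack the execute bit, which is one possibility worth ruling out.

```shell
#!/bin/bash
# Hypothetical helpers, not part of llamafile.

# find_tool NAME BIN_DIR...
# Prints the first usable path for NAME: $PATH is tried first, then each
# BIN_DIR in order. Returns 1 if nothing executable is found. A file that
# exists without the execute bit is reported on stderr.
find_tool() {
  local name="$1" found dir
  shift
  if found="$(command -v "$name" 2>/dev/null)"; then
    printf '%s\n' "$found"
    return 0
  fi
  for dir in "$@"; do
    [ -n "$dir" ] || continue   # skip entries built from unset variables
    if [ -f "$dir/$name" ] && [ -x "$dir/$name" ]; then
      printf '%s\n' "$dir/$name"
      return 0
    elif [ -e "$dir/$name" ]; then
      printf 'note: %s exists but is not executable\n' "$dir/$name" >&2
    fi
  done
  return 1
}

# Checks the AMD (ROCm) and NVIDIA (CUDA) tools mentioned in this thread,
# using the fallback directories that appear in the original log.
check_gpu_toolchains() {
  local tool
  for tool in hipcc rocminfo; do
    find_tool "$tool" "${HIP_PATH:+$HIP_PATH/bin}" /opt/rocm/bin \
      || echo "$tool: not found"
  done
  find_tool nvcc "${CUDA_PATH:+$CUDA_PATH/bin}" /opt/cuda/bin /usr/local/cuda/bin \
    || echo "nvcc: not found"
}
```

Running check_gpu_toolchains after sourcing the file prints either a path or a "not found" line for each tool. On the OP's machine, the log suggests the ROCm entries would come back as not found until ROCm is installed and reachable.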

Jun 17 '24 20:06 fcrisciani
