llamafile
GPU offloading doesn't seem to be working
Hey everyone, awesome project :-) I'm having fun playing around with it, but I think my GPU isn't being utilised. I can see my CPU maxing out, but my GPU usage barely changes, so I'm wondering what the issue is. Here's the terminal output:
/media/storage/Software/AI/Meta-Llama-3-70B-Instruct.Q4_0.llamafile -ngl 9999
import_cuda_impl: initializing gpu module...
get_rocm_bin_path: note: amdclang++ not found on $PATH
get_rocm_bin_path: note: $HIP_PATH/bin/amdclang++ does not exist
get_rocm_bin_path: note: /opt/rocm/bin/amdclang++ does not exist
get_rocm_bin_path: note: hipInfo not found on $PATH
get_rocm_bin_path: note: $HIP_PATH/bin/hipInfo does not exist
get_rocm_bin_path: note: /opt/rocm/bin/hipInfo does not exist
get_rocm_bin_path: note: rocminfo not found on $PATH
get_rocm_bin_path: note: $HIP_PATH/bin/rocminfo does not exist
get_rocm_bin_path: note: /opt/rocm/bin/rocminfo does not exist
get_amd_offload_arch_flag: warning: can't find hipInfo/rocminfo commands for AMD GPU detection
llamafile_log_command: hipcc -O3 -fPIC -shared -DNDEBUG --offload-arch=native -march=native -mtune=native -DGGML_BUILD=1 -DGGML_SHARED=1 -Wno-return-type -Wno-unused-result -DGGML_USE_HIPBLAS -DGGML_CUDA_MMV_Y=1 -DGGML_MULTIPLATFORM -DGGML_CUDA_DMMV_X=32 -DIGNORE4 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DIGNORE -o /home/v4u6h4n/.llamafile/ggml-rocm.so.dhsn3g /home/v4u6h4n/.llamafile/ggml-cuda.cu -lhipblas -lrocblas
hipcc: Permission denied
extract_cuda_dso: note: prebuilt binary /zip/ggml-rocm.so not found
get_nvcc_path: note: nvcc not found on $PATH
get_nvcc_path: note: $CUDA_PATH/bin/nvcc does not exist
get_nvcc_path: note: /opt/cuda/bin/nvcc does not exist
get_nvcc_path: note: /usr/local/cuda/bin/nvcc does not exist
extract_cuda_dso: note: prebuilt binary /zip/ggml-cuda.so not found
{"function":"server_params_parse","level":"WARN","line":2384,"msg":"Not compiled with GPU offload support, --n-gpu-layers option will be ignored. See main README.md for information on enabling GPU BLAS support","n_gpu_layers":-1,"tid":"8545344","timestamp":1714335027}
note: if you have an AMD or NVIDIA GPU then you need to pass -ngl 9999 to enable GPU offloading
{"build":1500,"commit":"a30b324","function":"server_cli","level":"INFO","line":2839,"msg":"build info","tid":"8545344","timestamp":1714335027}
{"function":"server_cli","level":"INFO","line":2842,"msg":"system info","n_threads":16,"n_threads_batch":-1,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | ","tid":"8545344","timestamp":1714335027,"total_threads":32}
llama_model_loader: loaded meta data with 22 key-value pairs and 723 tensors from Meta-Llama-3-70B-Instruct.Q4_0.gguf (version GGUF V3 (latest))
...and my system specs:
OS: Arch Linux x86_64
Kernel: 6.8.7-arch1-2
CPU: AMD Ryzen 9 7950X3D (32) @ 5.759GHz
GPU: AMD ATI Radeon RX 7900 XT/7900 XTX/7900M
GPU: AMD ATI 13:00.0 Raphael
Memory: 14430MiB / 63427MiB
Same here, Radeon Pro W5700
llava-v1.5-7b-q4.llamafile --version
llamafile v0.8.0
relevant perhaps: https://rocm.docs.amd.com/projects/install-on-linux/en/latest/tutorial/quick-start.html
Hey :-)
Did it fix anything for you?
Doesn't seem to have, but I'm not sure that it installed properly.
I was able to make it work by changing the base image of my container to FROM nvcr.io/nvidia/pytorch:24.03-py3
That base image is gigantic (~14.6 GB), so probably the best option would be to use a Docker multi-stage build to extract nvcc and its dependencies.
@fcrisciani Unfortunately I am enough of an amateur linux user that I don't know what that means lol but happy you got it working ;-)
I was referring to creating a docker image (https://docs.docker.com/engine/install/)
My Dockerfile looks like:
FROM nvcr.io/nvidia/pytorch:24.03-py3
RUN apt update && apt install -y wget
COPY start.sh /
RUN chmod +x /start.sh
CMD ["/start.sh"]
the start file looks like:
#!/bin/bash
echo "Download llamafile..."
wget "https://huggingface.co/jartine/llava-v1.5-7B-GGUF/resolve/main/llava-v1.5-7b-q4.llamafile?download=true" -O /tmp/llava-v1.5-7b-q4.llamafile
echo "Start serving the llamafile"
chmod +x /tmp/llava-v1.5-7b-q4.llamafile
/tmp/llava-v1.5-7b-q4.llamafile -ngl 999 --gpu nvidia --nobrowser --host 0.0.0.0
you can:
- install docker
- create a folder with the 2 files above: Dockerfile and start.sh
- build the container image: docker build -t my_gpu_test .
- run it: docker run --rm -it --gpus=all my_gpu_test
@fcrisciani it looks like you may be suggesting a fix that works in your case with an NVIDIA GPU, but the OP's issue relates to an AMD GPU. Considering that llamafile is meant to be a single-file LLM that uses your GPU, wouldn't a Docker install be overkill for this problem, and would your fix even address the AMD side of things?
Conceptually the solution is the same. My understanding is that for an NVIDIA GPU the dependency is nvcc, and for AMD it is hipcc. If you properly install all the dependencies on your machine, it should work without Docker. I used Docker just to create an image with all the dependencies baked in, so that I can move it to different machines without installing everything manually, but that's a user preference.
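To check whether those dependencies are reachable before launching the llamafile, something like the sketch below might help. It is not llamafile's own detection logic. It only looks for each tool on `$PATH` and then in the fallback directories named in the OP's log (`$HIP_PATH/bin`, `/opt/rocm/bin`, `$CUDA_PATH/bin`, `/opt/cuda/bin`, `/usr/local/cuda/bin`). The helper names `check_tool` and `report` are made up for this example, and searching for `hipcc` in the ROCm directories is an assumption, since the log only shows it being invoked by name.

```shell
#!/bin/sh
# Sketch only: checks the locations mentioned in the log above.
# It does not reproduce how llamafile itself detects compilers.

# check_tool NAME [DIR...]
# Prints the first usable path for NAME and returns 0, otherwise returns 1.
check_tool() {
    name=$1
    shift
    # 1. Look on $PATH.
    if hit=$(command -v "$name" 2>/dev/null); then
        printf '%s\n' "$hit"
        return 0
    fi
    # 2. Look in the fallback directories; the file must be executable.
    for dir in "$@"; do
        if [ -n "$dir" ] && [ -x "$dir/$name" ] && [ ! -d "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
            return 0
        fi
    done
    return 1
}

# report NAME [DIR...]
# Prints a one-line summary for NAME.
report() {
    if found=$(check_tool "$@"); then
        echo "$1: found at $found"
    else
        echo "$1: not found or not executable"
    fi
}

# AMD side (hipcc location is an assumption, see above).
report hipcc    "${HIP_PATH:+$HIP_PATH/bin}" /opt/rocm/bin
report rocminfo "${HIP_PATH:+$HIP_PATH/bin}" /opt/rocm/bin

# NVIDIA side.
report nvcc "${CUDA_PATH:+$CUDA_PATH/bin}" /opt/cuda/bin /usr/local/cuda/bin
```

If a tool is reported as missing, installing the vendor toolkit (ROCm for AMD, CUDA for NVIDIA) and re-running the check is a reasonable first step. If it is reported as found but the llamafile still prints `Permission denied`, the file's permissions are worth a look.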