mistral.rs
WSL2 Docker error loading llama-3.1 gguf
Describe the bug
My environment
Windows 11 Pro, Docker Desktop, WSL2 Ubuntu engine, latest NVIDIA driver
CUDA test
I made sure the Docker WSL2 CUDA integration works correctly by executing:
docker run --rm -it --gpus=all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark
as stated in the documentation. So CUDA works inside Docker with WSL2.
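As an additional sanity check, nvidia-smi can also be run directly inside a container (the nvidia/cuda image tag below is just an example and may need adjusting to an available one):
# example tag; substitute any available nvidia/cuda base image
docker run --rm --gpus=all nvidia/cuda:12.3.2-base-ubuntu22.04 nvidia-smi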
Model loading error
docker run --gpus all --rm -v C:\Users\xxx\.cache\lm-studio\models\duyntnet\Meta-Llama-3.1-8B-Instruct-imatrix-GGUF:/model -p 8080:8080 ghcr.io/ericlbuehler/mistral.rs:cuda-90-sha-8a84d05 gguf -m /model -f Meta-Llama-3.1-8B-Instruct-IQ4_NL.gguf
leads to
...
2024-08-12T20:56:20.241100Z INFO mistralrs_core::pipeline::paths: Loading `Meta-Llama-3.1-8B-Instruct-IQ4_NL.gguf` locally at `/model/Meta-Llama-3.1-8B-Instruct-IQ4_NL.gguf`
2024-08-12T20:56:20.244485Z INFO mistralrs_core::pipeline::gguf: Loading model `/model` on cuda[0].
Error: path: "/model/Meta-Llama-3.1-8B-Instruct-IQ4_NL.gguf" unknown dtype for tensor 20
Maybe imatrix quants are not supported?
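For reference, the tensor quantization types inside the file can be listed with the dump script that ships with the gguf Python package (a minimal sketch, assuming Python and pip are available; the script name and output format may vary between gguf versions):
# hypothetical check, run wherever the file is accessible
pip install gguf
gguf-dump /model/Meta-Llama-3.1-8B-Instruct-IQ4_NL.gguf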
Trying a normal GGUF quant also doesn't seem to work:
docker run --gpus all --rm -v C:\Users\xxx\.cache\lm-studio\models\bartowski\Meta-Llama-3.1-8B-Instruct-GGUF:/model -p 8080:8080 ghcr.io/ericlbuehler/mistral.rs:cuda-90-sha-8a84d05 gguf -m /model -f Meta-Llama-3.1-8B-Instruct-Q6_K_L.gguf
leading to:
...
2024-08-12T20:55:28.177396Z INFO mistralrs_core::gguf::gguf_tokenizer: GGUF tokenizer model is `gpt2`, kind: `Bpe`, num tokens: 128256, num added tokens: 0, num merges: 280147, num scores: 0
2024-08-12T20:55:28.185104Z INFO mistralrs_core::gguf::chat_template: Discovered and using GGUF chat template: `...
Error: DriverError(CUDA_ERROR_INVALID_PTX, "a PTX JIT compilation failed") when loading dequantize_block_q8_0_f32
This is a newer quant, created after the RoPE frequency issue was fixed in llama.cpp.
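If the cuda-90 image tag refers to compute capability 9.0 (an assumption on my part), the PTX JIT failure could be a mismatch between the capability the kernels were compiled for and the GPU's actual one, which can be checked with a reasonably recent driver:
# the compute_cap query requires a recent driver (roughly R510+)
nvidia-smi --query-gpu=name,compute_cap --format=csv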
Port argument conflict
Also: I can use the Docker argument -p 8080:1234 to map ports. The mistral.rs argument --serve-ip 0.0.0.0 works, but --port 1234 doesn't:
docker run --gpus all --rm -v C:\Users\xxx\.cache\lm-studio\models\bartowski\Meta-Llama-3.1-8B-Instruct-GGUF:/model -p 8080:1234 ghcr.io/ericlbuehler/mistral.rs:cuda-90-sha-8a84d05 --serve-ip 0.0.0.0 --port 1234 gguf -m /model -f Meta-Llama-3.1-8B-Instruct-Q6_K_L.gguf
leads to
error: the argument '--port <PORT>' cannot be used multiple times
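A possible workaround (an untested sketch: it assumes the image's entrypoint already passes a default --port, and that the server binary inside the image is named mistralrs-server and is on PATH) is to override the entrypoint so --port is only supplied once:
# assumes the binary inside the image is called mistralrs-server
docker run --gpus all --rm -v C:\Users\xxx\.cache\lm-studio\models\bartowski\Meta-Llama-3.1-8B-Instruct-GGUF:/model -p 8080:1234 --entrypoint mistralrs-server ghcr.io/ericlbuehler/mistral.rs:cuda-90-sha-8a84d05 --serve-ip 0.0.0.0 --port 1234 gguf -m /model -f Meta-Llama-3.1-8B-Instruct-Q6_K_L.gguf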
Latest commit or version
Using the Docker image ghcr.io/ericlbuehler/mistral.rs:cuda-90-sha-8a84d05