Unable to download models
Hi, I'm really new at trying this stuff out, so perhaps I'm just missing something. I can't seem to get any models with .safetensors files to work; there's always an error saying "no .bin weights found for model." That said, if I run the Docker container without specifying a model, it defaults to downloading bigscience/bloom-560m/model.safetensors. Interestingly, if I choose a model that provides .bin weights, I see a download message saying "No safetensors weights found for model. Downloading PyTorch weights," which seems to indicate that .safetensors files are not only supported but preferred.
When I look at the HF repo for that model, it includes both a .safetensors and a .bin file. The one I'm trying to use (TheBloke/wizard-mega-13B-GPTQ) only has a .safetensors weight file. When I attempt to use models with .bin weights (TheBloke/gpt4-x-vicuna-13B-HF and openaccess-ai-collective/manticore-13b), I get the additional errors shown in the attached screenshots:
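(In case it helps anyone reproduce this: you can list exactly which weight files a repo ships via the public Hub API; the jq filter below is just an illustration.)
curl -s https://huggingface.co/api/models/TheBloke/wizard-mega-13B-GPTQ | jq '.siblings[].rfilename'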
I have been at my computer working on getting this going for about 16 hours now. Unfortunately, I have autism and struggle to learn a lot of things, so I generally need detailed step-by-step instructions to learn something new. If anyone has the time and would be willing to help out here, I'd greatly appreciate it. Also, if there are any guides or resources you could recommend, I'd love to become more familiar with this stuff.
Please share your system info with us:
- Ryzen 9 3900X, RTX 4090, 64GB DDR4 3200
- Windows 10 Pro Build 22H2
- Relevant portions of msinfo/dxdiag: https://pastebin.com/78y7kTex
The full command line used that causes issues:
$model = "TheBloke/wizard-mega-13B-GPTQ"
$num_shard = 1
$volume = "$PWD/data"
docker run --gpus all --shm-size 1g -p 8080:80 -v ${volume}:/data ghcr.io/huggingface/text-generation-inference:latest --model-id $model --num-shard $num_shard
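(For completeness: when the container does start, e.g. with the default bigscience/bloom-560m, I can reach it like this. The exact prompt and parameters are just an example, and on Windows this needs curl.exe rather than the PowerShell curl alias.)
curl 127.0.0.1:8080/info | jq
curl 127.0.0.1:8080/generate -X POST -H "Content-Type: application/json" -d '{"inputs": "Hello", "parameters": {"max_new_tokens": 20}}'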
Rust version (if self-compiling, cargo version): Whatever the docker image is using
Model being used (curl 127.0.0.1:8080/info | jq): TheBloke/wizard-mega-13B-GPTQ
Hardware used (GPUs, how many, on which cloud) (nvidia-smi): 1x RTX 4090, local
Deployment specificities (Kubernetes, EKS, AKS, any particular deployments): Docker
The current version being used: latest
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
$model = "TheBloke/wizard-mega-13B-GPTQ"
$num_shard = 1
$volume = "$PWD/data"
docker run --gpus all --shm-size 1g -p 8080:80 -v ${volume}:/data ghcr.io/huggingface/text-generation-inference:latest --model-id $model --num-shard $num_shard
Expected behavior
I would expect it to download and run the model.
The first screenshot you shared says that you're trying to write to a read-only volume, meaning it's most likely a permissions problem with your data folder.
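A quick way to check whether the container can actually write to the mount (just a diagnostic sketch; the alpine image and test file name are arbitrary):
docker run --rm -v ${volume}:/data alpine sh -c "touch /data/.write-test && echo writable"
If that fails, fix the permissions or sharing settings on the host folder (on Docker Desktop for Windows, the drive typically has to be shared with Docker) before retrying.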
The second screenshot shows you're trying to use GGML-quantized models (q4_0, q5_0, etc.). Those files cannot be read by this project; they are GGML-specific. There are tools out there to convert them (can't find them atm), but you should generally start from the non-quantized weights when converting anyway.
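Concretely, pointing the launcher at the full-precision repo you already mentioned should work (same flags as your reproduction):
docker run --gpus all --shm-size 1g -p 8080:80 -v ${volume}:/data ghcr.io/huggingface/text-generation-inference:latest --model-id TheBloke/gpt4-x-vicuna-13B-HF --num-shard 1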
Please note that there is no Q4 quantization support in this project right now. We're slowly incorporating GPTQ, which gives much better results than the naive GGML Q4 quantization. (But it doesn't exist yet.)
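(Follow-up for readers finding this later: once GPTQ support lands, the launcher is expected to expose it as a quantize flag; assuming an image new enough to include it, the call would look roughly like the line below. Check text-generation-launcher --help inside your image before relying on this.)
docker run --gpus all --shm-size 1g -p 8080:80 -v ${volume}:/data ghcr.io/huggingface/text-generation-inference:latest --model-id TheBloke/wizard-mega-13B-GPTQ --quantize gptq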
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.