
Unable to download models

Open · bulletproofmedic opened this issue 2 years ago · 1 comment

Hi, I'm really new at trying out this stuff, so perhaps I'm just missing something. I can't seem to get any models with .safetensors files to work; there's always an error saying "no .bin weights found for model." That said, if I run the Docker container without specifying a model, it defaults to downloading bigscience/bloom-560m/model.safetensors. Interestingly, if I choose a model that provides .bin weights, I see a download message that says "No safetensors weights found for model. Downloading PyTorch weights", which seems to indicate that .safetensors are not only supported but preferred.

When I look at the HF repo for that model, it includes both .safetensors and .bin files. The one I'm trying to use (TheBloke/wizard-mega-13B-GPTQ) only has a .safetensors weight file. When I attempt to use models with .bin weights (TheBloke/gpt4-x-vicuna-13B-HF and openaccess-ai-collective/manticore-13b), I get additional errors:

[screenshot: gpt4xvicuna error]

[screenshot: manticore error]
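
A quick way to check which weight formats a repo actually ships is the Hub's public model API, which returns a "siblings" array listing every file in the repo. A minimal PowerShell sketch (the endpoint and field names are the Hub's documented API; the extension filter is just illustrative):

# Query the Hub's model API for repo metadata, including the file list.
$model = "TheBloke/wizard-mega-13B-GPTQ"
$info = Invoke-RestMethod "https://huggingface.co/api/models/$model"
# Keep only weight files so the available formats are obvious at a glance.
$info.siblings |
    Where-Object { $_.rfilename -match '\.(safetensors|bin|pt)$' } |
    ForEach-Object { $_.rfilename }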

I have been at my computer working on getting this going for about 16 hours now. Unfortunately I have Autism and struggle to learn a lot of things, so I generally need detailed step-by-step instructions to learn something new. If anyone has the time and would be willing to help out here, I'd greatly appreciate it. Also, if there are any guides or resources you could recommend for me, I'd love to become more familiar with this stuff.

Please share your system info with us:

  • Ryzen 9 3900X, RTX 4090, 64GB DDR4 3200
  • Windows 10 Pro Build 22H2
  • Relevant portions of msinfo/dxdiag: https://pastebin.com/78y7kTex

The full command line used that causes issues:

$model = "TheBloke/wizard-mega-13B-GPTQ"
$num_shard = 1
$volume = "$PWD/data"

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume/data ghcr.io/huggingface/text-generation-inference:latest --model-id $model --num-shard $num_shard

  • Rust version (if self-compiling, cargo version): whatever the Docker image is using
  • Model being used (curl 127.0.0.1:8080/info | jq): TheBloke/wizard-mega-13B-GPTQ
  • Hardware used (GPUs, how many, on which cloud) (nvidia-smi): 1x RTX 4090, local
  • Deployment specificities (Kubernetes, EKS, AKS, any particular deployments): Docker
  • The current version being used: latest

Information

  • [X] Docker
  • [ ] The CLI directly

Tasks

  • [X] An officially supported command
  • [ ] My own modifications

Reproduction

$model = "TheBloke/wizard-mega-13B-GPTQ"
$num_shard = 1
$volume = "$PWD/data"

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume/data ghcr.io/huggingface/text-generation-inference:latest --model-id $model --num-shard $num_shard

Expected behavior

I would expect it to download and run the model.
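
For reference, once the container is actually serving, a quick smoke test against TGI's documented /generate endpoint looks like this (PowerShell; the prompt text and token count are just examples):

# POST a short generation request to the local TGI server.
$body = '{"inputs": "What is Deep Learning?", "parameters": {"max_new_tokens": 20}}'
Invoke-RestMethod -Method Post -Uri "http://127.0.0.1:8080/generate" -ContentType "application/json" -Body $body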

bulletproofmedic avatar May 20 '23 16:05 bulletproofmedic

The first screenshot you shared says that you're trying to write to a read-only volume, so it's most likely a permission problem with your data folder.
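
One likely culprit on the mount side: the reported command passes -v $volume/data with no colon, which Docker treats as a single container path backed by an anonymous volume instead of bind-mounting the data folder. The README's documented form uses an explicit host:container pair (sketch below; the ${volume} braces are needed in PowerShell so the colon isn't parsed as a drive qualifier):

$model = "TheBloke/wizard-mega-13B-GPTQ"
$num_shard = 1
# Host folder shared with the container so downloaded weights persist.
$volume = "$PWD/data"

docker run --gpus all --shm-size 1g -p 8080:80 -v "${volume}:/data" ghcr.io/huggingface/text-generation-inference:latest --model-id $model --num-shard $num_shard

If the mount is correct and the error persists, checking Docker Desktop's file-sharing settings for that folder on Windows would be the next step.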

The second screenshot shows you're trying to use GGML-quantized (q4_0, q5_0, etc.) models. Those files cannot be read by this project; they are GGML-specific. There are tools out there (can't find them at the moment) to convert them, but you should usually start from the non-quantized weights for conversion.
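
One rough way to tell locally what you actually downloaded is to look at the first bytes of the weight file. A sketch (the file path is a placeholder, and the GGML magic strings are from memory, so treat them as approximate):

# Modern torch.save() checkpoints (.bin) are ZIP archives and start with "PK".
# .safetensors starts with an 8-byte little-endian header length followed by
# the "{" that opens its JSON header.
# GGML-family files start with a short ASCII magic such as "ggml"/"ggjt".
$bytes = [System.IO.File]::ReadAllBytes("C:\data\model.bin")[0..15]
($bytes | ForEach-Object { "{0:X2}" -f $_ }) -join " "   # hex dump of the head
[System.Text.Encoding]::ASCII.GetString($bytes[0..3])    # magic as ASCII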

Please note that there is no q4 quantization support in this project right now. We're slowly incorporating GPTQ, which provides much better results than the naive GGML q4 quantization. (But it doesn't exist yet.)

Narsil avatar May 22 '23 12:05 Narsil

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Aug 09 '24 01:08 github-actions[bot]