text-generation-inference
Server error: cublasLt ran into an error!
System Info
Target: x86_64-unknown-linux-gnu
Cargo version: 1.69.0
Commit sha: N/A
Docker label: N/A
nvidia-smi:
Wed Jun 28 20:17:18 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17    Driver Version: 525.105.17    CUDA Version: 12.0   |
{ "model_id": "ehartford/WizardLM-33B-V1.0-Uncensored", "model_sha": "3eca9fdee0ce28d6a4a635a6f19d9a413caee3e7", "model_dtype": "torch.float16", "model_device_type": "cuda", "model_pipeline_tag": "text-generation", "max_concurrent_requests": 128, "max_best_of": 2, "max_stop_sequences": 4, "max_input_length": 1000, "max_total_tokens": 1512, "waiting_served_ratio": 1.2, "max_batch_total_tokens": 32000, "max_waiting_tokens": 20, "validation_workers": 2, "version": "0.8.2", "sha": "e7248fe90e27c7c8e39dd4cac5874eb9f96ab182", "docker_label": "sha-e7248fe" }
I'm using an H100 from LambdaLabs.
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
Starting the Docker container with --quantize:
sudo docker run --gpus all -p 8002:80 -v /home/ubuntu/data:/data ghcr.io/huggingface/text-generation-inference:0.8 --model-id ehartford/WizardLM-33B-V1.0-Uncensored --quantize bitsandbytes
The server starts fine. However, when I make an inference request, I get:
{ "error": "Request failed during generation: Server error: cublasLt ran into an error!", "error_type": "generation" }
Payload sent to /generate:
`{ "inputs": "This is " }`
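For completeness, here is a minimal client call that triggers the error (a sketch; the host and port assume the -p 8002:80 mapping from the docker run command above):

```python
import requests

# Minimal /generate call matching the payload above.
# localhost:8002 comes from the -p 8002:80 mapping in the docker run command.
resp = requests.post(
    "http://localhost:8002/generate",
    json={"inputs": "This is "},
    timeout=60,
)
print(resp.status_code)
print(resp.json())
# On the affected setup this prints the error body shown above
# instead of generated text.
```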
Expected behavior
The inference request should return generated text.
This could be an issue with bitsandbytes quantization; there is a thread with several similar linked issues: https://github.com/TimDettmers/bitsandbytes/issues/538
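If that is the cause, the failure should be reproducible outside TGI with a single bitsandbytes int8 matmul. A minimal sketch (my assumption about the relevant code path, not code taken from TGI):

```python
import torch
import bitsandbytes as bnb

# Standalone check of the bitsandbytes int8 (cublasLt) matmul path.
# Layer sizes are arbitrary; threshold=6.0 mirrors common int8 inference settings.
layer = bnb.nn.Linear8bitLt(4096, 4096, has_fp16_weights=False, threshold=6.0)
layer = layer.cuda()  # weights are quantized to int8 when moved to the GPU

x = torch.randn(1, 4096, dtype=torch.float16, device="cuda")
y = layer(x)  # on affected GPU/driver combos this raises "cublasLt ran into an error!"
print(y.shape)
```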