text-generation-inference
Server error: cublasLt ran into an error!
System Info
Target: x86_64-unknown-linux-gnu
Cargo version: 1.69.0
Commit sha: N/A
Docker label: N/A
nvidia-smi:
Wed Jun 28 20:17:18 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17    Driver Version: 525.105.17    CUDA Version: 12.0   |
{ "model_id": "ehartford/WizardLM-33B-V1.0-Uncensored", "model_sha": "3eca9fdee0ce28d6a4a635a6f19d9a413caee3e7", "model_dtype": "torch.float16", "model_device_type": "cuda", "model_pipeline_tag": "text-generation", "max_concurrent_requests": 128, "max_best_of": 2, "max_stop_sequences": 4, "max_input_length": 1000, "max_total_tokens": 1512, "waiting_served_ratio": 1.2, "max_batch_total_tokens": 32000, "max_waiting_tokens": 20, "validation_workers": 2, "version": "0.8.2", "sha": "e7248fe90e27c7c8e39dd4cac5874eb9f96ab182", "docker_label": "sha-e7248fe" }
I'm using an H100 from LambdaLabs.
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
Starting the Docker container with --quantize:
sudo docker run --gpus all -p 8002:80 -v /home/ubuntu/data:/data ghcr.io/huggingface/text-generation-inference:0.8 --model-id ehartford/WizardLM-33B-V1.0-Uncensored --quantize bitsandbytes
The server starts fine. However, when I make an inference request, I get:
{ "error": "Request failed during generation: Server error: cublasLt ran into an error!", "error_type": "generation" }
Payload sent to /generate:
`{ "inputs": "This is " }`
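For completeness, here is a minimal client call that triggers the error (a sketch; the host and port assume the -p 8002:80 mapping from the docker run command above):

```python
import requests

# Minimal /generate call matching the payload above.
# localhost:8002 comes from the -p 8002:80 mapping in the docker run command.
resp = requests.post(
    "http://localhost:8002/generate",
    json={"inputs": "This is "},
    timeout=60,
)
print(resp.status_code)
print(resp.json())
# On the affected setup this prints the error body shown above
# instead of generated text.
```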
Expected behavior
The inference request should return generated text.
This could be an issue with bitsandbytes quantization; there is a thread with several similar linked issues: https://github.com/TimDettmers/bitsandbytes/issues/538
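If that is the cause, the failure should be reproducible outside TGI with a single bitsandbytes int8 matmul. A minimal sketch (my assumption about the relevant code path, not code taken from TGI):

```python
import torch
import bitsandbytes as bnb

# Standalone check of the bitsandbytes int8 (cublasLt) matmul path.
# Layer sizes are arbitrary; threshold=6.0 mirrors common int8 inference settings.
layer = bnb.nn.Linear8bitLt(4096, 4096, has_fp16_weights=False, threshold=6.0)
layer = layer.cuda()  # weights are quantized to int8 when moved to the GPU

x = torch.randn(1, 4096, dtype=torch.float16, device="cuda")
y = layer(x)  # on affected GPU/driver combos this raises "cublasLt ran into an error!"
print(y.shape)
```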