Device-side assert triggered when using Llama 3.2 Vision with a grammar
System Info
Version:
text-generation-launcher 2.4.0
Environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.80.1
Commit sha: 0a655a0ab5db15f08e45d8c535e263044b944190
Docker label: sha-0a655a0
Hardware: 4 x A100
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.78                 Driver Version: 550.78         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100 80GB PCIe          Off |   00000000:17:00.0 Off |                    0 |
| N/A   42C    P0             69W /  300W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100 80GB PCIe          Off |   00000000:65:00.0 Off |                    0 |
| N/A   43C    P0             71W /  300W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A100 80GB PCIe          Off |   00000000:CA:00.0 Off |                    0 |
| N/A   35C    P0             61W /  300W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A100 80GB PCIe          Off |   00000000:E3:00.0 Off |                    0 |
| N/A   34C    P0             64W /  300W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
Deployment specifics: I am running the image with Apptainer instead of Docker. I don't think that is the cause, since regular inference requests work correctly.
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [ ] An officially supported command
- [X] My own modifications
Reproduction
Pull a SIF image of TGI 2.4.0:
apptainer pull hf_tgi.sif docker://"ghcr.io/huggingface/text-generation-inference:2.4.0"
Run the meta-llama/Llama-3.2-11B-Vision-Instruct model:
apptainer run --nv --env "HF_TOKEN=$$SECRET$$" --bind ./models:/data:rw hf_tgi.sif --model-id "meta-llama/Llama-3.2-11B-Vision-Instruct" --port 27685 --revision "cee5b78e6faed15d5f2e6d8a654fd5b247c0d5ca"
The model will download, and the web server will spin up.
After this, call the model with a grammar via curl:
curl localhost:27685/generate -X POST -H 'Content-Type: application/json' -d '{
  "inputs": "I saw a puppy a cat and a raccoon during my bike ride in the park",
  "parameters": {
    "repetition_penalty": 1.3,
    "grammar": {
      "type": "json",
      "value": {
        "properties": {
          "location": {
            "type": "string"
          },
          "activity": {
            "type": "string"
          },
          "animals_seen": {
            "type": "integer",
            "minimum": 1,
            "maximum": 5
          },
          "animals": {
            "type": "array",
            "items": {
              "type": "string"
            }
          }
        },
        "required": ["location", "activity", "animals_seen", "animals"]
      }
    }
  }
}'
TGI then fails with a series of device-side assert errors and exits; curl returns:
{"error":"Request failed during generation: Server error: Unexpected <class 'RuntimeError'>: CUDA error: device-side assert triggered\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1\nCompile with TORCH_USE_CUDA_DSA to enable device-side assertions.\n","error_type":"generation"}
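As the error message itself suggests, relaunching with CUDA_LAUNCH_BLOCKING=1 should make the assert surface synchronously and yield a more accurate stack trace. With Apptainer the variable can be forwarded with a second --env flag, the same mechanism already used for HF_TOKEN above (a debugging sketch, not yet verified):
apptainer run --nv --env "HF_TOKEN=$$SECRET$$" --env "CUDA_LAUNCH_BLOCKING=1" --bind ./models:/data:rw hf_tgi.sif --model-id "meta-llama/Llama-3.2-11B-Vision-Instruct" --port 27685 --revision "cee5b78e6faed15d5f2e6d8a654fd5b247c0d5ca"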
Please note that normal inference works both via curl and via the OpenAI-compatible API with the same model on the same machine, so the problem is somehow related to "grammar". Using tools via the OpenAI-compatible API leads to the exact same error, while a grammar-free request such as the one below succeeds.
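A minimal sanity check against the same endpoint (the prompt is reused from the failing request; max_new_tokens is an arbitrary illustrative value):
curl localhost:27685/generate -X POST -H 'Content-Type: application/json' -d '{
  "inputs": "I saw a puppy a cat and a raccoon during my bike ride in the park",
  "parameters": {
    "max_new_tokens": 50
  }
}'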
Expected behavior
The model should return JSON output constrained by the supplied grammar, as in the example provided in the documentation.
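Concretely, for this prompt something along these lines would be expected (illustrative values; the exact generation will vary):
{"generated_text":"{ \"activity\": \"bike ride\", \"animals\": [\"puppy\",\"cat\",\"raccoon\"], \"animals_seen\": 3, \"location\": \"park\" }"}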
I also experienced the same with this model, whether using a grammar or just attempting to use the function_calling functionality.
This is a bug in the JSON-based tool-calling implementation in the context of the Vision Instruct models. The issue occurs with both Llama 3.2 11B Vision Instruct and 90B Vision Instruct, and with versions 2.4.0 and 3.0.2 (the latest at the time of writing). For illustration, a tools request of the shape shown below is enough to trigger the assert when the loaded model is one of the Vision Instruct checkpoints.
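A sketch of such a failing request, assuming the vision model is being served at 127.0.0.1:8080 (the schema mirrors the working text-only example further down):
curl http://127.0.0.1:8080/v1/chat/completions -X POST -H 'Content-Type: application/json' -d '{
  "model": "tgi",
  "messages": [
    {
      "role": "user",
      "content": "What is the weather like in New York?"
    }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_current_weather",
        "description": "Get the current weather",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "The city and state, e.g. San Francisco, CA"
            }
          },
          "required": ["location"]
        }
      }
    }
  ],
  "tool_choice": "get_current_weather"
}'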
But it works with Llama 3.1 70B (a text-only model):
curl http://127.0.0.1:8080/v1/chat/completions -X POST -H 'Content-Type: application/json' -d '{
  "model": "tgi",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant. When a tool is required, return a JSON object with the tool name and parameters."
    },
    {
      "role": "user",
      "content": "What is the weather like in New York?"
    }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_current_weather",
        "description": "Get the current weather",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "The city and state, e.g. San Francisco, CA"
            },
            "format": {
              "type": "string",
              "enum": ["celsius", "fahrenheit"],
              "description": "The temperature unit to use."
            }
          },
          "required": ["location", "format"]
        }
      }
    }
  ],
  "tool_choice": "get_current_weather"
}'
{"object":"chat.completion","id":"","created":1738216950,"model":"meta-llama/Llama-3.1-70B-Instruct","system_fingerprint":"3.0.2-sha-b70f29d","choices":[{"index":0,"message":{"role":"assistant","tool_calls":[{"id":"0","type":"function","function":{"description":null,"name":"get_current_weather","arguments":{"format":"celsius","location":"New York"}}}]},"logprobs":null,"finish_reason":"stop"}],"usage":{"prompt_tokens":319,"completion_tokens":26,"total_tokens":345}}