Device-side assert triggered when using Llama 3.2 Vision with a grammar
System Info
Version:
text-generation-launcher 2.4.0
Environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.80.1
Commit sha: 0a655a0ab5db15f08e45d8c535e263044b944190
Docker label: sha-0a655a0
Hardware: 4 x A100
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.78                 Driver Version: 550.78         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100 80GB PCIe          Off |   00000000:17:00.0 Off |                    0 |
| N/A   42C    P0             69W /  300W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100 80GB PCIe          Off |   00000000:65:00.0 Off |                    0 |
| N/A   43C    P0             71W /  300W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A100 80GB PCIe          Off |   00000000:CA:00.0 Off |                    0 |
| N/A   35C    P0             61W /  300W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A100 80GB PCIe          Off |   00000000:E3:00.0 Off |                    0 |
| N/A   34C    P0             64W /  300W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
Deployment specifics: I am running the image with Apptainer instead of Docker. I don't think that is the cause, since regular inference requests work correctly.
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [ ] An officially supported command
- [X] My own modifications
Reproduction
Pull a SIF image of TGI 2.4.0:
apptainer pull hf_tgi.sif docker://"ghcr.io/huggingface/text-generation-inference:2.4.0"
Run the meta-llama/Llama-3.2-11B-Vision-Instruct model:
apptainer run --nv --env "HF_TOKEN=$$SECRET$$" --bind ./models:/data:rw hf_tgi.sif --model-id "meta-llama/Llama-3.2-11B-Vision-Instruct" --port 27685 --revision "cee5b78e6faed15d5f2e6d8a654fd5b247c0d5ca"
The model will download, and the web server will spin up.
After this, call the model with a grammar via curl:
curl localhost:27685/generate -X POST -H 'Content-Type: application/json' -d '{
  "inputs": "I saw a puppy a cat and a raccoon during my bike ride in the park",
  "parameters": {
    "repetition_penalty": 1.3,
    "grammar": {
      "type": "json",
      "value": {
        "properties": {
          "location": {
            "type": "string"
          },
          "activity": {
            "type": "string"
          },
          "animals_seen": {
            "type": "integer",
            "minimum": 1,
            "maximum": 5
          },
          "animals": {
            "type": "array",
            "items": {
              "type": "string"
            }
          }
        },
        "required": ["location", "activity", "animals_seen", "animals"]
      }
    }
  }
}'
TGI then fails with a series of device-side assert errors and exits; curl returns:
{"error":"Request failed during generation: Server error: Unexpected <class 'RuntimeError'>: CUDA error: device-side assert triggered\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1\nCompile with TORCH_USE_CUDA_DSA to enable device-side assertions.\n","error_type":"generation"}
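As the error message itself suggests, relaunching with CUDA_LAUNCH_BLOCKING=1 should make the assert surface synchronously and yield a more accurate stack trace. With Apptainer the variable can be forwarded with a second --env flag, the same mechanism already used for HF_TOKEN above (a debugging sketch, not yet verified):
apptainer run --nv --env "HF_TOKEN=$$SECRET$$" --env "CUDA_LAUNCH_BLOCKING=1" --bind ./models:/data:rw hf_tgi.sif --model-id "meta-llama/Llama-3.2-11B-Vision-Instruct" --port 27685 --revision "cee5b78e6faed15d5f2e6d8a654fd5b247c0d5ca"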
Please note that normal inference works both via curl and via the OpenAI-compatible API with the same model on the same machine, so the problem is somehow related to "grammar". Using tools via the OpenAI-compatible API leads to the exact same error, while a grammar-free request such as the one below succeeds.
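A minimal sanity check against the same endpoint (the prompt is reused from the failing request; max_new_tokens is an arbitrary illustrative value):
curl localhost:27685/generate -X POST -H 'Content-Type: application/json' -d '{
  "inputs": "I saw a puppy a cat and a raccoon during my bike ride in the park",
  "parameters": {
    "max_new_tokens": 50
  }
}'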
Expected behavior
The model should return JSON output constrained by the supplied grammar, as in the example provided in the documentation.
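Concretely, for this prompt something along these lines would be expected (illustrative values; the exact generation will vary):
{"generated_text":"{ \"activity\": \"bike ride\", \"animals\": [\"puppy\",\"cat\",\"raccoon\"], \"animals_seen\": 3, \"location\": \"park\" }"}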
I also experienced the same with this model, whether using a grammar or just attempting to use the function_calling functionality.
This is a bug in the JSON-based tool-calling implementation in the context of the Vision Instruct models. The issue occurs with both Llama 3.2 11B Vision Instruct and 90B Vision Instruct, and with versions 2.4.0 and 3.0.2 (the latest at the time of writing). For illustration, a tools request of the shape shown below is enough to trigger the assert when the loaded model is one of the Vision Instruct checkpoints.
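A sketch of such a failing request, assuming the vision model is being served at 127.0.0.1:8080 (the schema mirrors the working text-only example further down):
curl http://127.0.0.1:8080/v1/chat/completions -X POST -H 'Content-Type: application/json' -d '{
  "model": "tgi",
  "messages": [
    {
      "role": "user",
      "content": "What is the weather like in New York?"
    }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_current_weather",
        "description": "Get the current weather",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "The city and state, e.g. San Francisco, CA"
            }
          },
          "required": ["location"]
        }
      }
    }
  ],
  "tool_choice": "get_current_weather"
}'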
But it works with Llama 3.1 70B (a text-only model):
curl http://127.0.0.1:8080/v1/chat/completions -X POST -H 'Content-Type: application/json' -d '{
  "model": "tgi",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant. When a tool is required, return a JSON object with the tool name and parameters."
    },
    {
      "role": "user",
      "content": "What is the weather like in New York?"
    }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_current_weather",
        "description": "Get the current weather",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "The city and state, e.g. San Francisco, CA"
            },
            "format": {
              "type": "string",
              "enum": ["celsius", "fahrenheit"],
              "description": "The temperature unit to use."
            }
          },
          "required": ["location", "format"]
        }
      }
    }
  ],
  "tool_choice": "get_current_weather"
}'
{"object":"chat.completion","id":"","created":1738216950,"model":"meta-llama/Llama-3.1-70B-Instruct","system_fingerprint":"3.0.2-sha-b70f29d","choices":[{"index":0,"message":{"role":"assistant","tool_calls":[{"id":"0","type":"function","function":{"description":null,"name":"get_current_weather","arguments":{"format":"celsius","location":"New York"}}}]},"logprobs":null,"finish_reason":"stop"}],"usage":{"prompt_tokens":319,"completion_tokens":26,"total_tokens":345}}