Fix `TGI` (Text Generation Inference) Endpoint Inference and TGI JSON Grammar Generation
Description
While implementing a custom task with lighteval, I needed constrained grammar generation with TGI, and it turns out the TGI integration is out of date and not working.
Fixes for TGI Endpoint Inference
- The `/info` route of TGI `3.0.1` doesn't always return required fields such as `model_dtype`, so it is now set to `None` by default if not found:
$ curl http://localhost:8080/info
{"model_id":"unsloth/Qwen2.5-0.5B-Instruct","model_sha":"6a7b5090fc11df0706c796b7ba76762d7beb688b","model_pipeline_tag":"text-generation","max_concurrent_requests":128,"max_best_of":2,"max_stop_sequences":4,"max_input_tokens":32767,"max_total_tokens":32768,"validation_workers":2,"max_client_batch_size":4,"router":"text-generation-router","version":"3.0.1","sha":"bb9095aae339579fbf3b4e7be3909932de26a7ee","docker_label":"sha-bb9095a"}
- `AsyncClient` from TGI has a `generate` function that expects multiple parameters and not a structure.
- I've set the `do_sample`, `return_full_text` and `watermark` parameters to `False` by default, since they come from `huggingface_hub`, which accepts `None` as a default, but TGI doesn't accept `None` for them.
  - Question for a maintainer: should they be set as such by default? I don't see them being provided to `_async_process_request` anyway, and maybe this should be fixed in another PR. Same for `adapter_id` for LoRA heads.
- `ModelClient`'s usage has been fixed to use the `config: TGIModelConfig` by default instead of named parameters. (A short sketch of these changes follows this list.)
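For illustration, here is a minimal sketch of the two client-side fixes above. It assumes the `/info` route is reached with `requests` and that the `text_generation` client exposes `AsyncClient.generate` with these keyword parameters; the helper names are hypothetical, not lighteval's actual functions:

```python
# Hypothetical sketch, not the exact lighteval implementation.
import asyncio

import requests
from text_generation import AsyncClient


def fetch_model_info(address: str) -> dict:
    """Read /info defensively: TGI 3.0.1 may omit fields such as model_dtype."""
    info = requests.get(f"{address}/info").json()
    return {
        "model_id": info.get("model_id"),
        "model_dtype": info.get("model_dtype"),  # None when the route omits it
    }


async def generate(address: str, prompt: str) -> str:
    client = AsyncClient(address)
    # generate() takes individual keyword parameters, not a parameter struct;
    # the booleans are passed explicitly because TGI rejects None for them.
    response = await client.generate(
        prompt,
        max_new_tokens=128,
        do_sample=False,
        return_full_text=False,
        watermark=False,
        stop_sequences=["\n\n"],
    )
    return response.generated_text


if __name__ == "__main__":
    print(fetch_model_info("http://localhost:8080"))
    print(asyncio.run(generate("http://localhost:8080", "Hello")))
```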
Fixes for TGI JSON Grammar Generation
- Updated `text_generation` to `0.7.0`
- Added support for the `grammar` field to enable JSON grammar generation (see the sketch below)
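As a rough illustration of the new `grammar` field, here is a sketch assuming the `text_generation==0.7.0` client API (`Grammar` and `GrammarType` from `text_generation.types`); the schema mirrors the NER grammar visible in the TGI logs further down:

```python
# Sketch only: constrain TGI generation to a JSON schema via the grammar field.
import asyncio

from text_generation import AsyncClient
from text_generation.types import Grammar, GrammarType

# JSON schema for the NER-style output (entity + classification pairs).
schema = {
    "type": "object",
    "properties": {
        "entities": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "entity": {"type": "string"},
                    "classification": {"type": "string"},
                },
                "required": ["entity", "classification"],
            },
        }
    },
    "required": ["entities"],
}


async def main() -> None:
    client = AsyncClient("http://localhost:8080")
    response = await client.generate(
        "Extract the entities from: ...",
        max_new_tokens=128,
        # The JSON grammar forces the output to match the schema above.
        grammar=Grammar(type=GrammarType.Json, value=schema),
    )
    print(response.generated_text)


asyncio.run(main())
```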
Environment
Command
uv run lighteval endpoint tgi tgi.yaml "custom|...|0|0" --custom-tasks "ner_eval.py" --output-dir "results" --max-samples 10 --override-batch-size 1 --use-chat-template --save-details --no-public-run
Dependencies
dependencies = [
"datasets>=3.2.0",
"huggingface-hub>=0.27.1",
"lighteval[tgi]>=0.7.0",
"numpy>=1.26.4",
"pandas>=2.2.3",
"pydantic>=1.10.21",
"text-generation==0.6.0",
"torch>=2.4.1",
"torchvision>=0.19.1",
]
[tool.uv.sources]
lighteval = { path = "../../../../lighteval", editable = true } # This branch
model_config_path argument for TGI
tgi.yaml:
model:
instance:
inference_server_address: "http://localhost:8080"
inference_server_auth: null
model_id: null # Optional, only required if the TGI container was launched with model_id pointing to a local directory
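For completeness, here is a rough sketch of how the YAML above maps onto the config-based `ModelClient` usage this PR introduces; the import paths and the exact `TGIModelConfig` constructor are assumptions, not the actual loader code:

```python
# Rough sketch (not lighteval's actual loader): parse the YAML above and pass a
# single config object to ModelClient, as model_loader.py now does in this PR.
# Import paths and the TGIModelConfig signature are assumptions.
import yaml
from lighteval.models.endpoints.tgi_model import ModelClient, TGIModelConfig

with open("tgi.yaml") as f:
    instance = yaml.safe_load(f)["model"]["instance"]

config = TGIModelConfig(
    inference_server_address=instance["inference_server_address"],
    inference_server_auth=instance["inference_server_auth"],
    model_id=instance["model_id"],
)

# Previously: ModelClient(address=..., auth_token=..., model_id=...) -> TypeError
model = ModelClient(config=config)
```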
Test Results
It works, as can be seen from the logs below.
TGI Logs with JSON Grammar Generation
2025-01-15T17:09:34.811955Z INFO compat_generate{default_return_full_text=true compute_type=Extension(ComputeType("1-nvidia-geforce-rtx-3060"))}:generate{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(128), return_full_text: Some(false), stop: ["\n\n", "<|im_end|>"], truncate: None, watermark: false, details: true, decoder_input_details: true, seed: None, top_n_tokens: None, grammar: Some(Json(Object {"type": String("object"), "properties": Object {"entities": Object {"type": String("array"), "items": Object {"type": String("object"), "properties": Object {"entity": Object {"type": String("string")}, "classification": Object {"type": String("string"), "enum": Array [String("merchant"), String("bank"), String("individual"), String("date"), String("location"), String("unknown")]}}, "required": Array [String("entity"), String("classification")]}}}, "required": Array [String("entities")]})), adapter_id: None } total_time="428.587752ms" validation_time="716.935µs" queue_time="82.504µs" inference_time="427.788413ms" time_per_token="25.164024ms" seed="None"}: text_generation_router::server: router/src/server.rs:422: Success
Lighteval Logs
(py3.11.3) cpcdoy@cpcdoy-desktop:~/projects/.../llm_tasks_eval$ uv run lighteval endpoint tgi tgi.yaml "custom|...|0|0" --custom-tasks "ner_eval.py" --output-dir "results" --max-samples 10 --override-batch-size 1 --use-chat-template --save-details --no-public-run
warning: `VIRTUAL_ENV=/home/cpcdoy/py3.11.3` does not match the project environment path `.venv` and will be ignored
[2025-01-15 15:11:24,861] [ INFO]: PyTorch version 2.4.1 available. (config.py:54)
[2025-01-15 15:11:28,418] [ WARNING]: --max_samples WAS SET. THESE NUMBERS ARE ONLY PARTIAL AND SHOULD NOT BE USED FOR COMPARISON UNLESS YOU KNOW WHAT YOU ARE DOING. (pipeline.py:132)
[2025-01-15 15:11:28,418] [ INFO]: --- LOADING MODEL --- (pipeline.py:168)
[2025-01-15 15:11:28,418] [ INFO]: Load model from inference server: http://localhost:8080 (model_loader.py:110)
[2025-01-15 15:11:28,846] [ INFO]: --- LOADING TASKS --- (pipeline.py:195)
[2025-01-15 15:11:28,858] [ WARNING]: If you want to use extended_tasks, make sure you installed their dependencies using `pip install -e .[extended_tasks]`. (registry.py:136)
[2025-01-15 15:11:28,858] [ INFO]: Found 1 custom tasks in /home/cpcdoy/.cache/huggingface/modules/datasets_modules/datasets/ner_eval/1739d6fd80c40f11df64fba54bf39bd05b1b1408659c4325f28f0ca9ee2a04b0/ner_eval.py (registry.py:141)
[2025-01-15 15:11:28,861] [ INFO]: ... default (lighteval_task.py:187)
[2025-01-15 15:11:28,861] [ WARNING]: Careful, the task ... is using evaluation data to build the few shot examples. (lighteval_task.py:261)
[2025-01-15 15:11:28,898] [ INFO]: --- INIT SEEDS --- (pipeline.py:224)
[2025-01-15 15:11:28,899] [ INFO]: --- RUNNING MODEL --- (pipeline.py:267)
[2025-01-15 15:11:28,899] [ INFO]: Running RequestType.GREEDY_UNTIL requests (pipeline.py:271)
[2025-01-15 15:11:28,903] [ WARNING]: You cannot select the number of dataset splits for a generative evaluation at the moment. Automatically inferring. (data.py:260)
Splits: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:04<00:00, 4.90s/it]
[2025-01-15 15:11:33,800] [ INFO]: --- COMPUTING METRICS --- (pipeline.py:299)
[2025-01-15 15:11:33,802] [ INFO]: --- DISPLAYING RESULTS --- (pipeline.py:342)
| Task |Version| Metric |Value| |Stderr|
|-----------------------------|------:|-----------------------|----:|---|-----:|
...
[2025-01-15 15:11:33,824] [ INFO]: --- SAVING AND PUSHING RESULTS --- (pipeline.py:332)
[2025-01-15 15:11:33,825] [ INFO]: Saving experiment tracker (evaluation_tracker.py:154)
[2025-01-15 15:11:33,848] [ INFO]: Saving results to ... (evaluation_tracker.py:208)
[2025-01-15 15:11:33,851] [ INFO]: Saving details to ... (evaluation_tracker.py:216)
Creating parquet from Arrow format: 100%|████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 82.46ba/s]
Note: I have anonymized parts of the logs
Updated the PR to add support for JSON Grammar Constrained Generation for TGI
UP! I encountered a similar issue where the bug prevented us from using the TGI endpoint. The key issues I found are:
- Lines 111-113 in `src/lighteval/models/model_loader.py`: the current implementation, `model = ModelClient(address=config.inference_server_address, auth_token=config.inference_server_auth, model_id=config.model_id)`, should be updated to `model = ModelClient(config=config)`. This ensures that the initialization parameters are correctly passed to `ModelClient`, resolving configuration-related issues.
- `model_dtype` issue: `model_dtype` is not consistently available on the `/info` route of TGI, which leads to errors when the field is required. To address this, `model_dtype` should be set to `None` by default.
Exactly @naufalso , this is already solved in this PR!
+1, is this going to be merged @NathanHB? Would really like to use lighteval with locally hosted TGI, but I'm seeing the same `TypeError: ModelClient.__init__() got an unexpected keyword argument 'address'` error described above.
hey ! Thanks for the PR it seems good to merge, just need to fix the tests
@NathanHB Apologies for the delay, I missed your approval comment!
I've fixed the unit tests in two files that simply needed the new grammar field to be added.
I've also noticed that the `langcodes` dependency was missing from the `multilingual` extra when I ran the tests, so I added it there.
I've tried both without and with --runslow:
- without `--runslow`: everything passes
➜ lighteval git:(fix/tgi_inference) ✗ uv run --extra tests --extra dev pytest -xvvs /home/cpcdoy/projects/abwab.ai/lighteval/tests/
...
================================================== 634 passed, 7 skipped, 5 warnings in 38.11s ==================================================
- with `--runslow`: it seems the accuracy increased in this run, but since it's a `vLLM` run, I expect it's unrelated? Lmk what you think.
➜ lighteval git:(fix/tgi_inference) ✗ uv run --extra tests --extra dev pytest -xvvs /home/cpcdoy/projects/abwab.ai/lighteval/tests/
...
FAILED tests/slow_tests/test_vllm_model.py::test_vllm_model[examples/model_configs/vllm_model_config.yaml] - AssertionError: Differences found: {'values_changed': {"root['lighteval:agieval:logiqa-en:0']['acc']": {'new_value': 0.3, 'old_value': 0.2}}}
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
================================================== 1 failed, 593 passed, 4 skipped, 8 warnings in 1637.09s (0:27:17) ==================================================
Hello @NathanHB , just checking if there's any news on this PR? Lmk if I need to provide any support
hey ! Sorry for the late review. I just took another look and there's been a refactor of the codebase; it doesn't seem to affect your code that much, but you would need to rename the `request` variable in the `endpoint_model.py` file, for example.
and overall make sure it all works :)
Hey, no worries @NathanHB! I've adapted the code to use `doc` instead of the `request` variable after the refactor. I've fixed the tests to include the new `grammar` field (all of them pass), and I've re-run my own benchmark suite that uses lighteval with TGI as a backend to check that everything still works after the refactor. Everything looks good :)
thanks ! Last thing, can you provide a config in which you use the grammar arg ? I will test locally to make sure everything is fine on this side
@NathanHB I actually noticed that my uv env was using an older version of some of lighteval's files in my benchmarking suite, so I had to make a few more changes to accommodate your refactoring. I've also adapted `generation_parameters` to work the same way you're now doing it in other endpoints.
Also, all tests pass.
Furthermore, I have created an example custom task that uses a publicly available dataset (the emotion dataset) from the HF Hub on a classification task, demonstrating the newly implemented constrained grammar generation feature with TGI. I added this example in `examples/custom_tasks_templates/custom_task_classification_grammar_task.py` and updated `examples/model_configs/tgi_model.yaml` accordingly.
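For reference, the JSON schema that the example task constrains generation to looks roughly like this (reconstructed from the grammar visible in the TGI logs below; see the task file itself for how it is wired in):

```python
# JSON schema used to constrain the emotion classification output,
# as it appears in the grammar field of the TGI request logs below.
classification_schema = {
    "type": "object",
    "properties": {
        "classification": {
            "type": "string",
            "description": "Emotion classification from the provided list",
            "enum": ["sadness", "joy", "love", "anger", "fear", "surprise"],
        }
    },
    "required": ["classification"],
    "additionalProperties": False,
}
```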
How to run the example
Here's how to run it from the root of the lighteval directory:
- [Optional] Remove the `lighteval` cache before the run: `rm -rf ~/.cache/huggingface/lighteval/*`
- Start a TGI server first:
model="unsloth/Qwen2.5-0.5B-Instruct"
volume=./data
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:3.3.4 --model-id $model
- Run the `lighteval` task:
uv run --active --extra tgi lighteval endpoint tgi examples/model_configs/tgi_model.yaml "custom|emotion_classification|0|0" --custom-tasks examples/custom_tasks_templates/custom_task_classification_grammar_task.py --output-dir results --save-details --no-public-run --max-samples 10
Logs from the example run
TGI Logs
While running the lighteval task, you'll notice that TGI registers the request, including the grammar, in its logs, such as:
2025-08-20T13:50:20.563969Z INFO text_generation_router_v3::radix: backends/v3/src/radix.rs:108: Prefix 0 - Suffix 325
2025-08-20T13:50:20.665852Z INFO compat_generate{default_return_full_text=true compute_type=Extension(ComputeType("1-nvidia-geforce-rtx-3060")) context=Extension(None)}:generate{parameters=GenerateParameters { best_of: None, temperature: Some(0.1), repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: Some(0.9), typical_p: None, do_sample: false, max_new_tokens: Some(64), return_full_text: Some(false), stop: ["\n\n"], truncate: None, watermark: false, details: true, decoder_input_details: true, seed: None, top_n_tokens: None, grammar: Some(Json(Object {"type": String("object"), "properties": Object {"classification": Object {"type": String("string"), "description": String("Emotion classification from the provided list"), "enum": Array [String("sadness"), String("joy"), String("love"), String("anger"), String("fear"), String("surprise")]}}, "required": Array [String("classification")], "additionalProperties": Bool(false)})), adapter_id: None } total_time="102.706171ms" validation_time="778.556µs" queue_time="104.307µs" inference_time="101.823408ms" time_per_token="12.727926ms" seed="Some(5420590878626193495)"}: text_generation_router::server: router/src/server.rs:432: Success
Lighteval Logs
And lighteval will show logs such as:
[2025-08-20 15:50:20,810] [ INFO]: - Prediction: {'classification': 'joy'} (custom_task_classification_grammar_task.py:189)
[2025-08-20 15:50:20,810] [ INFO]: - Expected: joy (index: 1) (custom_task_classification_grammar_task.py:190)
[2025-08-20 15:50:20,811] [ INFO]: - Metrics: {'exact_match': 1.0, 'unknown_prediction': 0.0, 'total_samples': 1.0} (custom_task_classification_grammar_task.py:202)
[2025-08-20 15:50:20,811] [ INFO]: ✓ Correct prediction (custom_task_classification_grammar_task.py:204)
Please lmk if you have any questions!
Thank you for the reviews @NathanHB, I've applied everything! I've also improved a unit test for TGI caching by mocking the HTTP request for the `/info` route of the TGI server.
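For context, mocking the `/info` route in a test can look roughly like the sketch below. It uses `unittest.mock`; the helper under test (`fetch_model_info`) and the patch target are placeholders, not the names actually used in the PR:

```python
# Sketch of a unit test that mocks the TGI /info route; names are placeholders.
from unittest.mock import MagicMock, patch

import requests


def fetch_model_info(address: str) -> dict:
    # Placeholder for the code under test: reads /info and tolerates missing fields.
    info = requests.get(f"{address}/info").json()
    return {"model_id": info.get("model_id"), "model_dtype": info.get("model_dtype")}


@patch("requests.get")
def test_info_without_model_dtype(mock_get):
    # Simulate a TGI 3.0.1 /info payload that omits model_dtype.
    mock_get.return_value = MagicMock(
        json=lambda: {"model_id": "unsloth/Qwen2.5-0.5B-Instruct", "version": "3.0.1"}
    )
    info = fetch_model_info("http://localhost:8080")
    assert info["model_dtype"] is None
```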