
Fix `TGI` (Text Generation Inference) Endpoint Inference and TGI JSON Grammar Generation

Open cpcdoy opened this issue 11 months ago • 6 comments

Description

While implementing a custom task using lighteval, I needed constrained grammar generation with TGI, and it turned out the TGI integration is out of date and not working.

Fixes for TGI Endpoint Inference

  • The /info route of TGI 3.0.1 doesn't always return fields lighteval expects, such as model_dtype, so model_dtype now defaults to None when not found:
$ curl http://localhost:8080/info
{"model_id":"unsloth/Qwen2.5-0.5B-Instruct","model_sha":"6a7b5090fc11df0706c796b7ba76762d7beb688b","model_pipeline_tag":"text-generation","max_concurrent_requests":128,"max_best_of":2,"max_stop_sequences":4,"max_input_tokens":32767,"max_total_tokens":32768,"validation_workers":2,"max_client_batch_size":4,"router":"text-generation-router","version":"3.0.1","sha":"bb9095aae339579fbf3b4e7be3909932de26a7ee","docker_label":"sha-bb9095a"}
  • AsyncClient from TGI has a generate function that expects individual keyword arguments rather than a single request structure (see the sketch after this list).
    • I've set the do_sample, return_full_text and watermark parameters to False by default: they come from huggingface_hub, which accepts None as a default, but TGI doesn't accept None for them.
      • Question for a maintainer: should they be set that way by default? I don't see them being provided to _async_process_request anyway, and maybe this should be fixed in another PR. Same for adapter_id for LoRA heads.
  • ModelClient usage has been fixed to take a single config: TGIModelConfig argument instead of named parameters
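
For context, here is a minimal sketch (not the PR's exact code) of the call shape after the fix, assuming the text_generation client's AsyncClient: generate takes individual keyword arguments, with the parameters TGI rejects as None pinned to False:

# Minimal sketch, not the PR's exact code: AsyncClient.generate takes
# individual keyword arguments rather than a request struct, and the
# parameters TGI rejects as None are pinned to False.
import asyncio

from text_generation import AsyncClient

async def main():
    client = AsyncClient("http://localhost:8080")
    response = await client.generate(
        "What is Deep Learning?",
        max_new_tokens=128,
        do_sample=False,         # explicit False; TGI rejects None
        return_full_text=False,  # explicit False; TGI rejects None
        watermark=False,         # explicit False; TGI rejects None
    )
    print(response.generated_text)

asyncio.run(main())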

Fixes for TGI JSON Grammar Generation

  • Updated text_generation to 0.7.0
  • Added support for the grammar field to enable JSON grammar generation
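
For illustration, a hedged sketch of passing a JSON grammar through the updated client, assuming the Grammar and GrammarType types exposed by text_generation.types in 0.7.0; the schema is a trimmed version of the entity-classification schema visible in the TGI logs below:

# Hedged sketch of JSON grammar generation with text_generation >= 0.7.0.
# The schema mirrors (in trimmed form) the one in the TGI logs below.
import asyncio

from text_generation import AsyncClient
from text_generation.types import Grammar, GrammarType

schema = {
    "type": "object",
    "properties": {
        "entities": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "entity": {"type": "string"},
                    "classification": {
                        "type": "string",
                        "enum": ["merchant", "bank", "individual", "date", "location", "unknown"],
                    },
                },
                "required": ["entity", "classification"],
            },
        },
    },
    "required": ["entities"],
}

async def main():
    client = AsyncClient("http://localhost:8080")
    response = await client.generate(
        "Extract the entities: John paid ACME Corp in Paris on 2024-01-01.",
        max_new_tokens=128,
        grammar=Grammar(type=GrammarType.Json, value=schema),
    )
    print(response.generated_text)  # output is constrained to the schema

asyncio.run(main())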

Environment

Command

uv run lighteval endpoint tgi tgi.yaml "custom|...|0|0" --custom-tasks "ner_eval.py" --output-dir "results" --max-samples 10 --override-batch-size 1 --use-chat-template --save-details --no-public-run

Dependencies

dependencies = [
    "datasets>=3.2.0",
    "huggingface-hub>=0.27.1",
    "lighteval[tgi]>=0.7.0",
    "numpy>=1.26.4",
    "pandas>=2.2.3",
    "pydantic>=1.10.21",
    "text-generation==0.6.0",
    "torch>=2.4.1",
    "torchvision>=0.19.1",
]

[tool.uv.sources]
lighteval = { path = "../../../../lighteval", editable = true } # This branch

model_config_path argument for TGI

tgi.yaml:

model:
  instance:
    inference_server_address: "http://localhost:8080"
    inference_server_auth: null
    model_id: null # Optional, only required if the TGI container was launched with model_id pointing to a local directory

Test Results

It works, as the logs below show.

TGI Logs with JSON Grammar Generation

2025-01-15T17:09:34.811955Z  INFO compat_generate{default_return_full_text=true compute_type=Extension(ComputeType("1-nvidia-geforce-rtx-3060"))}:generate{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(128), return_full_text: Some(false), stop: ["\n\n", "<|im_end|>"], truncate: None, watermark: false, details: true, decoder_input_details: true, seed: None, top_n_tokens: None, grammar: Some(Json(Object {"type": String("object"), "properties": Object {"entities": Object {"type": String("array"), "items": Object {"type": String("object"), "properties": Object {"entity": Object {"type": String("string")}, "classification": Object {"type": String("string"), "enum": Array [String("merchant"), String("bank"), String("individual"), String("date"), String("location"), String("unknown")]}}, "required": Array [String("entity"), String("classification")]}}}, "required": Array [String("entities")]})), adapter_id: None } total_time="428.587752ms" validation_time="716.935µs" queue_time="82.504µs" inference_time="427.788413ms" time_per_token="25.164024ms" seed="None"}: text_generation_router::server: router/src/server.rs:422: Success

Lighteval Logs

(py3.11.3) cpcdoy@cpcdoy-desktop:~/projects/.../llm_tasks_eval$ uv run lighteval endpoint tgi tgi.yaml "custom|...|0|0" --custom-tasks "ner_eval.py" --output-dir "results" --max-samples 10 --override-batch-size 1 --use-chat-template --save-details --no-public-run
warning: `VIRTUAL_ENV=/home/cpcdoy/py3.11.3` does not match the project environment path `.venv` and will be ignored
[2025-01-15 15:11:24,861] [    INFO]: PyTorch version 2.4.1 available. (config.py:54)
[2025-01-15 15:11:28,418] [ WARNING]: --max_samples WAS SET. THESE NUMBERS ARE ONLY PARTIAL AND SHOULD NOT BE USED FOR COMPARISON UNLESS YOU KNOW WHAT YOU ARE DOING. (pipeline.py:132)
[2025-01-15 15:11:28,418] [    INFO]: --- LOADING MODEL --- (pipeline.py:168)
[2025-01-15 15:11:28,418] [    INFO]: Load model from inference server: http://localhost:8080 (model_loader.py:110)
[2025-01-15 15:11:28,846] [    INFO]: --- LOADING TASKS --- (pipeline.py:195)
[2025-01-15 15:11:28,858] [ WARNING]: If you want to use extended_tasks, make sure you installed their dependencies using `pip install -e .[extended_tasks]`. (registry.py:136)
[2025-01-15 15:11:28,858] [    INFO]: Found 1 custom tasks in /home/cpcdoy/.cache/huggingface/modules/datasets_modules/datasets/ner_eval/1739d6fd80c40f11df64fba54bf39bd05b1b1408659c4325f28f0ca9ee2a04b0/ner_eval.py (registry.py:141)
[2025-01-15 15:11:28,861] [    INFO]: ... default (lighteval_task.py:187)
[2025-01-15 15:11:28,861] [ WARNING]: Careful, the task ... is using evaluation data to build the few shot examples. (lighteval_task.py:261)
[2025-01-15 15:11:28,898] [    INFO]: --- INIT SEEDS --- (pipeline.py:224)
[2025-01-15 15:11:28,899] [    INFO]: --- RUNNING MODEL --- (pipeline.py:267)
[2025-01-15 15:11:28,899] [    INFO]: Running RequestType.GREEDY_UNTIL requests (pipeline.py:271)
[2025-01-15 15:11:28,903] [ WARNING]: You cannot select the number of dataset splits for a generative evaluation at the moment. Automatically inferring. (data.py:260)
Splits: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:04<00:00,  4.90s/it]
[2025-01-15 15:11:33,800] [    INFO]: --- COMPUTING METRICS --- (pipeline.py:299)                                                                  
[2025-01-15 15:11:33,802] [    INFO]: --- DISPLAYING RESULTS --- (pipeline.py:342)
|            Task             |Version|        Metric         |Value|   |Stderr|
|-----------------------------|------:|-----------------------|----:|---|-----:|
...

[2025-01-15 15:11:33,824] [    INFO]: --- SAVING AND PUSHING RESULTS --- (pipeline.py:332)
[2025-01-15 15:11:33,825] [    INFO]: Saving experiment tracker (evaluation_tracker.py:154)
[2025-01-15 15:11:33,848] [    INFO]: Saving results to ... (evaluation_tracker.py:208)
[2025-01-15 15:11:33,851] [    INFO]: Saving details to ... (evaluation_tracker.py:216)
Creating parquet from Arrow format: 100%|████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 82.46ba/s]

Note: I have anonymized parts of the logs.

cpcdoy avatar Jan 15 '25 14:01 cpcdoy

Updated the PR to add support for JSON Grammar Constrained Generation for TGI

cpcdoy avatar Jan 15 '25 17:01 cpcdoy

UP! I encountered a similar issue where the bug prevented us from using the TGI endpoint. The key issues I found are:

  • Lines 111-113 in `src/lighteval/models/model_loader.py`:
    The current implementation:

    model = ModelClient(address=config.inference_server_address, auth_token=config.inference_server_auth, model_id=config.model_id)  
    

    should be updated to:

    model = ModelClient(config=config)  
    

    This ensures that the initialization parameters are correctly passed to ModelClient, resolving configuration-related issues.

  • model_dtype issue:
    The model_dtype field is not consistently returned by the /info route of TGI, which leads to errors when the field is treated as required. To address this, model_dtype should default to None (a minimal sketch follows below).
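
A minimal sketch of the defensive read, assuming the /info response shape shown earlier (the helper name is made up for illustration):

# Minimal sketch (hypothetical helper name): read model_dtype defensively
# from TGI's /info route, defaulting to None when the field is absent.
import requests

def get_model_dtype(server_address: str) -> str | None:
    info = requests.get(f"{server_address}/info").json()
    return info.get("model_dtype")  # None when TGI 3.x omits the field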

naufalso avatar Feb 07 '25 09:02 naufalso

Exactly @naufalso, this is already solved in this PR!

cpcdoy avatar Feb 07 '25 18:02 cpcdoy

+1, is this going to be merged @NathanHB? I would really like to use lighteval with a locally hosted TGI, but I'm seeing the same TypeError: ModelClient.__init__() got an unexpected keyword argument 'address' error described above.

ZQ-Dev8 avatar Apr 02 '25 18:04 ZQ-Dev8


Hey! Thanks for the PR, it seems good to merge, just need to fix the tests.

NathanHB avatar Apr 17 '25 10:04 NathanHB

@NathanHB Apologies for the delay, I missed your approval comment!

I've fixed the unit tests in two files that simply needed the new grammar field to be added.

I've also noticed that the langcodes dependency was missing from the multilingual extra when I ran the tests, so I added it there.

I've tried both without and with --runslow:

  • without --runslow: everything passes
➜  lighteval git:(fix/tgi_inference) ✗ uv run --extra tests --extra dev pytest -xvvs /home/cpcdoy/projects/abwab.ai/lighteval/tests/

...

================================================== 634 passed, 7 skipped, 5 warnings in 38.11s ==================================================
  • with --runslow: it seems the accuracy increased in this run, but since it's a vLLM test I expect it's unrelated to this PR. Lmk what you think.
➜  lighteval git:(fix/tgi_inference) ✗ uv run --extra tests --extra dev pytest -xvvs /home/cpcdoy/projects/abwab.ai/lighteval/tests/

...
FAILED tests/slow_tests/test_vllm_model.py::test_vllm_model[examples/model_configs/vllm_model_config.yaml] - AssertionError: Differences found: {'values_changed': {"root['lighteval:agieval:logiqa-en:0']['acc']": {'new_value': 0.3, 'old_value': 0.2}}}
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
================================================== 1 failed, 593 passed, 4 skipped, 8 warnings in 1637.09s (0:27:17) ==================================================

cpcdoy avatar Jun 11 '25 09:06 cpcdoy

Hello @NathanHB , just checking if there's any news on this PR? Lmk if I need to provide any support

cpcdoy avatar Jul 08 '25 11:07 cpcdoy

Hey! Sorry for the late review. I took another look and there's been a refactor of the codebase. It doesn't seem to affect your code that much, but you would need to rename the request variable in the endpoint_model.py file, for example, and overall make sure it all works :)

NathanHB avatar Aug 12 '25 15:08 NathanHB

Hey, no worries @NathanHB! I've adapted the code to use the doc variable instead of request after the refactor. I've fixed the tests to include the new grammar field (all of them pass), and I've re-run my own benchmark suite that uses lighteval with TGI as a backend to check that everything still works after the refactor. Everything looks good :)

cpcdoy avatar Aug 15 '25 09:08 cpcdoy

Thanks! Last thing: can you provide a config in which you use the grammar arg? I will test locally to make sure everything is fine on this side.

NathanHB avatar Aug 18 '25 10:08 NathanHB

@NathanHB I noticed that my uv env was using an older version of some of lighteval's files in my benchmarking suite, so I had to make a few more changes to accommodate your refactoring. I've also adapted generation_parameters to work the same way you're now doing it in other endpoints.

Also, all tests pass.

Furthermore, I have also created an example usage of a custom task that uses a publicly available dataset (the emotion dataset from the HF Hub) on a classification task, demonstrating the newly implemented constrained grammar generation feature with TGI. I added this example in examples/custom_tasks_templates/custom_task_classification_grammar_task.py and updated examples/model_configs/tgi_model.yaml accordingly. The schema the example constrains generation to is sketched just below.
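
For reference, here is that schema reconstructed as a Python dict from the TGI log further below (the constant name is made up):

# The example task's classification schema, reconstructed from the TGI log
# below: generation is constrained to a single string field with a fixed enum.
EMOTION_SCHEMA = {
    "type": "object",
    "properties": {
        "classification": {
            "type": "string",
            "description": "Emotion classification from the provided list",
            "enum": ["sadness", "joy", "love", "anger", "fear", "surprise"],
        },
    },
    "required": ["classification"],
    "additionalProperties": False,
}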

How to run the example

Here's how to run it from the root of the lighteval directory:

  • [Optional] Remove lighteval cache before the run: rm -rf ~/.cache/huggingface/lighteval/*
  • Start a TGI server first:
model="unsloth/Qwen2.5-0.5B-Instruct"
volume=./data

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:3.3.4 --model-id $model
  • Run the lighteval task:
uv run --active --extra tgi lighteval endpoint tgi examples/model_configs/tgi_model.yaml "custom|emotion_classification|0|0" --custom-tasks examples/custom_tasks_templates/custom_task_classification_grammar_task.py --output-dir results --save-details --no-public-run --max-samples 10

Logs from the example run

TGI Logs

While running the lighteval task, you'll notice that TGI logs the request, including the grammar, for example:

2025-08-20T13:50:20.563969Z  INFO text_generation_router_v3::radix: backends/v3/src/radix.rs:108: Prefix 0 - Suffix 325
2025-08-20T13:50:20.665852Z  INFO compat_generate{default_return_full_text=true compute_type=Extension(ComputeType("1-nvidia-geforce-rtx-3060")) context=Extension(None)}:generate{parameters=GenerateParameters { best_of: None, temperature: Some(0.1), repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: Some(0.9), typical_p: None, do_sample: false, max_new_tokens: Some(64), return_full_text: Some(false), stop: ["\n\n"], truncate: None, watermark: false, details: true, decoder_input_details: true, seed: None, top_n_tokens: None, grammar: Some(Json(Object {"type": String("object"), "properties": Object {"classification": Object {"type": String("string"), "description": String("Emotion classification from the provided list"), "enum": Array [String("sadness"), String("joy"), String("love"), String("anger"), String("fear"), String("surprise")]}}, "required": Array [String("classification")], "additionalProperties": Bool(false)})), adapter_id: None } total_time="102.706171ms" validation_time="778.556µs" queue_time="104.307µs" inference_time="101.823408ms" time_per_token="12.727926ms" seed="Some(5420590878626193495)"}: text_generation_router::server: router/src/server.rs:432: Success

Lighteval Logs

And lighteval will show logs such as:

[2025-08-20 15:50:20,810] [    INFO]: - Prediction: {'classification': 'joy'} (custom_task_classification_grammar_task.py:189)
[2025-08-20 15:50:20,810] [    INFO]: - Expected: joy (index: 1) (custom_task_classification_grammar_task.py:190)
[2025-08-20 15:50:20,811] [    INFO]: - Metrics: {'exact_match': 1.0, 'unknown_prediction': 0.0, 'total_samples': 1.0} (custom_task_classification_grammar_task.py:202)
[2025-08-20 15:50:20,811] [    INFO]: ✓ Correct prediction (custom_task_classification_grammar_task.py:204)

Please lmk if you have any questions!

cpcdoy avatar Aug 20 '25 13:08 cpcdoy

Thank you for the reviews @NathanHB, I've applied everything! I've also improved a unit test for TGI caching by mocking the HTTP request to the /info route of the TGI server; a sketch of the approach is below.
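
A minimal sketch of the mocking approach, not the PR's actual test: stub the HTTP layer with unittest.mock so the /info route can be exercised without a live TGI server (the fake response deliberately omits model_dtype, matching real TGI 3.x):

# Minimal sketch, not the PR's actual test: stub requests.get so the /info
# route can be exercised without a live TGI server.
from unittest.mock import MagicMock, patch

import requests

FAKE_INFO = {
    "model_id": "unsloth/Qwen2.5-0.5B-Instruct",
    "version": "3.0.1",
    # deliberately no "model_dtype" key, matching the real 3.x response
}

def test_info_route_without_model_dtype():
    with patch("requests.get", return_value=MagicMock(json=lambda: FAKE_INFO)):
        info = requests.get("http://localhost:8080/info").json()
        # the fix under test: fall back to None instead of raising KeyError
        assert info.get("model_dtype") is None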

cpcdoy avatar Aug 21 '25 14:08 cpcdoy