text-generation-inference
Inexplicable 'incomplete generation' error
System Info
- SageMaker Realtime Inference endpoint
- TGI version: 2.4.1
- Instance: p4d (4x A100, 96 vCPUs, 1152 GB memory)
- MAX_INPUT_LENGTH: '16128'
- MAX_TOTAL_TOKENS: '16384'
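For context, a minimal sketch of how an endpoint with this configuration might be deployed via the SageMaker Python SDK; the image URI, execution role, endpoint name, and exact model id are placeholders/assumptions, not the actual deployment:

```python
# Sketch only: image_uri, role, endpoint_name, and model id are assumptions.
from sagemaker.huggingface import HuggingFaceModel

model = HuggingFaceModel(
    image_uri="<tgi-2.4.1-dlc-image>",   # placeholder TGI 2.4.1 container URI
    role="<sagemaker-execution-role>",   # placeholder IAM role
    env={
        "HF_MODEL_ID": "meta-llama/Llama-3.1-70B-Instruct",  # assumed id; report says Llama 3.1 70B
        "MAX_INPUT_LENGTH": "16128",     # values from the report
        "MAX_TOTAL_TOKENS": "16384",
    },
)

model.deploy(
    initial_instance_count=1,
    instance_type="ml.p4d.24xlarge",     # p4d instance from the report
    endpoint_name="<endpoint-name>",     # placeholder
)
```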
Information
- [x] Docker
- [ ] The CLI directly
Tasks
- [x] An officially supported command
- [ ] My own modifications
Reproduction
Unsure how to reproduce, as the error occurs sporadically, with no clear correlation to payload/prompts, traffic, or anything else. After an incomplete generation occurs, all subsequent requests time out, and the only way to recover is to restart the endpoint. This has mainly occurred with Llama 3.1 70B. A probe that makes the stuck state visible is sketched below.
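Not a reproduction, but a minimal client-side probe, assuming boto3 and a placeholder endpoint name: it sends a one-token request with a short read timeout. Before the error, it returns quickly; after the error, every such call times out until the endpoint is restarted.

```python
# Probe sketch illustrating the symptom; endpoint name is a placeholder.
import json

import boto3
from botocore.config import Config

runtime = boto3.client(
    "sagemaker-runtime",
    config=Config(read_timeout=30, retries={"max_attempts": 0}),
)

def endpoint_responds(endpoint_name: str) -> bool:
    """Send a minimal one-token request; False means the call did not return."""
    payload = {"inputs": "ping", "parameters": {"max_new_tokens": 1}}
    try:
        runtime.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType="application/json",
            Body=json.dumps(payload),
        )
        return True
    except Exception:
        # After the "Incomplete generation" error, every request times out
        # and lands here until the endpoint is restarted.
        return False

print(endpoint_responds("<endpoint-name>"))
```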
Example Logs:
Success
```json
{
  "timestamp": "2025-02-21T01:15:43.569760Z",
  "level": "INFO",
  "message": "Success",
  "target": "text_generation_router::server",
  "filename": "router/src/server.rs",
  "line_number": 407,
  "span": {
    "inference_time": "286.497811ms",
    "parameters": "GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(1), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None, adapter_id: None }",
    "queue_time": "1.276254517s",
    "seed": "None",
    "time_per_token": "286.497811ms",
    "total_time": "1.563136309s",
    "validation_time": "384.15µs",
    "name": "generate"
  },
  "spans": [
    {
      "inference_time": "286.497811ms",
      "parameters": "GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(1), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None, adapter_id: None }",
      "queue_time": "1.276254517s",
      "seed": "None",
      "time_per_token": "286.497811ms",
      "total_time": "1.563136309s",
      "validation_time": "384.15µs",
      "name": "generate"
    }
  ]
}
```
Custom Attributes
(response headers returned by TGI, as printed by the calling app at 2025/02/21 01:15:43)

```json
{
  "X-Compute-Characters": ["11"],
  "X-Compute-Time": ["1.563136309"],
  "X-Compute-Type": ["8-nvidia-a100-sxm4-40gb"],
  "X-Generated-Tokens": ["1"],
  "X-Inference-Time": ["286"],
  "X-Prompt-Tokens": ["3"],
  "X-Queue-Time": ["1276"],
  "X-Time-Per-Token": ["286"],
  "X-Total-Time": ["1563"],
  "X-Validation-Time": ["0"]
}
```
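If it helps with triage, these headers can be reduced to a queue-vs-inference split. A small sketch, assuming the calling app can hand over the header map above as JSON; `timing_summary` is a hypothetical helper, not part of TGI or SageMaker:

```python
# Hypothetical helper for the X-* timing headers shown above.
import json

def timing_summary(headers_json: str) -> dict:
    """Reduce the X-* headers to a queue-vs-inference split (milliseconds)."""
    headers = json.loads(headers_json)

    def first_int(name: str) -> int:
        return int(headers[name][0])  # each header value is a one-element list

    return {
        "queue_ms": first_int("X-Queue-Time"),
        "inference_ms": first_int("X-Inference-Time"),
        "total_ms": first_int("X-Total-Time"),
        "generated_tokens": first_int("X-Generated-Tokens"),
    }
```

For the successful request above this gives queue 1276 ms against inference 286 ms, so most of the 1563 ms total was spent waiting in the queue.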
Incomplete generation
```json
{
  "timestamp": "2025-02-21T01:15:43.600876Z",
  "level": "ERROR",
  "message": "Incomplete generation",
  "target": "text_generation_router::infer",
  "filename": "router/src/infer/mod.rs",
  "line_number": 246,
  "span": {
    "name": "generate"
  },
  "spans": [
    {
      "parameters": "GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(800), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None, adapter_id: None }",
      "name": "generate"
    },
    {
      "name": "generate"
    }
  ]
}
```
Expected behavior
Incomplete generations should not block the entire model; a single failed request should not require restarting the endpoint to recover.