
Inexplicable 'incomplete generation' error

Open mwm5945 opened this issue 8 months ago • 2 comments

System Info

SageMaker Realtime Inference endpoints
TGI version: 2.4.1
Instance: p4d (4x A100, 96 vCPUs, 1152 GB memory)

MAX_INPUT_LENGTH: '16128'
MAX_TOTAL_TOKENS: '16384'

Information

  • [x] Docker
  • [ ] The CLI directly

Tasks

  • [x] An officially supported command
  • [ ] My own modifications

Reproduction

Unsure, as it happens sporadically, with no clear correlation to payloads/prompts, traffic, or anything else. After the incomplete generation takes place, all subsequent requests time out, and the only way to recover is to restart the endpoint. This has mainly occurred with Llama 3.1 70B.
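As a client-side workaround for the failure mode described above (one incomplete generation wedging the endpoint so that every subsequent request times out), the calling app can track recent outcomes and trigger an endpoint restart after several consecutive failures. This is a hypothetical sketch, not part of TGI or SageMaker; the class name and thresholds are made up for illustration:

```python
from collections import deque


class EndpointWatchdog:
    """Track recent request outcomes and decide when the endpoint
    looks wedged (e.g. after an 'Incomplete generation' error every
    subsequent request times out, as described in this issue)."""

    def __init__(self, window: int = 5, failure_threshold: int = 3):
        self.failure_threshold = failure_threshold
        # True = success, False = timeout/error; bounded history.
        self.outcomes = deque(maxlen=window)

    def record(self, success: bool) -> None:
        self.outcomes.append(success)

    def should_restart(self) -> bool:
        # Restart only after `failure_threshold` consecutive failures,
        # so a single transient error does not trigger a restart.
        recent = list(self.outcomes)[-self.failure_threshold:]
        return len(recent) == self.failure_threshold and not any(recent)
```

A single failure keeps the watchdog quiet; only a run of consecutive timeouts (the pattern reported here) would trip it.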

Example Logs:

Success
{
    "timestamp": "2025-02-21T01:15:43.569760Z",
    "level": "INFO",
    "message": "Success",
    "target": "text_generation_router::server",
    "filename": "router/src/server.rs",
    "line_number": 407,
    "span": {
        "inference_time": "286.497811ms",
        "parameters": "GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(1), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None, adapter_id: None }",
        "queue_time": "1.276254517s",
        "seed": "None",
        "time_per_token": "286.497811ms",
        "total_time": "1.563136309s",
        "validation_time": "384.15µs",
        "name": "generate"
    },
    "spans": [
        {
            "inference_time": "286.497811ms",
            "parameters": "GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(1), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None, adapter_id: None }",
            "queue_time": "1.276254517s",
            "seed": "None",
            "time_per_token": "286.497811ms",
            "total_time": "1.563136309s",
            "validation_time": "384.15µs",
            "name": "generate"
        }
    ]
}
Custom attributes (the response headers returned by TGI, printed from the calling app):
2025/02/21 01:15:43 custom attributes: {
    "X-Compute-Characters": [
        "11"
    ],
    "X-Compute-Time": [
        "1.563136309"
    ],
    "X-Compute-Type": [
        "8-nvidia-a100-sxm4-40gb"
    ],
    "X-Generated-Tokens": [
        "1"
    ],
    "X-Inference-Time": [
        "286"
    ],
    "X-Prompt-Tokens": [
        "3"
    ],
    "X-Queue-Time": [
        "1276"
    ],
    "X-Time-Per-Token": [
        "286"
    ],
    "X-Total-Time": [
        "1563"
    ],
    "X-Validation-Time": [
        "0"
    ]
}
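For monitoring, the header dump above can be flattened into plain numbers (queue time, inference time, etc.). A minimal sketch, assuming the single-element-list shape shown in the dump; the function name is made up for illustration:

```python
def parse_tgi_headers(attrs: dict[str, list[str]]) -> dict[str, float]:
    """Convert TGI's X-* response headers (each value wrapped in a
    single-element list, as in the dump above) into plain numbers.
    Non-numeric values such as X-Compute-Type are skipped."""
    parsed: dict[str, float] = {}
    for key, values in attrs.items():
        if not values:
            continue
        try:
            parsed[key] = int(values[0])
        except ValueError:
            try:
                parsed[key] = float(values[0])
            except ValueError:
                pass  # e.g. X-Compute-Type: "8-nvidia-a100-sxm4-40gb"
    return parsed
```

The numeric fields can then be shipped to whatever metrics backend the calling app uses, making a sudden spike in queue time (or a stream of failed requests) visible before a manual restart is needed.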
Incomplete generation
{
    "timestamp": "2025-02-21T01:15:43.600876Z",
    "level": "ERROR",
    "message": "Incomplete generation",
    "target": "text_generation_router::infer",
    "filename": "router/src/infer/mod.rs",
    "line_number": 246,
    "span": {
        "name": "generate"
    },
    "spans": [
        {
            "parameters": "GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(800), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None, adapter_id: None }",
            "name": "generate"
        },
        {
            "name": "generate"
        }
    ]
}
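Until the root cause is found, the failure can at least be detected automatically from the structured logs. A minimal sketch, assuming the JSON-lines log shape shown above (one record per line); the function name is made up for illustration:

```python
import json


def find_incomplete_generations(log_lines):
    """Yield parsed log records whose level is ERROR and whose message
    is TGI's 'Incomplete generation', using the JSON log shape shown
    in this issue. Non-JSON lines are skipped."""
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if (record.get("level") == "ERROR"
                and record.get("message") == "Incomplete generation"):
            yield record
```

Wired to a log tail (or a CloudWatch subscription on the SageMaker endpoint's log group), this could feed the restart decision instead of waiting for downstream timeouts.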

Expected behavior

Incomplete generations should not block the entire model; the endpoint should keep serving subsequent requests without requiring a restart.

mwm5945 · Feb 21 '25, 21:02