Infinite Reaper Loop with Sequence Batcher.

Open tjad opened this issue 3 months ago • 12 comments

Description At some point in time the sequence batcher goes into an infinite loop trying to clear stale sequence IDs. This is a critical/fatal issue that prevents the model from doing any further processing.

Other models that also use the sequence batcher still seem to work fine; only this single model's sequence batcher instance is affected.

This bug also existed in previous versions of Triton, except that it would send the main Triton process into an infinite loop (using 100% CPU) and prevent all models from working. Triton seems better now in that the problem is confined to a single model and does not stop other models from continuing to process.

We have multiple models running with the sequence batcher. One of the three models is affected in isolation: its sequence batcher enters this loop and all further processing for that model halts.

I1009 03:39:04.357965 42 sequence_batch_scheduler.cc:1079] "Reaper: CORRID 15: max sequence idle exceeded"                                                                                                                                                 │
│ I1009 03:39:04.357969 42 sequence_batch_scheduler.cc:1101] "Reaper: found idle CORRID 15"                                                                                                                                                                  │
│ I1009 03:39:04.357972 42 sequence_batch_scheduler.cc:1079] "Reaper: CORRID 41: max sequence idle exceeded"                                                                                                                                                 │
│ I1009 03:39:04.357977 42 sequence_batch_scheduler.cc:1101] "Reaper: found idle CORRID 41"                                                                                                                                                                  │
│ I1009 03:39:04.357980 42 sequence_batch_scheduler.cc:1079] "Reaper: CORRID 13: max sequence idle exceeded"                                                                                                                                                 │
│ I1009 03:39:04.357984 42 sequence_batch_scheduler.cc:1101] "Reaper: found idle CORRID 13"                                                                                                                                                                  │
│ I1009 03:39:04.357988 42 sequence_batch_scheduler.cc:1079] "Reaper: CORRID 39: max sequence idle exceeded"                                                                                                                                                 │
│ I1009 03:39:04.357992 42 sequence_batch_scheduler.cc:1101] "Reaper: found idle CORRID 39"                                                                                                                                                                  │
│ I1009 03:39:04.357996 42 sequence_batch_scheduler.cc:1079] "Reaper: CORRID 11: max sequence idle exceeded"                                                                                                                                                 │
│ I1009 03:39:04.358000 42 sequence_batch_scheduler.cc:1101] "Reaper: found idle CORRID 11"                                                                                                                                                                  │
│ I1009 03:39:04.358004 42 sequence_batch_scheduler.cc:1079] "Reaper: CORRID 37: max sequence idle exceeded"                                                                                                                                                 │
│ I1009 03:39:04.358008 42 sequence_batch_scheduler.cc:1101] "Reaper: found idle CORRID 37"                                                                                                                                                                  │
│ I1009 03:39:04.408104 42 sequence_batch_scheduler.cc:1079] "Reaper: CORRID 17: max sequence idle exceeded"                                                                                                                                                 │
│ I1009 03:39:04.408124 42 sequence_batch_scheduler.cc:1101] "Reaper: found idle CORRID 17"                                                                                                                                                                  │
│ I1009 03:39:04.408128 42 sequence_batch_scheduler.cc:1079] "Reaper: CORRID 9: max sequence idle exceeded"                                                                                                                                                  │
│ I1009 03:39:04.408132 42 sequence_batch_scheduler.cc:1101] "Reaper: found idle CORRID 9"                                                                                                                                                                   │
│ I1009 03:39:04.408136 42 sequence_batch_scheduler.cc:1079] "Reaper: CORRID 5: max sequence idle exceeded"                                                                                                                                                  │
│ I1009 03:39:04.408140 42 sequence_batch_scheduler.cc:1101] "Reaper: found idle CORRID 5"                                                                                                                                                                   │
│ I1009 03:39:04.408144 42 sequence_batch_scheduler.cc:1079] "Reaper: CORRID 3: max sequence idle exceeded"                                                                                                                                                  │
│ I1009 03:39:04.408148 42 sequence_batch_scheduler.cc:1101] "Reaper: found idle CORRID 3"                                                                                                                                                                   │
│ I1009 03:39:04.408152 42 sequence_batch_scheduler.cc:1079] "Reaper: CORRID 1: max sequence idle exceeded"                                                                                                                                                  │
│ I1009 03:39:04.408156 42 sequence_batch_scheduler.cc:1101] "Reaper: found idle CORRID 1"                                                                                                                                                                   │
│ I1009 03:39:04.408160 42 sequence_batch_scheduler.cc:1079] "Reaper: CORRID 15: max sequence idle exceeded"                                                                                                                                                 │
│ I1009 03:39:04.408163 42 sequence_batch_scheduler.cc:1101] "Reaper: found idle CORRID 15"                                                                                                                                                                  │
│ I1009 03:39:04.408167 42 sequence_batch_scheduler.cc:1079] "Reaper: CORRID 41: max sequence idle exceeded"                                                                                                                                                 │
│ I1009 03:39:04.408171 42 sequence_batch_scheduler.cc:1101] "Reaper: found idle CORRID 41"                                                                                                                                                                  │
│ I1009 03:39:04.408175 42 sequence_batch_scheduler.cc:1079] "Reaper: CORRID 13: max sequence idle exceeded"                                                                                                                                                 │
│ I1009 03:39:04.408200 42 sequence_batch_scheduler.cc:1101] "Reaper: found idle CORRID 13"                                                                                                                                                                  │
│ I1009 03:39:04.408205 42 sequence_batch_scheduler.cc:1079] "Reaper: CORRID 39: max sequence idle exceeded"                                                                                                                                                 │
│ I1009 03:39:04.408209 42 sequence_batch_scheduler.cc:1101] "Reaper: found idle CORRID 39"                                                                                                                                                                  │
│ I1009 03:39:04.408213 42 sequence_batch_scheduler.cc:1079] "Reaper: CORRID 11: max sequence idle exceeded"                                                                                                                                                 │
│ I1009 03:39:04.408217 42 sequence_batch_scheduler.cc:1101] "Reaper: found idle CORRID 11"                                                                                                                                                                  │
│ I1009 03:39:04.408220 42 sequence_batch_scheduler.cc:1079] "Reaper: CORRID 37: max sequence idle exceeded"                                                                                                                                                 │
│ I1009 03:39:04.408225 42 sequence_batch_scheduler.cc:1101] "Reaper: found idle CORRID 37"                                                                                                                                                                  │
│ I1009 03:39:04.458325 42 sequence_batch_scheduler.cc:1079] "Reaper: CORRID 17: max sequence idle exceeded"                                                                                                                                                 │
│ I1009 03:39:04.458344 42 sequence_batch_scheduler.cc:1101] "Reaper: found idle CORRID 17"                                                                                                                                                                  │
│ I1009 03:39:04.458349 42 sequence_batch_scheduler.cc:1079] "Reaper: CORRID 9: max sequence idle exceeded"                                                                                                                                                  │
│ I1009 03:39:04.458354 42 sequence_batch_scheduler.cc:1101] "Reaper: found idle CORRID 9"                                                                                                                                                                   │
│ I1009 03:39:04.458357 42 sequence_batch_scheduler.cc:1079] "Reaper: CORRID 5: max sequence idle exceeded"                                                                                                                                                  │
│ I1009 03:39:04.458361 42 sequence_batch_scheduler.cc:1101] "Reaper: found idle CORRID 5"                                                                                                                                                                   │
│ I1009 03:39:04.458365 42 sequence_batch_scheduler.cc:1079] "Reaper: CORRID 3: max sequence idle exceeded"                                                                                                                                                  │
│ I1009 03:39:04.458369 42 sequence_batch_scheduler.cc:1101] "Reaper: found idle CORRID 3"                                                                                                                                                                   │
│ I1009 03:39:04.458373 42 sequence_batch_scheduler.cc:1079] "Reaper: CORRID 1: max sequence idle exceeded"                                                                                                                                                  │
│ I1009 03:39:04.458376 42 sequence_batch_scheduler.cc:1101] "Reaper: found idle CORRID 1"                                                                                                                                                                   │
│ I1009 03:39:04.458380 42 sequence_batch_scheduler.cc:1079] "Reaper: CORRID 15: max sequence idle exceeded"                                                                                                                                                 │
│ I1009 03:39:04.458384 42 sequence_batch_scheduler.cc:1101] "Reaper: found idle CORRID 15"                                                                                                                                                                  │
│ I1009 03:39:04.458388 42 sequence_batch_scheduler.cc:1079] "Reaper: CORRID 41: max sequence idle exceeded"                                                                                                                                                 │
│ I1009 03:39:04.458392 42 sequence_batch_scheduler.cc:1101] "Reaper: found idle CORRID 41"                                                                                                                                                                  │
│ I1009 03:39:04.458396 42 sequence_batch_scheduler.cc:1079] "Reaper: CORRID 13: max sequence idle exceeded"                                                                                                                                                 │
│ I1009 03:39:04.458400 42 sequence_batch_scheduler.cc:1101] "Reaper: found idle CORRID 13"                                                                                                                                                                  │
│ I1009 03:39:04.458404 42 sequence_batch_scheduler.cc:1079] "Reaper: CORRID 39: max sequence idle exceeded"                                                                                                                                                 │
│ I1009 03:39:04.458408 42 sequence_batch_scheduler.cc:1101] "Reaper: found idle CORRID 39"                                                                                                                                                                  │
│ I1009 03:39:04.458412 42 sequence_batch_scheduler.cc:1079] "Reaper: CORRID 11: max sequence idle exceeded"                                                                                                                                                 │
│ I1009 03:39:04.458416 42 sequence_batch_scheduler.cc:1101] "Reaper: found idle CORRID 11"                                                                                                                                                                  │
│ I1009 03:39:04.458420 42 sequence_batch_scheduler.cc:1079] "Reaper: CORRID 37: max sequence idle exceeded"                                                                                                                                                 │
│ I1009 03:39:04.458424 42 sequence_batch_scheduler.cc:1101] "Reaper: found idle CORRID 37" 
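
One way to quantify the repetition in a longer capture is to count how often each CORRID is reported and how far apart the reaper passes are. The following is a rough sketch, not a Triton tool: the function name `summarize_reaper` is made up for illustration, the line format is inferred only from the excerpt above, and the 10 ms threshold used to split passes is an assumption.

```python
import re
from collections import Counter

# Matches the "found idle" lines in the excerpt above. The prefix layout
# (severity+date, time, thread id, file:line) is inferred from that excerpt.
FOUND_IDLE = re.compile(
    r'I\d{4} (\d{2}):(\d{2}):(\d{2})\.(\d{6}) \d+ '
    r'sequence_batch_scheduler\.cc:\d+\] "Reaper: found idle CORRID (\d+)"'
)


def summarize_reaper(lines):
    """Return (reports per CORRID, gaps in ms between successive passes)."""
    counts = Counter()
    pass_starts = []
    last_us = None
    for line in lines:
        m = FOUND_IDLE.search(line)
        if m is None:
            continue  # skip "max sequence idle exceeded" and unrelated lines
        h, mi, s, us, corrid = (int(g) for g in m.groups())
        t_us = ((h * 60 + mi) * 60 + s) * 1_000_000 + us
        # Assumption: a jump of more than 10 ms marks a new reaper pass.
        if last_us is None or t_us - last_us > 10_000:
            pass_starts.append(t_us)
        last_us = t_us
        counts[corrid] += 1
    gaps_ms = [(b - a) / 1000 for a, b in zip(pass_starts, pass_starts[1:])]
    return counts, gaps_ms
```

In the excerpt above, CORRID 15 is reported at 03:39:04.357969, again at 03:39:04.408163, and again at 03:39:04.458384, so the same ID is found idle on passes roughly 50 ms apart.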

Triton Information 2.59 from container (nvcr.io/nvidia/tritonserver:25.07-py3)

Running on GCP with 2x L4 GPU. Each model is deployed to 1 GPU only - no models are duplicated either on a single GPU or across GPU.

Are you using the Triton container or did you build it yourself? nvcr.io/nvidia/tritonserver:25.07-py3

To Reproduce Not sure. I suspect the END sequence is not sent. This happens intermittently and is difficult to reproduce in different environments using the identical image. Eventually it results in this state, which suggests some sort of race condition.

I have tried to reproduce the issue by sending and not sending the various sequence batcher controls, END, START, or omitting them. We only use START/END.

Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well).

sequence_batching{
    oldest{
      max_queue_delay_microseconds: 10000
    }
    max_sequence_idle_microseconds: 5000000
    control_input [
        {
            name: "START",
            control [
                {
                    kind: CONTROL_SEQUENCE_START
                    fp32_false_true: [0, 1]
                }
            ]
        },
        {
            name: "READY"
            control [
                {
                    kind: CONTROL_SEQUENCE_READY
                    fp32_false_true: [0, 1]
                }
            ]
        },
        {
            name: "CORRID",
            control [
                {
                    kind: CONTROL_SEQUENCE_CORRID
                    data_type: TYPE_UINT64
                }
            ]
        },
        {
            name: "END",
            control [
                {
                    kind: CONTROL_SEQUENCE_END
                    fp32_false_true: [0, 1]
                }
            ]
        }
    ]
}

Expected behavior Once a CORRID is reaped, the reaper should not need to reap that CORRID again, so that subsequent sequences can be processed.
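
The expectation can be stated as a small invariant. The sketch below is a toy reaper and not Triton's implementation: the class `ToyReaper` and its methods are hypothetical, and the 5-second limit simply mirrors max_sequence_idle_microseconds from the configuration above. It only illustrates that a reaped CORRID is forgotten, so a later pass has nothing left to report for it.

```python
class ToyReaper:
    """Toy idle reaper showing the expected invariant (not Triton code)."""

    def __init__(self, max_idle_us):
        self.max_idle_us = max_idle_us
        self.last_active_us = {}  # CORRID -> time of its most recent request

    def touch(self, corrid, now_us):
        """Record activity for a CORRID."""
        self.last_active_us[corrid] = now_us

    def reap(self, now_us):
        """Remove and return every CORRID idle for longer than max_idle_us."""
        idle = [c for c, t in self.last_active_us.items()
                if now_us - t > self.max_idle_us]
        for c in idle:
            # Forgetting the CORRID is what stops it being reported again.
            del self.last_active_us[c]
        return idle


reaper = ToyReaper(max_idle_us=5_000_000)
reaper.touch(15, now_us=0)
first_pass = reaper.reap(now_us=6_000_000)   # [15]
second_pass = reaper.reap(now_us=6_050_000)  # [] (nothing left to reap)
```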

tjad avatar Oct 09 '25 03:10 tjad

Not sure if it helps, but I have now asked gpt/copilot to analyse the code. It thinks there is indeed a race condition. This is perhaps worth reviewing?

It checked an older version at first - related to (23.05 image) https://copilot.microsoft.com/shares/vCPJukJKCHs83HqQAP95J

But then I got it to check the version 2.59 code, and it asserts that a race condition still exists. https://copilot.microsoft.com/shares/YBW2LHFQtBkwMRCHxoPgo

I know it could be misinterpreting, but could be worth having someone confirm this - if I get time, I will review the code too to confirm this.

tjad avatar Oct 09 '25 04:10 tjad

This indeed looks like a valid bug. I will assign a Triton engineer to investigate further.

dzier avatar Oct 14 '25 18:10 dzier

Perhaps I took too long, but neither of the shared links resolves correctly for me. 😔

We'll take a look regardless. Thank you for bringing this to our attention.

whoisj avatar Oct 14 '25 19:10 whoisj

@tjad, the repeated "Reaper: found idle CORRID" messages indicate that sequences are waiting in the backlog for available slots. This is expected behavior related to the backlog queue, which holds sequences until slots become available. Slots free up when active sequences either complete or time out after the max_sequence_idle_microseconds defined in your configuration (5 seconds), whichever occurs first. At that point, backlog requests are assigned a sequence slot.
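
As a rough illustration of the slot and backlog behavior described here, the following toy model may help. It is a sketch under stated assumptions, not the scheduler's actual code: the class `ToySlots` and its methods are hypothetical, and `release` stands in for either a sequence completing or its idle timeout expiring.

```python
from collections import deque


class ToySlots:
    """Toy model of sequence slots plus a backlog (not Triton's scheduler)."""

    def __init__(self, num_slots):
        self.free = num_slots
        self.active = set()      # CORRIDs currently holding a slot
        self.backlog = deque()   # CORRIDs waiting for a slot, oldest first

    def start(self, corrid):
        """A new sequence takes a free slot, or waits in the backlog."""
        if self.free > 0:
            self.free -= 1
            self.active.add(corrid)
        else:
            self.backlog.append(corrid)

    def release(self, corrid):
        """A sequence ended or timed out; hand its slot to the oldest waiter."""
        self.active.remove(corrid)
        if self.backlog:
            self.active.add(self.backlog.popleft())
        else:
            self.free += 1
```

With two slots, starting sequences 1, 2, and 3 leaves 3 in the backlog until 1 or 2 is released, at which point 3 takes over the freed slot.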

To improve this situation or reduce the frequency of backlog increments, consider the following recommendations:

  • Set a reasonable client-side timeout on inference requests to minimize longer waiting times and handle timeout errors gracefully.
  • Increase the max_batch_size and/or add more model instances, depending on the available resources, to create additional sequence slots.
  • Ensure that sequences send END flags correctly to free up slots promptly.
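
For the last recommendation, a minimal client-side sketch is shown below. The helper `run_sequence` and the `send` callable are hypothetical names used only for illustration; with the Triton Python client, `send` would typically wrap a call to `infer(...)` that passes `sequence_id`, `sequence_start`, and `sequence_end`. The point is only that the first request carries START and the last carries END, including when the sequence has a single request.

```python
def run_sequence(send, corrid, payloads):
    """Send payloads as one sequence: first flagged START, last flagged END.

    `send` is a hypothetical callable: send(payload, corrid, start=..., end=...).
    """
    payloads = list(payloads)
    if not payloads:
        raise ValueError("a sequence needs at least one request")
    last = len(payloads) - 1
    results = []
    for i, payload in enumerate(payloads):
        # A one-request sequence gets both flags on the same request.
        results.append(send(payload, corrid, start=(i == 0), end=(i == last)))
    return results
```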

The frequent checking by the reaper thread (~50 milliseconds) should not block other models. Please try the recommendations above, and if the issue persists, we can explore further optimizations to manage the backlog.

While technically, the repeated checking and logging could be improved (as backlog sequences are "waiting" rather than "idle"), the provided workarounds are valid and should help mitigate the problem.

cc: @yinggeh, @GuanLuo, @tanmayv25

pskiran1 avatar Nov 06 '25 19:11 pskiran1

Thank you for taking a look 🙏 Yes, I understand this looks like normal behavior; however, I assure you the infinite looping should not happen. It uses 100% of a CPU forever and never stops, even after no new requests come to the server. It also prevents any models from being usable.

Here is a branch I have been working on that helps with this problem. However, it doesn't fully resolve the issue; there is something deeper going on.

https://github.com/tjad/triton-core/tree/fix/sequence_batcher

I also found similar behaviour in the Dynamic batcher, and here is a branch I am working on for fixes.

https://github.com/tjad/triton-core/tree/fix/dynamic_batcher

Unfortunately, this issue is extremely difficult to reproduce. I am trying to create a test case that reproduces it.

I will update here when I am satisfied that the full issue has been resolved. Then we can fully isolate the cause/patch for this.

tjad avatar Nov 11 '25 08:11 tjad

If it helps to understand, this issue also seems to have nothing to do with load. We have stress tested Triton for 40-50 hours non-stop with high load, and never saw the issue occur. Yet, in our production environment, the issue can occur when there is 1/50th of the load (like 2 or 3 requests). So there is something else at hand here.

tjad avatar Nov 11 '25 09:11 tjad

And as a note, my stance is that this is a bug inside Triton. Triton should be robust enough not to lock up in an infinite loop (especially one iterating over the same CORRID forever, in fractions of a millisecond).

tjad avatar Nov 11 '25 09:11 tjad

@tjad, @pskiran1 brought this up internally and we've agreed to find cycles to investigate this further. We'll update you as progress is made. Thanks again for bringing this to our attention.

whoisj avatar Nov 14 '25 22:11 whoisj

Thank you very much for assisting @whoisj @pskiran1 🙏 . I will also spend some time again soon (hopefully toward end of next week - if not sooner). Yes, we do still see this in production, and I do need to get to the bottom of it - so am happy to assist where I can.

tjad avatar Nov 20 '25 14:11 tjad

Hi @whoisj, @pskiran1. I have been running basic concurrency tests on Triton outside the context of our production environment, and have not (yet) been able to reproduce what we observe in production. It makes me suspect that our model implementation may be problematic.

However, what remains unclear to me is why we observe the Triton process at 100% CPU utilization when the model runs in a separate process (Python backend) that is idle (no utilization), in conjunction with the logs and findings I provided above.

I am making this my top priority over the coming days. If I can't reproduce this outside the production environment, I think it is reasonable to close this "bug", until it is actually reproducible. I will try to update this thread within 3 days.

For clarity, I am currently running tests against vanilla 25.07 image (2.59.1)

tjad avatar Nov 29 '25 01:11 tjad

@tjad, for all backends except the Python Backend, model processing will contribute to Triton's CPU utilization.

The difference for the Python Backend is that each "Python thread" is a separate process with cooperation managed by Triton.

Does this help shed any light?

whoisj avatar Dec 02 '25 18:12 whoisj

Hi @tjad, Could you please let us know if you were able to reproduce the issue? We recommend upgrading to the latest Triton version 25.11 to avoid any issues that have already been fixed.

pskiran1 avatar Dec 08 '25 13:12 pskiran1

@tjad, we are closing this issue as we have not received any updates in over two weeks. Please feel free to reopen it if needed. Thank you.

pskiran1 avatar Dec 16 '25 13:12 pskiran1