
chore: Refactor return of first gen token in PD

Open · Shunkangz opened this pull request 9 months ago · 19 comments

In the current implementation, the first and second generated tokens are returned together from the generation worker. This PR refactors that logic so that the context worker returns the first generated token as soon as it finishes its computation.
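The change can be illustrated with a minimal sketch. This is not the actual TensorRT-LLM implementation; all names, data structures, and the toy "sampling" logic below are hypothetical stand-ins, kept only to show how the first token moves from the generation worker's response to the context worker's response.

```python
# Hypothetical sketch of the refactor: in disaggregated (prefill/decode,
# "PD") serving, the context worker already produces the first token
# during prefill. Before the change, that token was shipped along with
# the request and emitted by the generation worker together with the
# second token; after the change, the context worker streams it back
# as soon as prefill completes.
from dataclasses import dataclass, field


@dataclass
class Response:
    request_id: int
    tokens: list = field(default_factory=list)


def context_worker(request_id: int, prompt: list) -> tuple[Response, int]:
    """Prefill: build the KV cache and sample the first token."""
    first_token = sum(prompt) % 1000  # stand-in for real sampling
    # Refactored behavior: return the first token immediately,
    # instead of deferring it to the generation worker.
    return Response(request_id, [first_token]), first_token


def generation_worker(request_id: int, first_token: int, steps: int) -> list:
    """Decode: continue from the first token; it no longer re-emits it."""
    out = []
    tok = first_token
    for _ in range(steps):
        tok = (tok + 1) % 1000  # stand-in for one decode step
        out.append(tok)
    return out


# Before: the client sees nothing until the generation worker returns
# [first_token, second_token, ...]. After: the client sees first_token
# as soon as prefill finishes, which lowers time-to-first-token.
early, first = context_worker(0, [1, 2, 3])
rest = generation_worker(0, first, steps=3)
print(early.tokens + rest)  # first token followed by the decode tokens
```

The observable token stream is unchanged; only which worker emits the first token (and when) differs, which is why the discussion below concludes the existing integration tests can be reused.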

Shunkangz avatar Mar 23 '25 14:03 Shunkangz

/bot run

Shunkangz avatar Mar 23 '25 14:03 Shunkangz

@chuangz0 @xiaoweiw-nv @pcastonguay can you help review this PD-related MR?

Thanks,
June

juney-nvidia avatar Mar 23 '25 22:03 juney-nvidia

/bot run

Shunkangz avatar Mar 24 '25 08:03 Shunkangz

/bot run

Shixiaowei02 avatar Mar 24 '25 08:03 Shixiaowei02

/bot run

Shixiaowei02 avatar Mar 24 '25 08:03 Shixiaowei02

PR_Github #270 [ run ] triggered by Bot

niukuo avatar Mar 24 '25 08:03 niukuo

PR_Github #270 [ run ] completed with state FAILURE /LLM/main/L0_MergeRequest_PR pipeline #260 completed with status: 'FAILURE'

niukuo avatar Mar 24 '25 12:03 niukuo

Can we add a description? What is this PR fixing? Can we add unit or integration tests to verify what this PR is fixing? See existing integration tests in https://github.com/NVIDIA/TensorRT-LLM/blob/main/tests/integration/defs/disaggregated/test_disaggregated.py.

pcastonguay avatar Mar 24 '25 13:03 pcastonguay

> Can we add a description? What is this PR fixing? Can we add unit or integration tests to verify what this PR is fixing? See existing integration tests in https://github.com/NVIDIA/TensorRT-LLM/blob/main/tests/integration/defs/disaggregated/test_disaggregated.py.

I think this MR does not change the expected output of PD, so we can reuse the existing integration tests. Does that make sense to you? @pcastonguay

Shunkangz avatar Mar 25 '25 13:03 Shunkangz

/bot run

Shunkangz avatar Mar 25 '25 14:03 Shunkangz

PR_Github #444 [ run ] triggered by Bot

niukuo avatar Mar 25 '25 14:03 niukuo

PR_Github #444 [ run ] completed with state FAILURE /LLM/main/L0_MergeRequest_PR pipeline #379 completed with status: 'FAILURE'

niukuo avatar Mar 25 '25 14:03 niukuo

> > Can we add a description? What is this PR fixing? Can we add unit or integration tests to verify what this PR is fixing? See existing integration tests in https://github.com/NVIDIA/TensorRT-LLM/blob/main/tests/integration/defs/disaggregated/test_disaggregated.py.
>
> I think this MR does not change the expected output of PD, so we can reuse the existing integration tests. Does that make sense to you? @pcastonguay

Yes, thanks.

pcastonguay avatar Mar 25 '25 18:03 pcastonguay

/bot run

Shunkangz avatar Mar 26 '25 06:03 Shunkangz

PR_Github #529 [ run ] triggered by Bot

niukuo avatar Mar 26 '25 06:03 niukuo

PR_Github #529 [ run ] completed with state FAILURE /LLM/main/L0_MergeRequest_PR pipeline #452 completed with status: 'FAILURE'

niukuo avatar Mar 26 '25 07:03 niukuo

/bot run

Shunkangz avatar Mar 28 '25 03:03 Shunkangz

PR_Github #666 [ run ] triggered by Bot

tensorrt-cicd avatar Mar 28 '25 03:03 tensorrt-cicd

PR_Github #666 [ run ] completed with state SUCCESS /LLM/main/L0_MergeRequest_PR pipeline #560 completed with status: 'SUCCESS'

tensorrt-cicd avatar Mar 28 '25 06:03 tensorrt-cicd

/bot reuse-pipeline

Shunkangz avatar Apr 01 '25 04:04 Shunkangz

PR_Github #840 [ reuse-pipeline ] triggered by Bot

tensorrt-cicd avatar Apr 01 '25 04:04 tensorrt-cicd

PR_Github #840 [ reuse-pipeline ] completed with state SUCCESS Reusing PR_Github #666 for commit 613ec08

tensorrt-cicd avatar Apr 01 '25 04:04 tensorrt-cicd