Jinyang Yuan issues

Results: 5 issues of Jinyang Yuan

perf: Enable CUDA graphs when attention DP is used and active requests on different GPUs are uneven

This PR modifies the dummy-request code so that CUDA graphs can be used when attention DP is enabled and the number of active requests differs across GPUs.
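The summary does not say how the dummy requests are used. One plausible reading, sketched below purely as an assumption, is that each attention-DP rank is padded with dummy requests up to a common batch size so that every rank runs the same captured graph shape. The function name `dummy_requests_per_rank` and the "at least one request per rank" rule are hypothetical and are not taken from the PR.

```python
from typing import List


def dummy_requests_per_rank(active: List[int]) -> List[int]:
    """Return how many dummy requests each rank would need (hypothetical helper).

    Assumption: all ranks are padded to the largest active-request count,
    and to at least 1 so that fully idle ranks still take part in the step.
    """
    if not active:
        return []
    if any(n < 0 for n in active):
        raise ValueError("active request counts must be non-negative")
    target = max(max(active), 1)
    return [target - n for n in active]


if __name__ == "__main__":
    # Three ranks with uneven load: 3, 0 and 1 active requests.
    print(dummy_requests_per_rank([3, 0, 1]))
```

With this padding rule every rank ends up with the same batch size, which is the property a fixed-shape CUDA graph would need; whether the PR does exactly this is not confirmed by the summary.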

feat: Optionally split MoE inputs into chunks to reduce GPU memory usage

If `max_num_tokens` is large and attention DP is enabled on a relatively large number of GPUs, the MoE workspace becomes very large and causes out-of-memory (OOM) errors. This MR...
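The general idea of chunking can be illustrated with a minimal, framework-free sketch: the input tokens are processed in slices of bounded size, so any per-call workspace scales with the chunk size rather than with the full token count. The names `run_moe_in_chunks`, `moe_fn`, and `max_chunk_tokens` are hypothetical and do not come from the MR, and `toy_moe` is only a stand-in for a real MoE layer.

```python
from typing import Callable, List, Sequence


def run_moe_in_chunks(
    tokens: Sequence[float],
    moe_fn: Callable[[Sequence[float]], List[float]],
    max_chunk_tokens: int,
) -> List[float]:
    """Apply moe_fn to tokens in slices of at most max_chunk_tokens.

    Assumption: moe_fn treats tokens independently, so concatenating the
    per-chunk outputs equals the output of a single call on all tokens.
    """
    if max_chunk_tokens <= 0:
        raise ValueError("max_chunk_tokens must be positive")
    outputs: List[float] = []
    for start in range(0, len(tokens), max_chunk_tokens):
        chunk = tokens[start:start + max_chunk_tokens]
        outputs.extend(moe_fn(chunk))  # workspace is bounded by the chunk size
    return outputs


def toy_moe(chunk: Sequence[float]) -> List[float]:
    """Stand-in for an MoE layer: doubles every token value."""
    return [2.0 * x for x in chunk]


if __name__ == "__main__":
    print(run_moe_in_chunks([1.0, 2.0, 3.0, 4.0, 5.0], toy_moe, max_chunk_tokens=2))
```

The trade-off in a scheme like this is more, smaller calls in exchange for a lower peak memory footprint.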

fix: Fix an error related to dummy request when MTP is used

The error is fixed by setting `max_num_draft_tokens` when creating dummy requests.

perf: Use pinned H2D to reduce bubbles

In some cases, pageable host-to-device (H2D) copies are followed by `cudaStreamSynchronize` calls, which block the CPU from launching further kernels. This problem can be solved by changing the pageable H2D copies to pinned H2D copies.

[perf] Reduce the workspace size of FP4 activation scales for MoE

The first two dimensions of the original FP4 activation scales are merged to eliminate unnecessary storage. Appropriate padding is added when merging these two dimensions in consideration of the...