Jinyang Yuan

Results: 5 issues by Jinyang Yuan

This PR modifies the dummy-request code so that CUDA graphs can be used when attention DP is enabled and the number of active requests differs across GPUs.
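As a rough illustration of the idea (not the PR's actual code), the sketch below pads every rank up to a common batch size with dummy requests so that all ranks can replay the same captured graph. The function name `dummy_requests_needed`, the per-rank count list, and the list of captured graph batch sizes are all hypothetical, introduced only for this example.

```python
from bisect import bisect_left


def dummy_requests_needed(active_per_rank, graph_batch_sizes):
    """Return the number of dummy requests each rank must add.

    Hypothetical helper: every rank is padded to the smallest captured
    graph batch size that fits the busiest rank. Returns None when no
    captured size is large enough (the caller would then skip graphs).
    """
    sizes = sorted(graph_batch_sizes)
    largest = max(active_per_rank)
    idx = bisect_left(sizes, largest)  # first captured size >= largest
    if idx == len(sizes):
        return None
    target = sizes[idx]
    return [target - n for n in active_per_rank]


# Four ranks with uneven load, graphs captured for batch sizes 1, 2, 4, 8.
padding = dummy_requests_needed([3, 0, 5, 1], [1, 2, 4, 8])
print(padding)  # [5, 8, 3, 7]
```

In this sketch an idle rank still receives dummy requests, so it runs the same graph as its peers instead of stalling them.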

If `max_num_tokens` is large and attention DP is enabled across a relatively large number of GPUs, the MoE workspace becomes very large and causes an out-of-memory (OOM) error. This MR...

The error is fixed by setting `max_num_draft_tokens` when creating dummy requests.

In some cases, pageable host-to-device (H2D) copies are followed by `cudaStreamSynchronize` calls, which block kernel launches on the CPU. This problem can be solved by switching those copies from pageable to pinned host memory.
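The difference can be sketched in PyTorch, which is an assumption here since the summary does not name the framework. The helper `h2d_pinned` is hypothetical; `Tensor.pin_memory()` and `Tensor.to(..., non_blocking=True)` are standard PyTorch calls. The demo only runs when a CUDA device is present.

```python
try:
    import torch
except ImportError:  # keep the sketch importable where PyTorch is absent
    torch = None


def h2d_pinned(host_tensor, device="cuda"):
    """Copy a CPU tensor to the GPU through page-locked (pinned) memory.

    Hypothetical helper. With a pinned source, non_blocking=True lets the
    copy be enqueued on the stream without stalling the host thread.
    """
    pinned = host_tensor.pin_memory()            # page-locked staging copy
    return pinned.to(device, non_blocking=True)  # asynchronous w.r.t. the host


if torch is not None and torch.cuda.is_available():
    x = torch.arange(8, dtype=torch.int32)
    y = h2d_pinned(x)
    torch.cuda.synchronize()  # wait for the copy before reading the result
    assert torch.equal(y.cpu(), x)
```

Because `pin_memory()` itself makes a host-side copy, code on a hot path would more likely allocate a pinned buffer once (for example with `torch.empty(..., pin_memory=True)`) and reuse it.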

The first two dimensions of the original FP4 activation scales are merged to eliminate unnecessary storage. Appropriate padding is added when merging these two dimensions, in consideration of the...