xu-yfei

Results 2 issues of xu-yfei

## Motivation This PR is for dp mla #5001. About dp mla: on an 8*H20(96GB), weight mem usage=87.19 GB when `--dp-size 4 --enable-dp-attention` is set, so not enough memory is left. This optimization is...

## Motivation Based on dp_mla_kernel PR #5000. **Description**: On an 8*H20(96GB), weight mem usage=87.19 GB when `--dp-size 4 --enable-dp-attention` is set, so not enough memory is left. This optimization is similar to data parallelism...