Ke Bao

60 comments by Ke Bao

Hi @nvcastet , pynccl cannot be removed since it's used for the cu118 environment. PyTorch cu118 installs `nvidia-nccl-cu11`, which ships the CUDA 11.0 build of NCCL by default. But cuda...
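A quick way to check what actually landed in such an environment (a minimal sketch; the versions printed depend entirely on your install):

```
# Inspect the NCCL wheel pulled in by the cu118 PyTorch build,
# then ask PyTorch which CUDA / NCCL versions it is linked against.
pip show nvidia-nccl-cu11
python -c "import torch; print(torch.version.cuda, torch.cuda.nccl.version())"
```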

> @ispobock when downloading `nvidia-nccl-cu11`, I see `cu116`:
>
> ```
> # pip download nvidia-nccl-cu11
> Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com/
> Collecting nvidia-nccl-cu11
>   Downloading https://developer.download.nvidia.com/compute/redist/nvidia-nccl-cu11/nvidia-nccl-cu11-2022.5.19.tar.gz (16 kB)...
> ```

@nvcastet Could you fix the lint with `pre-commit run --all-files`?
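For a contributor who has not set the hooks up yet, the usual flow is standard pre-commit usage, nothing project-specific:

```
pip install pre-commit        # one-time setup
pre-commit install            # register the git hook for future commits
pre-commit run --all-files    # re-run every hook over the whole tree
```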

> If you want to force the prefix to be generated, is it more elegant to set a chat template? I personally think it is better than implementing it through code...
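For illustration, a chat template can be supplied at server launch. This is a hypothetical sketch, assuming sglang's `--chat-template` flag accepts a local file; the model path and template file are placeholders:

```
# Launch with a custom chat template that hard-codes the desired
# generation prefix (the template's contents are up to the user).
python -m sglang.launch_server \
  --model-path <your-model> \
  --chat-template ./my_chat_template.json
```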

What request rate did you set in the benchmark? `--enable-dp-attention` can improve throughput in high-QPS scenarios.
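As an example, sglang's serving benchmark exposes the request rate directly; the numbers below are purely illustrative:

```
# Drive the server at a fixed request rate (values are examples only).
python -m sglang.bench_serving --backend sglang \
  --num-prompts 2000 \
  --request-rate 16
```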

> Without MLA I'm not noticing any odd outputs.

@pseudotensor If you add `--enable-flashinfer-mla`, it will use MLA with the flashinfer backend. If you remove the option, it will use MLA...
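The two launch variants being compared would look like this (a sketch only; the model path is a placeholder):

```
# MLA through the flashinfer backend:
python -m sglang.launch_server --model-path <model> --enable-flashinfer-mla

# The same server without the flag, using the other MLA path described above:
python -m sglang.launch_server --model-path <model>
```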

@v-lmn

> why self.dp_size = self.tp_size

Currently, DP attention and TP attention cannot be combined. If you set `--tp 8 --enable-dp-attention`, it will only use 8-way data parallelism for the MLA part...
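Concretely, the configuration referred to above would be launched like this (illustrative command; the model path is a placeholder):

```
# With dp-attention enabled, the tp value drives data parallelism
# for the MLA/attention part rather than combining DP with TP.
python -m sglang.launch_server --model-path <model> --tp 8 --enable-dp-attention
```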

We have finished the spec module refactor and will support `nextn` in the next 1-2 weeks.