Ke Bao

60 comments by Ke Bao

Hi @nvcastet , pynccl cannot be removed since it's used for the cu118 environment. PyTorch cu118 installs `nvidia-nccl-cu11`, which ships the CUDA 11.0 build of NCCL by default. But cuda...
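A quick way to check what actually landed in such an environment (a minimal sketch; the versions printed depend entirely on your install):

```
# Inspect the NCCL wheel pulled in by the cu118 PyTorch build,
# then ask PyTorch which CUDA / NCCL versions it is linked against.
pip show nvidia-nccl-cu11
python -c "import torch; print(torch.version.cuda, torch.cuda.nccl.version())"
```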

> @ispobock when downloading `nvidia-nccl-cu11`, I see `cu116`:
>
> ```
> # pip download nvidia-nccl-cu11
> Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com/
> Collecting nvidia-nccl-cu11
>   Downloading https://developer.download.nvidia.com/compute/redist/nvidia-nccl-cu11/nvidia-nccl-cu11-2022.5.19.tar.gz (16 kB)...
> ```

@nvcastet Could you fix the lint with `pre-commit run --all-files`?
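For a contributor who has not set the hooks up yet, the usual flow is standard pre-commit usage, nothing project-specific:

```
pip install pre-commit        # one-time setup
pre-commit install            # register the git hook for future commits
pre-commit run --all-files    # re-run every hook over the whole tree
```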

> If you want to force the prefix to be generated, is it more elegant to set a chat template? I personally think it is better than implementing it through code...
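For illustration, a chat template can be supplied at server launch. This is a hypothetical sketch, assuming sglang's `--chat-template` flag accepts a local file; the model path and template file are placeholders:

```
# Launch with a custom chat template that hard-codes the desired
# generation prefix (the template's contents are up to the user).
python -m sglang.launch_server \
  --model-path <your-model> \
  --chat-template ./my_chat_template.json
```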

What request rate did you set in the benchmark? `--enable-dp-attention` can improve throughput in high-QPS scenarios.
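As an example, sglang's serving benchmark exposes the request rate directly; the numbers below are purely illustrative:

```
# Drive the server at a fixed request rate (values are examples only).
python -m sglang.bench_serving --backend sglang \
  --num-prompts 2000 \
  --request-rate 16
```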

> Without MLA I'm not noticing any odd outputs.

@pseudotensor If you add `--enable-flashinfer-mla`, it will use MLA with the flashinfer backend. If you remove the option, it will use MLA...
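The two launch variants being compared would look like this (a sketch only; the model path is a placeholder):

```
# MLA through the flashinfer backend:
python -m sglang.launch_server --model-path <model> --enable-flashinfer-mla

# The same server without the flag, using the other MLA path described above:
python -m sglang.launch_server --model-path <model>
```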

@v-lmn

> why self.dp_size = self.tp_size

Currently, DP attention and TP attention cannot be combined. If you set `--tp 8 --enable-dp-attention`, it will only use 8-way data parallelism for the MLA part...
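Concretely, the configuration referred to above would be launched like this (illustrative command; the model path is a placeholder):

```
# With dp-attention enabled, the tp value drives data parallelism
# for the MLA/attention part rather than combining DP with TP.
python -m sglang.launch_server --model-path <model> --tp 8 --enable-dp-attention
```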

We have finished the spec module refactor and will support `nextn` in the next 1-2 weeks.