Li Zhang
V100 AWQ/GPTQ support just landed in #2090 and hasn't been released yet.
You can try the nightly build in the meantime: https://github.com/zhyncs/lmdeploy-build/releases/tag/b28a1d0
> And the difference should not be significant on A100. I have roughly verified it using SGLang's Marlin AWQ and LMDeploy TurboMind's AWQ on Llama 3.1 8B Instruct, and their...
The TP size changes the degree of parallelism of the Linear layers along the k dimension, which changes the accumulation order. Floating-point addition is not associative, so different accumulation orders produce slightly different results.
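The accumulation-order effect can be sketched in plain Python. The chunked summation below is a hypothetical stand-in for how TP partitions a Linear layer's k dimension; it is not LMDeploy code:

```python
import random

random.seed(0)
xs = [random.uniform(-1.0, 1.0) for _ in range(1 << 16)]

def sum_in_chunks(values, num_chunks):
    """Sum `values` as `num_chunks` partial sums reduced at the end,
    mimicking how TP splits the k dimension across ranks."""
    step = len(values) // num_chunks
    partials = [sum(values[i * step:(i + 1) * step]) for i in range(num_chunks)]
    total = 0.0
    for p in partials:  # final reduce over the partial sums
        total += p
    return total

tp1 = sum_in_chunks(xs, 1)
tp2 = sum_in_chunks(xs, 2)
tp4 = sum_in_chunks(xs, 4)

# Float addition is not associative:
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))  # False
# The chunked sums agree only up to rounding error, not bit-exactly
print(abs(tp1 - tp2), abs(tp1 - tp4))
```

The same mathematics, evaluated with different grouping, lands on different floating-point results; that is all a TP-size change does to the output.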
My guess is that the 2080 Ti doesn't support bf16.
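A quick way to check: bf16 requires compute capability >= 8.0 (Ampere), while the 2080 Ti is Turing (sm_75). A minimal sketch of that cutoff (`supports_bf16` is a hypothetical helper; with PyTorch you would consult `torch.cuda.get_device_capability()`):

```python
def supports_bf16(major: int, minor: int) -> bool:
    """bf16 needs compute capability >= 8.0 (Ampere and newer).
    The 2080 Ti is Turing (sm_75), below the cutoff."""
    return (major, minor) >= (8, 0)

print(supports_bf16(7, 5))  # 2080 Ti -> False
print(supports_bf16(8, 0))  # A100   -> True
```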
We need to benchmark the ar/ag case on different systems (NVLink/PCIe) first. https://github.com/NVIDIA/nccl-tests
@irexyc The bus bandwidths of all-reduce and all-gather are computed differently.
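Per the nccl-tests performance notes, bus bandwidth is algorithm bandwidth scaled by a collective-specific factor, so the two numbers are not directly comparable. A minimal sketch:

```python
def allreduce_busbw(algbw: float, n: int) -> float:
    # Ring all-reduce moves 2*(n-1)/n of the data per rank
    return algbw * 2 * (n - 1) / n

def allgather_busbw(algbw: float, n: int) -> float:
    # All-gather moves (n-1)/n of the total buffer per rank
    return algbw * (n - 1) / n

# Same algorithm bandwidth on 8 GPUs: the correction factors differ by 2x
print(allreduce_busbw(100.0, 8))  # 175.0
print(allgather_busbw(100.0, 8))  # 87.5
```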
May be fixed by #2201
@irexyc is this still WIP?
The input dim of `attention.output` should be computed as `head_num * head_dim`. The use of `hidden_units_` is a bug.
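A hypothetical config shows why the two can differ (the numbers below are made up; some models decouple `head_dim` from `hidden_size / head_num`):

```python
# Hypothetical model config where hidden_size != head_num * head_dim
hidden_units = 4096
head_num = 32
head_dim = 96  # not hidden_units // head_num (which would be 128)

# Correct input dim of attention.output (o_proj): the concatenated head outputs
attn_output_in_dim = head_num * head_dim
print(attn_output_in_dim)                  # 3072
print(attn_output_in_dim == hidden_units)  # False
```

Using `hidden_units_` happens to work only when `head_dim == hidden_units / head_num`, which is why the bug can go unnoticed on most models.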