GUO-QING JIANG

18 comments by GUO-QING JIANG

I have the same questions: ![image](https://user-images.githubusercontent.com/11551984/225604076-1e16a706-7192-4854-95cc-fbd3f8c13bdf.png)

> Hi @Ageliss, could you share more details on your training setup? Most probable reason for the lower speedup observed is that there are other bottlenecks (most probably communication, since...

Hi @ptrendx, the -30% and +17% numbers are for end-to-end training speed. We checked the timeline and found that GEMM did not occupy much of one step, less than...
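
In case it helps, here is a minimal sketch of how one can check what fraction of a step the GEMM kernels actually take, assuming a PyTorch setup with `torch.profiler`; the model, shapes, and step count below are placeholders, not the real training configuration:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Placeholder model/optimizer purely for illustration; substitute the real training step.
model = torch.nn.Linear(4096, 4096).cuda()
optimizer = torch.optim.AdamW(model.parameters())

def train_step(x):
    out = model(x)
    loss = out.float().pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

x = torch.randn(8, 4096, device="cuda")

# Profile a few steps and compare the CUDA time of GEMM kernels against
# everything else in the step (communication, optimizer, elementwise ops, ...).
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(5):
        train_step(x)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```

If the GEMM rows are only a small share of total CUDA time, speeding up the GEMMs alone cannot translate into a large end-to-end gain.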

> Are you using FP8 training?

@jomayeri Any update on dealing with the increased memory usage with FP8?

I ran a benchmark on H800; it is maybe a little slower than H100. Hope it helps.
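
For context, a bench test like this is typically a plain GEMM timing loop; below is a minimal sketch of one, where the shapes, dtype, and iteration counts are assumptions rather than the exact configuration that was run:

```python
import torch

def time_gemm(m, n, k, dtype=torch.float16, iters=100):
    """Time an (m, k) x (k, n) matmul on the current GPU and report TFLOPS."""
    a = torch.randn(m, k, device="cuda", dtype=dtype)
    b = torch.randn(k, n, device="cuda", dtype=dtype)
    # Warm-up so cuBLAS heuristics and caches are settled before timing.
    for _ in range(10):
        torch.matmul(a, b)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        torch.matmul(a, b)
    end.record()
    torch.cuda.synchronize()
    ms = start.elapsed_time(end) / iters
    tflops = 2 * m * n * k / (ms * 1e-3) / 1e12
    return ms, tflops

for shape in [(4096, 4096, 4096), (8192, 8192, 8192)]:
    ms, tflops = time_gemm(*shape)
    print(f"{shape}: {ms:.3f} ms, {tflops:.1f} TFLOPS")
```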

Also, I have another question: how does Marlin perform compared with TRT-LLM's `__device__ void weight_only_batched_gemv()` (https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernel.h#L296)? Recently, a NeurIPS paper called QuIP also shared a version of W2~W4 GEMM, it...

> Hi, unfortunately, I don't have access to any H800s (or any Hopper GPUs for that matter), so it is a bit hard to test. Which of the matrix shapes...

> What training data did you use, and what is its size?

We actually used our human-annotated SFT dataset, about 110k~240k. We also masked out the system and user...
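
To make the masking concrete, here is a minimal sketch of the usual SFT loss-masking scheme, assuming a HuggingFace-style causal-LM loss where label `-100` is ignored; the role names and span bookkeeping below are placeholders, not our exact implementation:

```python
import torch

IGNORE_INDEX = -100  # positions with this label are excluded from the cross-entropy loss

def build_labels(input_ids, role_spans):
    """Keep the loss only on assistant tokens; mask out system and user tokens.

    role_spans: list of (start, end, role) token-index ranges for one sample.
    """
    labels = torch.full_like(input_ids, IGNORE_INDEX)
    for start, end, role in role_spans:
        if role == "assistant":
            labels[start:end] = input_ids[start:end]
    return labels

# Example: tokens 0-9 are the system + user prompt, tokens 10-19 are the assistant answer.
input_ids = torch.arange(20)
labels = build_labels(input_ids, [(0, 10, "user"), (10, 20, "assistant")])
print(labels)  # first ten entries are -100, the rest mirror input_ids
```

With labels built this way, the model is only penalized for its predictions on the assistant's tokens, so the system prompt and user turns do not contribute gradient.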