GUO-QING JIANG

18 comments by GUO-QING JIANG

I have the same questions: ![image](https://user-images.githubusercontent.com/11551984/225604076-1e16a706-7192-4854-95cc-fbd3f8c13bdf.png)

> Hi @Ageliss, could you share more details on your training setup? Most probable reason for the lower speedup observed is that there are other bottlenecks (most probably communication, since...

Hi @ptrendx, the -30% and +17% numbers are for end-to-end training speed. We checked the timeline and found that GEMM did not occupy much of one step, less than...
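
In case it helps, here is a minimal sketch of how one can check what fraction of a step the GEMM kernels actually take, assuming a PyTorch setup with `torch.profiler`; the model, shapes, and step count below are placeholders, not the real training configuration:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Placeholder model/optimizer purely for illustration; substitute the real training step.
model = torch.nn.Linear(4096, 4096).cuda()
optimizer = torch.optim.AdamW(model.parameters())

def train_step(x):
    out = model(x)
    loss = out.float().pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

x = torch.randn(8, 4096, device="cuda")

# Profile a few steps and compare the CUDA time of GEMM kernels against
# everything else in the step (communication, optimizer, elementwise ops, ...).
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(5):
        train_step(x)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```

If the GEMM rows are only a small share of total CUDA time, speeding up the GEMMs alone cannot translate into a large end-to-end gain.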

> Are you using FP8 training?

@jomayeri Any update on dealing with the increased memory usage with FP8?

I ran a benchmark on H800; it is maybe a little slower than H100. Hope it helps.
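
For context, a bench test like this is typically a plain GEMM timing loop; below is a minimal sketch of one, where the shapes, dtype, and iteration counts are assumptions rather than the exact configuration that was run:

```python
import torch

def time_gemm(m, n, k, dtype=torch.float16, iters=100):
    """Time an (m, k) x (k, n) matmul on the current GPU and report TFLOPS."""
    a = torch.randn(m, k, device="cuda", dtype=dtype)
    b = torch.randn(k, n, device="cuda", dtype=dtype)
    # Warm-up so cuBLAS heuristics and caches are settled before timing.
    for _ in range(10):
        torch.matmul(a, b)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        torch.matmul(a, b)
    end.record()
    torch.cuda.synchronize()
    ms = start.elapsed_time(end) / iters
    tflops = 2 * m * n * k / (ms * 1e-3) / 1e12
    return ms, tflops

for shape in [(4096, 4096, 4096), (8192, 8192, 8192)]:
    ms, tflops = time_gemm(*shape)
    print(f"{shape}: {ms:.3f} ms, {tflops:.1f} TFLOPS")
```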

Also, I have another question: how does Marlin perform compared with TRT-LLM's `__device__ void weight_only_batched_gemv()` (https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernel.h#L296)? Recently, a NeurIPS paper called QuIP also shared a version of W2~W4 GEMM, it...

> Hi, unfortunately, I don't have access to any H800s (or any Hopper GPUs for that matter), so it is a bit hard to test. Which of the matrix shapes...

> What training data did you use, and what is its size?

We actually used our human-annotated SFT dataset, about 110k~240k. We also masked out the system and user...
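
To make the masking concrete, here is a minimal sketch of the usual SFT loss-masking scheme, assuming a HuggingFace-style causal-LM loss where label `-100` is ignored; the role names and span bookkeeping below are placeholders, not our exact implementation:

```python
import torch

IGNORE_INDEX = -100  # positions with this label are excluded from the cross-entropy loss

def build_labels(input_ids, role_spans):
    """Keep the loss only on assistant tokens; mask out system and user tokens.

    role_spans: list of (start, end, role) token-index ranges for one sample.
    """
    labels = torch.full_like(input_ids, IGNORE_INDEX)
    for start, end, role in role_spans:
        if role == "assistant":
            labels[start:end] = input_ids[start:end]
    return labels

# Example: tokens 0-9 are the system + user prompt, tokens 10-19 are the assistant answer.
input_ids = torch.arange(20)
labels = build_labels(input_ids, [(0, 10, "user"), (10, 20, "assistant")])
print(labels)  # first ten entries are -100, the rest mirror input_ids
```

With labels built this way, the model is only penalized for its predictions on the assistant's tokens, so the system prompt and user turns do not contribute gradient.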