Raja Gond
Thanks @lw @danthe3rd ! Could you please briefly explain the differences between the three types of kernels: fused, fused_no_wait, and fused_no_wait_no_memcpy? It seems that only the output from the fused...
Any update on this? I tried with `NVIDIA Nsight Systems version 2023.4.1.97-234133557503v0`, but even with that, it is not working.
@wenlei-bao @houqi
In the end-to-end (E2E) implementation, you have used Tensor Parallelism, correct? Sorry, I’m a bit confused.

##### Dense Baseline Workflow
pre-projection → Attention → post-projection → All Reduce →...
python = 3.11
torch = 2.6.0
PyTorch version: 2.2.1
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang...
In my experiment, I found that AG (all-gather) costs approximately 55–58% as much as AR (all-reduce). However, I still don't fully understand the math behind it. For example, in LLaMA-70B on 8×A100,...
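For reference, here is a minimal sketch of the standard bandwidth-only cost model for ring collectives, which may account for most of that ratio. It assumes both collectives use a ring algorithm over the same total message size; the function names and the 64 MiB message size are hypothetical, and latency, kernel-launch overhead, and topology effects are ignored.

```python
def ring_allgather_bytes(total_bytes: float, n_gpus: int) -> float:
    """Bytes each GPU sends in a ring all-gather whose gathered output is total_bytes."""
    return (n_gpus - 1) / n_gpus * total_bytes


def ring_allreduce_bytes(total_bytes: float, n_gpus: int) -> float:
    """Bytes each GPU sends in a ring all-reduce over a total_bytes buffer.

    A ring all-reduce is a reduce-scatter followed by an all-gather,
    and each phase moves (n_gpus - 1) / n_gpus * total_bytes per GPU.
    """
    return 2 * (n_gpus - 1) / n_gpus * total_bytes


n_gpus = 8                      # e.g. one 8-GPU node
message_bytes = 64 * 1024**2    # hypothetical 64 MiB message, for illustration only

ratio = ring_allgather_bytes(message_bytes, n_gpus) / ring_allreduce_bytes(message_bytes, n_gpus)
print(f"AG / AR traffic ratio: {ratio:.2f}")
```

Under these assumptions the ratio is 0.5 for any GPU count, so a measured 55–58% would be consistent with fixed per-collective overheads that do not shrink along with the traffic. That last part is a guess, not something this sketch measures.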
Thanks, this was really helpful. Also, you might want to update the snapshot ID of the Hugging Face models. The snapshot ID is hardcoded in the source code, and I...
Could you also clarify what you mean by `batch_size`? Does `batch_size` refer to the shape `(B, seq_length)`?

```py
global_batch_size = 2048
decode_batch_size = 1280
prefill_batch_size = 768
```

Above seems...
How do I run it on other GPUs? It seems like I have to manually profile each operation for different sizes and SMs before I can use autosearch.