Raja Gond comments

Results: 14 comments by Raja Gond

Sequence Parallel Fused Kernel Not Getting Built

Thanks @lw @danthe3rd ! Could you please briefly explain the differences between the three types of kernels: fused, fused_no_wait, and fused_no_wait_no_memcpy? It seems that only the output from the fused...

[Bug]: ERROR 07-26 14:50:35 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 214281 died, exit code: -11

Any update on this? I tried with `NVIDIA Nsight Systems version 2023.4.1.97-234133557503v0`, but even with that, it is not working.

[QUESTION] E2E Overlap: Flux design

@wenlei-bao @houqi

[QUESTION] E2E Overlap: Flux design

![Image](https://github.com/user-attachments/assets/e2b57195-6db3-4698-9395-ff22f8ea810b)

In the end-to-end (E2E) implementation, you have used Tensor Parallelism, correct? Sorry, I’m a bit confused.

##### Dense Baseline Workflow

pre-projection → Attention → post-projection → All Reduce →...

[QUESTION] Gemm +RS on 8xH100

python = 3.11
torch = 2.6.0

[Misc]: Random Output Generation with mistralai/Mixtral-8x22B-v0.1

PyTorch version: 2.2.1
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang...

Regarding GEMV.AG and O.AG

In my experiment, I found that AG costs approximately 55–58% as much as AR. However, I still don't fully understand the math behind it. For example, in LLaMA-70B on 8×A100,...
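
For what it's worth, a rough way to reason about the AG vs. AR ratio is the standard bandwidth-only ring model. The sketch below is an assumption, not the project's own cost model: it assumes ring algorithms, ignores latency and kernel launch overhead, and takes the all-gather output to be the same size as the all-reduce buffer. The function names are hypothetical. Under those assumptions AG moves half the bytes of AR, so a measured 55–58% would be in the expected range once fixed overheads are included.

```python
def ring_allreduce_bytes(size_bytes: float, n_gpus: int) -> float:
    """Bytes each GPU sends in a ring all-reduce.

    A ring all-reduce is a reduce-scatter followed by an all-gather,
    each moving (n - 1) / n of the buffer per GPU.
    """
    return 2 * (n_gpus - 1) / n_gpus * size_bytes


def ring_allgather_bytes(size_bytes: float, n_gpus: int) -> float:
    """Bytes each GPU sends in a ring all-gather.

    size_bytes is the size of the full gathered output (assumed equal
    to the all-reduce buffer size for this comparison).
    """
    return (n_gpus - 1) / n_gpus * size_bytes


def ag_to_ar_ratio(size_bytes: float, n_gpus: int) -> float:
    """Bandwidth-only cost ratio of all-gather to all-reduce."""
    return ring_allgather_bytes(size_bytes, n_gpus) / ring_allreduce_bytes(size_bytes, n_gpus)


if __name__ == "__main__":
    # Hypothetical example: a 16 MiB buffer on 8 GPUs.
    print(ag_to_ar_ratio(16 * 1024 * 1024, 8))
```

The ratio does not depend on the buffer size or GPU count in this model; any gap between it and a measured value would come from terms the model leaves out, such as per-step latency.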

Regarding GEMV.AG and O.AG

Thanks, this was really helpful. Also, you might want to update the snapshot ID of the Hugging Face models. The snapshot ID is hardcoded in the source code, and I...

Regarding GEMV.AG and O.AG

Could you also clarify what you mean by `batch_size`? Does `batch_size` refer to the shape `(B, seq_length)`?

```py
global_batch_size = 2048
decode_batch_size = 1280
prefill_batch_size = 768
```

Above seems...

Regarding GEMV.AG and O.AG

How do I run it on other GPUs? It seems like I have to manually profile each operation for different sizes and SMs before I can use autosearch.