Raja Gond
I have observed higher CPU launch overhead when using CuPy's matrix multiplication compared to PyTorch's. While the GPU computation time is nearly identical for both, the CPU overhead for launching...
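A minimal sketch of one way to measure that per-call host overhead, assuming square float32 inputs on a single GPU; the sizes and iteration counts below are illustrative, not from the original report.

```python
import time
import cupy as cp
import torch

N, iters = 1024, 1000

# CuPy: time only the host-side calls, synchronize outside the timed loop
a_cp = cp.random.rand(N, N, dtype=cp.float32)
b_cp = cp.random.rand(N, N, dtype=cp.float32)
cp.matmul(a_cp, b_cp)                      # warm-up
cp.cuda.Device().synchronize()
t0 = time.perf_counter()
for _ in range(iters):
    cp.matmul(a_cp, b_cp)
t1 = time.perf_counter()
cp.cuda.Device().synchronize()
print(f"CuPy launch overhead:    {(t1 - t0) / iters * 1e6:.1f} us/call")

# PyTorch: same measurement on the default CUDA stream
a_t = torch.rand(N, N, device="cuda")
b_t = torch.rand(N, N, device="cuda")
torch.matmul(a_t, b_t)                     # warm-up
torch.cuda.synchronize()
t0 = time.perf_counter()
for _ in range(iters):
    torch.matmul(a_t, b_t)
t1 = time.perf_counter()
torch.cuda.synchronize()
print(f"PyTorch launch overhead: {(t1 - t0) / iters * 1e6:.1f} us/call")
```

Because nothing is synchronized inside the timed loops, the elapsed host time mostly reflects how long each library spends enqueueing the kernel rather than the GPU math itself.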
### Description
There is a significant performance discrepancy between 2D and 3D matrix multiplications in CuPy. When performing a matrix multiplication with 2D inputs, the operation completes significantly faster than...
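A hedged repro sketch for comparing the two cases, using a singleton batch dimension so the 3D call does exactly the same math as the 2D one; shapes and iteration counts are illustrative.

```python
import time
import cupy as cp

N, iters = 4096, 50
a2d = cp.random.rand(N, N, dtype=cp.float32)
b2d = cp.random.rand(N, N, dtype=cp.float32)
a3d = a2d[None, :, :]          # same data viewed as a batch of one
b3d = b2d[None, :, :]

def timed(fn):
    fn()                        # warm-up
    cp.cuda.Device().synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    cp.cuda.Device().synchronize()
    return (time.perf_counter() - t0) / iters * 1e3

print(f"2D matmul: {timed(lambda: cp.matmul(a2d, b2d)):.3f} ms")
print(f"3D matmul: {timed(lambda: cp.matmul(a3d, b3d)):.3f} ms")
```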
Hi, I followed the instructions given [here](https://github.com/facebookresearch/xformers?tab=readme-ov-file#installing-xformers) to build and install the latest xformers version. More specifically, I ran the command below, but it seems that the [sequence_parallel_fused kernel](https://github.com/facebookresearch/xformers/commit/342de87b6dcf6f6f1d410823479af0c14aa03317) is not...
I am trying to run inference with mistralai/Mixtral-8x22B-v0.1 model, but it is generating random output with an 8-way tensor parallel setup. Below are the details of the configuration and I...
**Describe the bug**
Negative communication time, which is unexpected.

**To Reproduce**
`torchrun --node_rank=0 --nproc_per_node=8 --nnodes=1 --rdzv_endpoint=127.0.0.1:23456 test/test_ag_kernel.py 1024 57344 8192 --dtype=bfloat16 --iters=100`

**Behavior**
```bash
SOL time for GEMM(M=1024,N=57344,K=8192,TP=8): 0.122ms
torch...
```
Hi, in standard tensor parallelism we typically have: Attention → Output Projection → All-Reduce → LayerNorm → FFN → All-Reduce → LayerNorm. However, in the paper, you use GEMV.AG and...
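For reference, a minimal sketch of the standard tensor-parallel block described in the question, with one all-reduce after the attention output projection and one after the FFN; the module arguments are placeholders, and this is not the paper's GEMV.AG formulation.

```python
import torch.distributed as dist

def tp_block(x, attn, out_proj, ffn, ln1, ln2):
    """Standard tensor-parallel transformer block: two all-reduces per layer."""
    x = out_proj(attn(x))       # attention + output projection (partial sums per rank)
    dist.all_reduce(x)          # all-reduce #1: sum partials across TP ranks
    x = ln1(x)
    x = ffn(x)                  # row-parallel FFN also produces partial sums
    dist.all_reduce(x)          # all-reduce #2
    return ln2(x)
```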
Hi, I want to understand how you implemented the end-to-end (e2e) overlap. Let's take the all-reduce after the MLP as an example. You break the all-reduce into a reduce-scatter and an all-gather. The reduce-scatter is overlapped...
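To make the decomposition concrete, here is a small sketch using plain torch.distributed collectives rather than the repo's fused kernels, showing that an all-reduce equals a reduce-scatter followed by an all-gather; it assumes the leading dimension is divisible by the world size.

```python
import torch
import torch.distributed as dist

def all_reduce_via_rs_ag(x: torch.Tensor) -> torch.Tensor:
    """all_reduce(x) == all_gather(reduce_scatter(x)), assuming x.shape[0] % world_size == 0."""
    world_size = dist.get_world_size()
    shard = torch.empty(x.shape[0] // world_size, *x.shape[1:],
                        dtype=x.dtype, device=x.device)
    dist.reduce_scatter_tensor(shard, x)        # each rank holds one fully reduced shard
    out = torch.empty_like(x)
    dist.all_gather_into_tensor(out, shard)     # shards reassembled on every rank
    return out
```

The point of the split is that each half can then be overlapped with an adjacent GEMM; the sketch above only shows the mathematical equivalence, not the overlap itself.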
```cpp
TORCH_CHECK(
    !fuse_reduction || input_dtype == at::ScalarType::Half,
    "Fuse reduction only support float16 type on SM80 due to instruction limitation.");
```
It explicitly restricts fused reduction to float16, regardless of GPU...
Why is GEMM + RS performing much worse than the torch baseline?
```python
# tuning space
space: List[TuningConfig] = []
space_M = [8192, 16384, 32768]
space_N = [8192]
space_K = [28672, 8192]
...
```
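For context, a hedged sketch of what an unfused torch baseline for GEMM + reduce-scatter typically looks like (a plain matmul followed by `reduce_scatter_tensor`); the shapes and sharding assumptions are illustrative and may not match the benchmark script.

```python
import torch
import torch.distributed as dist

def baseline_gemm_rs(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Unfused baseline: each rank computes its partial (M, N) GEMM, then reduce-scatters."""
    c = torch.matmul(a, b)                              # partial result on this rank
    out = torch.empty(c.shape[0] // dist.get_world_size(), c.shape[1],
                      dtype=c.dtype, device=c.device)
    dist.reduce_scatter_tensor(out, c)                  # (M / TP, N) reduced shard per rank
    return out
```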
Hi, are you also doing compute-communication overlap without using SM cores in the intra-node case? I searched your codebase but didn't find the implementation. Could you please clarify...