Raja Gond
I have observed higher CPU launch overhead when using CuPy's matrix multiplication compared to PyTorch's. While the GPU computation time is nearly identical for both, the CPU overhead for launching...
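A minimal sketch of one way to measure that per-call host overhead, assuming square float32 inputs on a single GPU; the sizes and iteration counts below are illustrative, not from the original report.

```python
import time
import cupy as cp
import torch

N, iters = 1024, 1000

# CuPy: time only the host-side calls, synchronize outside the timed loop
a_cp = cp.random.rand(N, N, dtype=cp.float32)
b_cp = cp.random.rand(N, N, dtype=cp.float32)
cp.matmul(a_cp, b_cp)                      # warm-up
cp.cuda.Device().synchronize()
t0 = time.perf_counter()
for _ in range(iters):
    cp.matmul(a_cp, b_cp)
t1 = time.perf_counter()
cp.cuda.Device().synchronize()
print(f"CuPy launch overhead:    {(t1 - t0) / iters * 1e6:.1f} us/call")

# PyTorch: same measurement on the default CUDA stream
a_t = torch.rand(N, N, device="cuda")
b_t = torch.rand(N, N, device="cuda")
torch.matmul(a_t, b_t)                     # warm-up
torch.cuda.synchronize()
t0 = time.perf_counter()
for _ in range(iters):
    torch.matmul(a_t, b_t)
t1 = time.perf_counter()
torch.cuda.synchronize()
print(f"PyTorch launch overhead: {(t1 - t0) / iters * 1e6:.1f} us/call")
```

Because nothing is synchronized inside the timed loops, the elapsed host time mostly reflects how long each library spends enqueueing the kernel rather than the GPU math itself.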
### Description
There is a significant performance discrepancy between 2D and 3D matrix multiplications in CuPy. When performing a matrix multiplication with 2D inputs, the operation completes significantly faster than...
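A hedged repro sketch for comparing the two cases, using a singleton batch dimension so the 3D call does exactly the same math as the 2D one; shapes and iteration counts are illustrative.

```python
import time
import cupy as cp

N, iters = 4096, 50
a2d = cp.random.rand(N, N, dtype=cp.float32)
b2d = cp.random.rand(N, N, dtype=cp.float32)
a3d = a2d[None, :, :]          # same data viewed as a batch of one
b3d = b2d[None, :, :]

def timed(fn):
    fn()                        # warm-up
    cp.cuda.Device().synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    cp.cuda.Device().synchronize()
    return (time.perf_counter() - t0) / iters * 1e3

print(f"2D matmul: {timed(lambda: cp.matmul(a2d, b2d)):.3f} ms")
print(f"3D matmul: {timed(lambda: cp.matmul(a3d, b3d)):.3f} ms")
```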
Hi, I followed the instructions given [here](https://github.com/facebookresearch/xformers?tab=readme-ov-file#installing-xformers) to build and install the latest xformers version. More specifically, I ran the command below, but it seems that the [sequence_parallel_fused kernel](https://github.com/facebookresearch/xformers/commit/342de87b6dcf6f6f1d410823479af0c14aa03317) is not...
I am trying to run inference with mistralai/Mixtral-8x22B-v0.1 model, but it is generating random output with an 8-way tensor parallel setup. Below are the details of the configuration and I...
**Describe the bug**
Negative communication time, which is unexpected.

**To Reproduce**
`torchrun --node_rank=0 --nproc_per_node=8 --nnodes=1 --rdzv_endpoint=127.0.0.1:23456 test/test_ag_kernel.py 1024 57344 8192 --dtype=bfloat16 --iters=100`

**Behavior**
```bash
SOL time for GEMM(M=1024,N=57344,K=8192,TP=8): 0.122ms
torch...
```
Hi, in standard tensor parallelism we typically have: Attention → Output Projection → All-Reduce → LayerNorm → FFN → All-Reduce → LayerNorm. However, in the paper, you use GEMV.AG and...
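For reference, a minimal sketch of the standard tensor-parallel block described in the question, with one all-reduce after the attention output projection and one after the FFN; the module arguments are placeholders, and this is not the paper's GEMV.AG formulation.

```python
import torch.distributed as dist

def tp_block(x, attn, out_proj, ffn, ln1, ln2):
    """Standard tensor-parallel transformer block: two all-reduces per layer."""
    x = out_proj(attn(x))       # attention + output projection (partial sums per rank)
    dist.all_reduce(x)          # all-reduce #1: sum partials across TP ranks
    x = ln1(x)
    x = ffn(x)                  # row-parallel FFN also produces partial sums
    dist.all_reduce(x)          # all-reduce #2
    return ln2(x)
```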
Hi, I want to understand how you implemented the end-to-end (e2e) overlap. Let's take the all-reduce after the MLP as an example. You break the all-reduce into a reduce-scatter and an all-gather. The reduce-scatter is overlapped...
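To make the decomposition concrete, here is a small sketch using plain torch.distributed collectives rather than the repo's fused kernels, showing that an all-reduce equals a reduce-scatter followed by an all-gather; it assumes the leading dimension is divisible by the world size.

```python
import torch
import torch.distributed as dist

def all_reduce_via_rs_ag(x: torch.Tensor) -> torch.Tensor:
    """all_reduce(x) == all_gather(reduce_scatter(x)), assuming x.shape[0] % world_size == 0."""
    world_size = dist.get_world_size()
    shard = torch.empty(x.shape[0] // world_size, *x.shape[1:],
                        dtype=x.dtype, device=x.device)
    dist.reduce_scatter_tensor(shard, x)        # each rank holds one fully reduced shard
    out = torch.empty_like(x)
    dist.all_gather_into_tensor(out, shard)     # shards reassembled on every rank
    return out
```

The point of the split is that each half can then be overlapped with an adjacent GEMM; the sketch above only shows the mathematical equivalence, not the overlap itself.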
```cpp
TORCH_CHECK(
    !fuse_reduction || input_dtype == at::ScalarType::Half,
    "Fuse reduction only support float16 type on SM80 due to instruction limitation.");
```
It explicitly restricts fused reduction to float16, regardless of GPU...
Why is GEMM + RS performing much worse than the torch baseline?
```python
# tuning space
space: List[TuningConfig] = []
space_M = [8192, 16384, 32768]
space_N = [8192]
space_K = [28672, 8192]
...
```
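For context, a hedged sketch of what an unfused torch baseline for GEMM + reduce-scatter typically looks like (a plain matmul followed by `reduce_scatter_tensor`); the shapes and sharding assumptions are illustrative and may not match the benchmark script.

```python
import torch
import torch.distributed as dist

def baseline_gemm_rs(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Unfused baseline: each rank computes its partial (M, N) GEMM, then reduce-scatters."""
    c = torch.matmul(a, b)                              # partial result on this rank
    out = torch.empty(c.shape[0] // dist.get_world_size(), c.shape[1],
                      dtype=c.dtype, device=c.device)
    dist.reduce_scatter_tensor(out, c)                  # (M / TP, N) reduced shard per rank
    return out
```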
Hi, are you also doing compute-communication overlap without using SM cores in the intra-node case? I searched your codebase but didn't find the implementation. Could you please clarify...