wdykas

Results 7 issues of wdykas

### Your current environment The output of `python collect_env.py` ```text PyTorch version: 2.4.0+cu121 Is debug build: False CUDA used to build PyTorch: 12.1 ROCM used to build PyTorch: N/A OS:...

bug

I am attempting to use CUDA graphs with `cudaMallocAsync` and Mamba. The code seems to work fine if I am using the regular allocator and CUDA graphs but I am...

When running simple inference with Mamba 2 on an H100 at a long sequence length (`512000`), I am hitting an illegal memory access in `_mamba_chunk_scan_combined_fwd`: ``` File "/usr/local/lib/python3.10/dist-packages/mamba_ssm/ops/triton/ssd_combined.py", line 315, in _mamba_chunk_scan_combined_fwd dA_cumsum, dt...

# Description I want to be able to control the number of splits in FA3. This PR exposes that argument for non-context-parallel cases. ## Type of change - [ ] Documentation change (change...

# Description This is a more memory-efficient way to use symmetric-memory all-reduces. We use a pool of symmetric memory that we grow if we need it to...

community-contribution

# What does this PR do ? This is the initial starting point for native fast weight resharding in Megatron. The fast path is being added; it just might not be...

# What does this PR do ? This PR is for batch invariance, particularly for RL, so that we can get a complete match between Megatron inference and training. This is in the...