Driss Guessous

58 issues by Driss Guessous

# Summary
They have recently published a lot of good upcast and downcast kernels: https://github.com/triton-lang/triton/blob/main/python/triton_kernels/triton_kernels/numerics_details/mxfp.py. We should update the ones we have in AO and benchmark them against Inductor.
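A minimal sketch of what "benchmark against Inductor" could look like, assuming `triton.testing.do_bench` as the timer; `mxfp_upcast` is a placeholder for the triton_kernels entry point, and the plain `.to()` cast stands in for the real upcast semantics:

```python
# Hedged sketch: time a candidate Triton upcast against an Inductor-compiled
# reference. `mxfp_upcast` is a placeholder name, not the actual triton_kernels API.
import torch
from triton.testing import do_bench

def reference_upcast(x: torch.Tensor) -> torch.Tensor:
    # Stand-in for the real mxfp -> bf16 upcast semantics.
    return x.to(torch.bfloat16)

compiled_upcast = torch.compile(reference_upcast)  # Inductor baseline

x = torch.randint(0, 256, (4096, 4096), dtype=torch.uint8, device="cuda")

ms_inductor = do_bench(lambda: compiled_upcast(x))
# ms_triton = do_bench(lambda: mxfp_upcast(x))  # triton_kernels candidate
print(f"Inductor baseline: {ms_inductor:.3f} ms")
```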

mx
performance

# Summary
The naming / project structure is confusing. I think we should move this project to the prototype folder.

triaged

# Summary
To run: `TRITON_ALWAYS_COMPILE=1 TRITON_DUMP_DIR=my_directory_2 TRITON_KERNEL_DUMP=1 pytest -s -v test/prototype/mx_formats/test_custom_cast.py -k "test_fp4_triton_unscaled_cast"`

Bad TTIR on the left, good on the right; no real differences: https://www.diffchecker.com/ueX5YZw4
TTGIR: https://www.diffchecker.com/M5PS6QJg/
Differences in PTX: https://www.diffchecker.com/8mseNnKA/

mx
triaged

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
* __->__ #150331

## Summary
See https://github.com/pytorch/pytorch/issues/150321 for more details.

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng...

Stale
module: inductor
ciflow/inductor

## Purpose
Fix hardcoded CUDA version in torchao installation to use dynamic CUDA version detection.

## Test Plan
Run the quantization...
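A minimal sketch of the dynamic detection, assuming the install step can query the local torch build for its CUDA version and map it to a wheel index suffix (the exact index URL used by the workflow is an assumption):

```python
# Hedged sketch: derive the CUDA suffix (e.g. "cu126") from the installed torch
# build instead of hardcoding it. The index URL below is an assumed example.
import torch

cuda = torch.version.cuda  # e.g. "12.6", or None on CPU-only builds
suffix = "cpu" if cuda is None else "cu" + cuda.replace(".", "")
index_url = f"https://download.pytorch.org/whl/{suffix}"
print(f"pip install torchao --index-url {index_url}")
```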

ready
ci/build

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
* __->__ #167040

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben

topic: not user facing
module: inductor
ciflow/inductor

# Summary
- Implement block-sparse attention in flash_fwd_sm100.py
- Update interface.py to handle SM100 block size calculations (2x multiplier for m_block_size, since one CTA handles 2*tile_m rows); see the sketch after this list
- Add mask_mod...
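An illustrative sketch (not the actual interface.py code) of the SM100 rule referenced above: since one CTA covers 2 * tile_m rows, the m block size is doubled before computing how many m blocks tile the sequence:

```python
# Hedged sketch of the SM100 block-size rule; function and parameter names are
# illustrative, not taken from interface.py.
import math

def sm100_num_m_blocks(seqlen_q: int, tile_m: int) -> int:
    m_block_size = 2 * tile_m  # 2x multiplier: one CTA handles 2 * tile_m rows
    return math.ceil(seqlen_q / m_block_size)

# e.g. seqlen 1024 with tile_m=128 -> block size 256 -> 4 m blocks
assert sm100_num_m_blocks(seqlen_q=1024, tile_m=128) == 4
```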

# Summary
Getting an ICE for this test.

## Repro
```shell
git clone git@github.com:Dao-AILab/flash-attention.git
pip install flash_attn/cute/
pytest -v tests/cute/test_flash_attn.py -k "test_flash_attn_combine[1-1-64-dtype0]"
```

## Output
```shell
============================= test session...
```