Driss Guessous

Results: 183 comments of Driss Guessous

@eqy Thanks for staying on top of this. I am going to close this issue since the stride PR has landed and we root-caused the other slowdown.

```
❯ python rc.py
---------------------------------------------SDPA-Flash---------------------------------------------
ALL GOOD
---------------------------------------------SDPA-CuDNN---------------------------------------------
ALL GOOD
❯ pip freeze | grep torch
-e git+https://github.com/pytorch-labs/attention-gym@2e4d04aa1c500879400ba2547e106f135fd5a4c1#egg=attn_gym
pytorch-triton==3.1.0
torch==2.6.0+cu124
# Editable install with no version control (torchao==0.6.1)
~/meta/scripts/sdpa...
```
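The `rc.py` script itself isn't included in the comment; as a rough illustration only, a minimal sketch of this kind of backend sanity check, assuming a plain forward pass under each SDPA backend (shapes, tolerances, and the print format here are made up, not taken from the original script):

```python
# Sketch only: rc.py is not shown above. This runs scaled_dot_product_attention
# under the flash and cuDNN backends and compares against the math backend.
import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

q, k, v = (torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16) for _ in range(3))

with sdpa_kernel(SDPBackend.MATH):
    ref = F.scaled_dot_product_attention(q, k, v)

for name, backend in [("SDPA-Flash", SDPBackend.FLASH_ATTENTION),
                      ("SDPA-CuDNN", SDPBackend.CUDNN_ATTENTION)]:
    print(f"{name:-^100}")
    with sdpa_kernel(backend):
        out = F.scaled_dot_product_attention(q, k, v)
    # Loose tolerances since the reference is also computed in fp16.
    print("ALL GOOD" if torch.allclose(out, ref, atol=2e-3, rtol=2e-3) else "MISMATCH")
```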

For the manual API, why have both a string and an `int4wo(group_size)` version? I think it would be cleaner to just have one of these.

> so the motivation for string is so that people don't need to import anything to use it, it's just a simple shortcut and we'll make sure to align the names...
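For illustration only, a sketch of the two spellings being compared: `int4_weight_only` is the config-style entry point in torchao, while the string form is a hypothetical shortcut along the lines of the quoted reply, not a confirmed API:

```python
# Sketch of the two styles under discussion; the string spelling is hypothetical.
import torch
from torchao.quantization import quantize_, int4_weight_only

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).to(torch.bfloat16).cuda()

# Config-object form: requires an import, but the arguments are explicit.
quantize_(model, int4_weight_only(group_size=64))

# String-shortcut form from the quoted reply: no import needed, the group size
# is baked into the name (illustrative spelling only).
# quantize_(model, "int4wo-64")
```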

Should we try and land this one?

Can you make sure to set https://github.com/pytorch/ao/blob/e7b33bc91c831d10249c1222c8b4b667f18f28b7/torchao/float8/config.py#L246 to True?

@jeffdaily Have you verified that the existing fp8 routines work on ROCm? Unfortunately we still don't have ROCm runners in CI/CD, and at least personally I don't have much access to...

@clintg6 You need to specify the dtype for these to be the `fnuz` variant, e.g. `float8_dynamic_activation_float8_weight(torch.float8_e4m3fnuz, torch.float8_e4m3fnuz)`.
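For context, a minimal sketch of how that call slots into `quantize_`; the model and shapes are placeholders, and only the dtype arguments come from the comment above:

```python
# Sketch: pass the ROCm 'fnuz' float8 dtypes explicitly, since the defaults
# target the e4m3fn variant.
import torch
from torchao.quantization import quantize_, float8_dynamic_activation_float8_weight

model = torch.nn.Sequential(torch.nn.Linear(4096, 4096)).to(torch.bfloat16).cuda()
quantize_(
    model,
    float8_dynamic_activation_float8_weight(torch.float8_e4m3fnuz, torch.float8_e4m3fnuz),
)
```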

The discussion was that dynamic rank is less common, and in that case `-1` will likely not work. But most common quantization schemes with varying sizes in fixed-rank tensors...
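To make that concrete, a small illustration of the block-size bookkeeping being referred to; the shape and group size are made up, and this is plain tensor arithmetic rather than any particular torchao API:

```python
# Illustration: for a fixed-rank weight, a per-group scheme can spell out every
# dimension of the block size explicitly, so a -1 wildcard isn't needed.
import torch

weight = torch.randn(4096, 11008)   # rank is fixed and known up front
group_size = 128

block_size = (1, group_size)         # one output row per block, group_size columns
n_blocks = [dim // blk for dim, blk in zip(weight.shape, block_size)]
print(n_blocks)  # [4096, 86]

# A dynamic-rank tensor would need a wildcard such as -1 for "the whole dim",
# which is the less common case described above.
```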