mobicham
Thanks @davidberard98, for now I am using `torch==2.6.0.dev20241101+cu121`, which works fine. This is what I get with **gdb** (`2.6.0.dev20241112+cu121`):

```
(gdb) run
Starting program: /opt/conda/bin/python test_torch.py
warning: Error disabling...
```
Sorry for the delay @davidberard98, I just tried with the nightly build and luckily it's working this time (`2.6.0.dev20241218+cu124`). Really appreciate your support, thanks!
@antiagainst indeed:

```python
dtype = torch.float8_e5m2      # v_mfma_f32_32x32x8_f16      a[0:15], v[4:5],   v[18:19], a[0:15]
dtype = torch.float8_e4m3fnuz  # v_mfma_f32_32x32x16_fp8_fp8 a[0:15], v[12:13], v[0:1],   a[0:15]
dtype = torch.float8_e4m3fn    # ERROR
```

So `float8_e4m3fnuz` seems to...
@antiagainst from the official AMD doc, it says that fp8 has a [-448, 448] range, while Torch/Triton is using `float8_e4m3fnuz`, which has the [-240, 240] range - I am a bit...
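For reference, `torch.finfo` confirms these ranges directly (a minimal check, independent of any specific build):

```python
import torch

# float8_e4m3fn   -> [-448, 448] (the OCP e4m3 format)
# float8_e4m3fnuz -> [-240, 240] (the "nuz" variant used on AMD)
for dt in (torch.float8_e4m3fn, torch.float8_e4m3fnuz):
    info = torch.finfo(dt)
    print(dt, info.min, info.max)
```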
1/3: https://github.com/huggingface/transformers/pull/33141/commits/5cb7d81547908dea660f525be5f77d9065b6edeb

Removed the `check_old_param` hack. The problem, however, is that `HQQLinear.state_dict` is huge, which makes loading extremely slow. So I added `run_expected_keys_check`, which skips those checks for `HQQLinear` params....
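The idea is roughly the following (a sketch only - the helper's actual signature in the PR may differ):

```python
def run_expected_keys_check(module) -> bool:
    # HQQLinear serializes many entries per layer (weights, scales, zeros,
    # meta), so validating each of its state-dict keys against the
    # expected-keys list dominates load time. Returning False here bypasses
    # that validation for HQQLinear only.
    return type(module).__name__ != "HQQLinear"
```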
2/3: Multi-gpu loading

Loading on multi-gpu looks like it's working fine. There's an issue with the BitBlas backend I just reported here. Forcing the input to use the same device...
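The device workaround amounts to something like this (hedged sketch; the helper name is illustrative and the real fix belongs in the backend):

```python
import torch

def to_weight_device(layer: torch.nn.Module, x: torch.Tensor) -> torch.Tensor:
    # Move the activation onto the same device as the layer's parameters
    # before dispatching, so the kernel never sees mixed devices.
    weight_device = next(layer.parameters()).device
    return x if x.device == weight_device else x.to(weight_device)
```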
@SunMarc

- Reverted back to `if isinstance(module, (torch.nn.Linear, HQQLinear)):`, but we still need that `run_expected_keys_check`, otherwise it breaks
- Updated the default `HqqConfig` params, since `quant_scale`, `quant_zero`, and `offload_meta`...
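With those params out of the default surface, a basic config reduces to the core knobs (hedged example; the values are illustrative, not the PR's defaults):

```python
from transformers import HqqConfig

# quant_scale, quant_zero, and offload_meta are no longer part of the
# defaults; nbits and group_size are the main remaining knobs.
quant_config = HqqConfig(nbits=4, group_size=64)
```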
Regarding this: https://github.com/huggingface/transformers/pull/33141#discussion_r1734388659

The issue is that, to remove that additional check, we would need all the `HQQLinear` state-dict keys for each layer in the list of expected keys....
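In other words, the alternative would be something like this sketch (hypothetical helper name), which is exactly the expensive part for large models:

```python
def expand_expected_keys(model, expected_keys):
    # Enumerate every HQQLinear state-dict key for every layer and append it
    # to the expected keys. HQQLinear serializes many entries per layer, so
    # this scan is what makes the approach slow.
    for name, module in model.named_modules():
        if type(module).__name__ == "HQQLinear":
            expected_keys.extend(f"{name}.{k}" for k in module.state_dict())
    return expected_keys
```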
There are TODOs to be done before merging:

- Check if adding a bias on architectures that don't support the bias by default breaks the hqq model loading.
- Trying...
> There are TODOs to be done before merging:
>
> * Check if adding a bias on architectures that don't support the bias by default breaks the hqq model...