I am facing a similar issue with FSDP2 enabled:

```python
import torch
import torch.nn as nn

device = "cuda"

m = nn.Sequential(
    nn.Linear(4096, 4096 * 3, bias=False),
    nn.Linear(4096 * 3, 4096, bias=False),
).to(device=device, dtype=torch.bfloat16)
x = torch.randn(32000, 4096, device="cuda", dtype=torch.bfloat16)
```

With FP8:...
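(The FP8 portion of the snippet above is truncated. For readers following along, a minimal sketch of how a model like this is typically converted for FP8 training with torchao and wrapped with FSDP2 is shown below; the exact config flags and sharding layout are assumptions, not the original poster's code, and it requires an initialized distributed process group.)

```python
# Sketch only: assumes torchao float8 training + FSDP2 (torch >= 2.6),
# launched under torchrun so torch.distributed is already initialized.
import torch
import torch.nn as nn
from torch.distributed.fsdp import fully_shard  # FSDP2 entry point
from torchao.float8 import Float8LinearConfig, convert_to_float8_training

device = "cuda"
m = nn.Sequential(
    nn.Linear(4096, 4096 * 3, bias=False),
    nn.Linear(4096 * 3, 4096, bias=False),
).to(device=device, dtype=torch.bfloat16)

# Swap eligible nn.Linear modules for float8 training variants.
# enable_fsdp_float8_all_gather casts weights to FP8 before the
# FSDP all-gather (assumed configuration, adjust to your setup).
config = Float8LinearConfig(enable_fsdp_float8_all_gather=True)
convert_to_float8_training(m, config=config)

# Apply FSDP2 per layer, then once to the root module.
for layer in m:
    fully_shard(layer)
fully_shard(m)

x = torch.randn(32000, 4096, device=device, dtype=torch.bfloat16)
out = m(x)
```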
Got it, thanks a lot for the clarification. @danielvegamyhre, so if my understanding is correct, this is different from Transformer Engine's implementation, where activations might be stored in FP8?
Hi, is there any update on this?