Federico Cassano

Results: 12 issues by Federico Cassano

### 🚀 The feature, motivation and pitch

The LFCE kernel allocates a `grad_weight` tensor: https://github.com/linkedin/Liger-Kernel/blob/a8fa3bb37850e89500261024ff47da0c626ab75f/src/liger_kernel/ops/fused_linear_cross_entropy.py#L47

This tensor then gets updated throughout the chunked loss calculation and finally used in the...

Hello, I am getting the following error whenever I scale up training to 512 GPUs while using FSDP2 + AdamWFP8 + BF16 stochastic rounding:

```
torch._dynamo.exc.TorchRuntimeError: Dynamo failed to run...
```

bug
distributed
high priority
optimizer
triaged
triage review