Federico Cassano issues


Results: 12 issues of Federico Cassano

Gradient checkpointing for `grad_weight` in LFCE

### 🚀 The feature, motivation and pitch

The LFCE kernel allocates a `grad_weight` tensor: https://github.com/linkedin/Liger-Kernel/blob/a8fa3bb37850e89500261024ff47da0c626ab75f/src/liger_kernel/ops/fused_linear_cross_entropy.py#L47 This tensor then gets updated throughout the chunked loss calculation and finally used in the...
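To make the pattern described in this issue concrete, here is a minimal pure-Python sketch of a chunked linear + cross-entropy loss that allocates one `grad_weight` buffer up front and updates it chunk by chunk. It is not Liger-Kernel's code: the function name `chunked_linear_cross_entropy`, the list-of-lists layout, and the mean reduction are all assumptions made for illustration, and the sketch does not attempt the gradient checkpointing that the issue title proposes.

```python
import math


def chunked_linear_cross_entropy(inputs, weight, targets, chunk_size):
    """Hypothetical helper, for illustration only.

    inputs:  N x H list of lists (one hidden vector per token)
    weight:  V x H list of lists (one row per vocabulary entry)
    targets: N integer class indices
    Returns (mean loss, grad_weight) where grad_weight is V x H.
    """
    n, vocab, hidden = len(inputs), len(weight), len(weight[0])
    # Allocated once, then accumulated into by every chunk.
    grad_weight = [[0.0] * hidden for _ in range(vocab)]
    total_loss = 0.0
    for start in range(0, n, chunk_size):
        chunk_x = inputs[start:start + chunk_size]
        chunk_t = targets[start:start + chunk_size]
        for x, t in zip(chunk_x, chunk_t):
            # Logits for this token only; the full N x V matrix is never stored.
            logits = [sum(w * xi for w, xi in zip(row, x)) for row in weight]
            m = max(logits)  # subtract the max for numerical stability
            exps = [math.exp(z - m) for z in logits]
            denom = sum(exps)
            total_loss -= logits[t] - m - math.log(denom)
            for v in range(vocab):
                # d(mean loss)/d(logit_v) = (softmax_v - 1[v == t]) / n
                g = (exps[v] / denom - (1.0 if v == t else 0.0)) / n
                for h in range(hidden):
                    grad_weight[v][h] += g * x[h]
    return total_loss / n, grad_weight
```

In this sketch the loss and the accumulated `grad_weight` do not depend on `chunk_size`; only the amount of work handled per chunk changes, while the `grad_weight` buffer itself stays at the full V x H size for the whole computation.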

Dynamo error with large mesh + AdamWFp8 + bf16 stochastic rounding

Hello, I am getting the following error whenever I scale up training to 512 GPUs while using FSDP2 + AdamWFP8 + BF16 stochastic rounding:

```
torch._dynamo.exc.TorchRuntimeError: Dynamo failed to run...
```

bug

distributed

high priority

optimizer

triaged

triage review
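For readers unfamiliar with the BF16 stochastic rounding mentioned in the issue above, the following is a minimal pure-Python sketch of the general idea: random noise is added to the 16 low bits that bf16 discards from a float32, and the result is truncated. This is an illustration under stated assumptions, not the optimizer's actual implementation; the helper names are hypothetical, and NaN, infinity, and overflow near the largest finite value are not handled.

```python
import random
import struct


def float_to_bits(x):
    """Reinterpret a value (rounded to float32) as a 32-bit unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_float(b):
    """Reinterpret a 32-bit unsigned integer as a float32 value."""
    return struct.unpack("<f", struct.pack("<I", b))[0]


def stochastic_round_to_bf16(x, rng):
    """Hypothetical helper: round x to a bf16-representable value at random.

    bf16 keeps the top 16 bits of a float32. Adding uniform 16-bit noise
    before truncating rounds the magnitude up with probability equal to
    the discarded fraction, so the result is unbiased in expectation.
    """
    bits = float_to_bits(x)
    bits = (bits + rng.getrandbits(16)) & 0xFFFFFFFF
    return bits_to_float(bits & 0xFFFF0000)
```

A value that bf16 can already represent exactly is returned unchanged, while a value halfway between two neighbouring bf16 values rounds to each neighbour about half of the time, which is what lets small updates survive on average instead of being rounded away every step.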