Chien-Chin Huang
Can you also fix the linter error and the integration test error? I will try to verify with llama3.
I'm surprised Flux is affected, as it doesn't use FlexAttention. Can I get the command you used? Is this specific to AMD GPUs? Also, how many steps have you run?...
I checked the code: Flux doesn't seem to use attention.py and has its own train.py, so Flux shouldn't be affected by the refactor. @wwwjn is my understanding correct? Or do...
I think it is a different issue. Flux does not support CP yet.
I think this won't happen with the latest TorchTitan, since we added an `os.path.isdir(self.folder)` check. We probably need to use fsspec to delete files.
A related ask: https://github.com/pytorch/torchtitan/issues/916. Should we enable the checker by default and raise an assertion error if a NaN occurs?
@man2machine DeviceMesh is not designed to decide how researchers/users parallelize a model. Instead, researchers/users decide how to parallelize the model and use DeviceMesh to simplify the connectivity representation in the...
I tentatively enabled CP + SDPA for Qwen3 in https://github.com/pytorch/torchtitan/pull/2144, but I haven't verified the EP + CP part, which may need some verification.
The missing optimizer state for tied weights should have been fixed a while ago in https://github.com/pytorch/pytorch/pull/128685. Which PyTorch version are you using? @yzhangcs Updated: I checked the fix...
> * I'm wondering if disabling this option might significantly impact performance, especially w/o PP. No, it won't > * It would be great if PyTorch could provide full support...