Chien-Chin Huang comments

Results: 119 comments of Chien-Chin Huang

Fast dataset resume

Can you also fix the linter error and the integration test error? I will try to verify with llama3.

After applying commit 6a3a9da, the training job runs out of GPU memory (OOM).

I'm surprised Flux is affected, as it doesn't use FlexAttention. Can I get the command you use? Is this specific to AMD GPUs? Also, how many steps have you run?...

After applying commit 6a3a9da, the training job runs out of GPU memory (OOM).

I checked the code: Flux doesn't seem to use attention.py and has its own train.py, so Flux shouldn't be affected by the refactor. @wwwjn is my understanding correct? Or do...

After applying commit 6a3a9da, the training job runs out of GPU memory (OOM).

I think it is a different issue. Flux does not support CP yet.

[Checkpointing] Using keep_latest_k setting results in failure when using external mounted drive

I think this won't happen with the latest TorchTitan as we added `os.path.isdir(self.folder)` to check. We probably need to use fsspec to delete files.
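A minimal sketch of the guarded purge described above, using only the standard library. The function name `purge_stale_checkpoints` and the `step-N` subfolder naming are assumptions for illustration, not TorchTitan's actual implementation. For a remote or externally mounted store, the `os` and `shutil` calls would be swapped for their fsspec equivalents.

```python
import os
import re
import shutil


def purge_stale_checkpoints(folder: str, keep_latest_k: int) -> list[str]:
    """Delete all but the newest `keep_latest_k` checkpoints; return deleted paths.

    Hypothetical helper: assumes each checkpoint is a subfolder named `step-N`.
    """
    # Guard: nothing to do if purging is disabled or the folder does not exist.
    if keep_latest_k <= 0 or not os.path.isdir(folder):
        return []

    steps = []
    for name in os.listdir(folder):
        match = re.fullmatch(r"step-(\d+)", name)
        path = os.path.join(folder, name)
        if match and os.path.isdir(path):
            steps.append((int(match.group(1)), path))

    # Sort by step number so the oldest checkpoints come first.
    steps.sort()
    stale = [path for _, path in steps[:-keep_latest_k]]
    for path in stale:
        # ignore_errors avoids crashing training on a flaky mounted drive.
        shutil.rmtree(path, ignore_errors=True)
    return stale
```

Checking `os.path.isdir` before listing means a missing or not-yet-mounted folder results in a no-op instead of an exception.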

[Feature] expose Torch Nan checker as configurable option in toml for those training at scale

A related ask: https://github.com/pytorch/torchtitan/issues/916. Should we add the checker by default and raise an assertion if a NaN happens?
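One way the configurable check could look, as a rough sketch only. The names `check_loss`, `NonFiniteLossError`, and the `enabled` flag are hypothetical and stand in for whatever the toml option would control. The sketch works on a plain Python float, so it assumes the loss has already been moved off the device.

```python
import math


class NonFiniteLossError(RuntimeError):
    """Raised when the training loss is NaN or infinite (hypothetical error type)."""


def check_loss(loss: float, step: int, enabled: bool = True) -> float:
    """Return `loss` unchanged, raising if it is non-finite and the check is enabled."""
    if enabled and not math.isfinite(loss):
        raise NonFiniteLossError(f"non-finite loss {loss!r} at step {step}")
    return loss
```

Raising a dedicated exception instead of using a bare `assert` keeps the check active when Python runs with `-O`, which strips assertions.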

Why is the ep mesh derived from a factoring of the dp mesh, instead of its own dimension?

@man2machine DeviceMesh is not designed to decide how researchers/users parallelize a model. Instead, researchers/users decide how to parallelize the model and use DeviceMesh to simplify the connectivity representation in the...

Context Parallel for Qwen3

I tentatively enabled CP + SDPA for Qwen3 in https://github.com/pytorch/torchtitan/pull/2144, but I haven't verified the EP + CP part, which may need some verification.

Support Gemma2 in torchtitan

The missing optimizer state for the tied weights should have been fixed a while ago in https://github.com/pytorch/pytorch/pull/128685. Can you point out which PyTorch version you are using? @yzhangcs Updated: I checked the fix...

Support Gemma2 in torchtitan

> * I'm wondering if disabling this option might significantly impact performance, especially w/o PP.

No, it won't.

> * It would be great if PyTorch could provide full support...