Chien-Chin Huang
PTD DCP is designed to do online resharding for model and optimizer states. More specifically, if all the model parallelisms are PTD native (fully_shard, TP, PP), then the saved checkpoint...
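A minimal sketch (not TorchTitan's exact code) of how DCP save/load is typically used; because the state dict is DTensor-based, DCP can reshard it on load even if the loading job uses a different fully_shard/TP/PP layout than the one that saved it. The checkpoint path is hypothetical.

```python
import torch.distributed.checkpoint as dcp

# Save the DTensor-based model/optimizer state; DCP records sharding metadata.
state_dict = {"model": model.state_dict(), "optimizer": optimizer.state_dict()}
dcp.save(state_dict, checkpoint_id="checkpoint/step-1000")

# Later, possibly with a different world size / parallelism layout,
# dcp.load reshards the saved tensors into the current layout in place.
dcp.load(state_dict, checkpoint_id="checkpoint/step-1000")
```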
The only thing you can do is to `torch.load` the `.metadata` file. The actual data files cannot be unpickled without writing some code.
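A hedged sketch following the suggestion above: the `.metadata` file inside a DCP checkpoint folder is a pickled `Metadata` object, so it can be deserialized and inspected directly. The checkpoint path is hypothetical.

```python
import torch

# Inspect the checkpoint metadata (tensor names, shapes, sharding info).
# weights_only=False is needed because this is an arbitrary pickled object.
metadata = torch.load("checkpoint/step-1000/.metadata", weights_only=False)
print(metadata.state_dict_metadata.keys())
```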
`with_stack` has been causing timeouts because it significantly slows down profiling for large models. It's better to make it optional.
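A rough sketch of what making it optional could look like; `enable_stack` and `train_step()` are placeholders, not existing TorchTitan config options or functions.

```python
from torch.profiler import ProfilerActivity, profile

# Keep stack collection off for large models to avoid the slowdown/timeouts.
enable_stack = False

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    with_stack=enable_stack,
) as prof:
    train_step()  # placeholder for one profiled training iteration
```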
`with_stack` only.
@TJ-Solergibert `get_train_context()` specifies the SDPA backends (memory-efficient, cuDNN, and flash) when CP is enabled. We can extend it to cover the case where CP is not enabled as well. As for the error...
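A rough sketch of constraining the SDPA backends, similar in spirit to what `get_train_context()` does when CP is enabled; `model` and `inputs` are placeholders.

```python
from torch.nn.attention import SDPBackend, sdpa_kernel

# Restrict scaled dot-product attention to these backends inside the context.
with sdpa_kernel(
    [SDPBackend.FLASH_ATTENTION, SDPBackend.EFFICIENT_ATTENTION, SDPBackend.CUDNN_ATTENTION]
):
    output = model(inputs)
```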
In such a case, you probably need `autocast` in `get_train_context()`. We apply all the mixed precision within `parallelize_llama`. With only one GPU, nothing is going to be added on top...
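A minimal sketch, assuming a single-GPU run where the FSDP mixed-precision applied in `parallelize_llama` is not in effect, so bf16 autocast is added manually; `model` and `inputs` are placeholders.

```python
import torch

# Run forward/backward under bf16 autocast on a single GPU.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(inputs).mean()
loss.backward()
```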
Currently, TorchTitan only supports rank-0 profiling or all-ranks profiling. But this request is reasonable; do you want to submit a PR for it?
@weifengpy Do you have insights on this?
But how come the loss curve matches (only slightly off) without checkpointing, yet is way off with checkpointing?
How do we know that the nan/inf is not caused by bad training/modeling/hyper-parameters? Would it be better for training to stop when it encounters a bad loss and let the model author...
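A hedged illustration of the alternative behavior being suggested: stop the run when the loss is non-finite so the model author can investigate, rather than silently skipping or retrying the step.

```python
import torch

# Fail fast on a bad loss instead of continuing the run.
if not torch.isfinite(loss):
    raise RuntimeError(f"Encountered non-finite loss ({loss.item()}); stopping training.")
```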