Chien-Chin Huang
PTD DCP is designed to do online resharding for model and optimizer states. More specifically, if all the model parallelisms are PTD native (fully_shard, TP, PP), then the saved checkpoint...
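A minimal sketch (not TorchTitan's exact code) of how DCP save/load is typically used; because the state dict is DTensor-based, DCP can reshard it on load even if the loading job uses a different fully_shard/TP/PP layout than the one that saved it. The checkpoint path is hypothetical.

```python
import torch.distributed.checkpoint as dcp

# Save the DTensor-based model/optimizer state; DCP records sharding metadata.
state_dict = {"model": model.state_dict(), "optimizer": optimizer.state_dict()}
dcp.save(state_dict, checkpoint_id="checkpoint/step-1000")

# Later, possibly with a different world size / parallelism layout,
# dcp.load reshards the saved tensors into the current layout in place.
dcp.load(state_dict, checkpoint_id="checkpoint/step-1000")
```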
The only thing you can do is to `torch.load` the `.metadata` file. The actual data files cannot be unpickled without writing some code.
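A hedged sketch following the suggestion above: the `.metadata` file inside a DCP checkpoint folder is a pickled `Metadata` object, so it can be deserialized and inspected directly. The checkpoint path is hypothetical.

```python
import torch

# Inspect the checkpoint metadata (tensor names, shapes, sharding info).
# weights_only=False is needed because this is an arbitrary pickled object.
metadata = torch.load("checkpoint/step-1000/.metadata", weights_only=False)
print(metadata.state_dict_metadata.keys())
```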
`with_stack` has been causing timeouts because it significantly slows down profiling for large models. It's better to make it optional.
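A rough sketch of what making it optional could look like; `enable_stack` and `train_step()` are placeholders, not existing TorchTitan config options or functions.

```python
from torch.profiler import ProfilerActivity, profile

# Keep stack collection off for large models to avoid the slowdown/timeouts.
enable_stack = False

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    with_stack=enable_stack,
) as prof:
    train_step()  # placeholder for one profiled training iteration
```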
`with_stack` only.
@TJ-Solergibert `get_train_context()` specifies the SDPA backends (memory-efficient, cuDNN, and flash) when CP is enabled. We can extend it to cover the case where CP is not enabled as well. As for the error...
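A rough sketch of constraining the SDPA backends, similar in spirit to what `get_train_context()` does when CP is enabled; `model` and `inputs` are placeholders.

```python
from torch.nn.attention import SDPBackend, sdpa_kernel

# Restrict scaled dot-product attention to these backends inside the context.
with sdpa_kernel(
    [SDPBackend.FLASH_ATTENTION, SDPBackend.EFFICIENT_ATTENTION, SDPBackend.CUDNN_ATTENTION]
):
    output = model(inputs)
```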
In such a case, you probably need `autocast` in `get_train_context()`. We apply all the mixed precision within `parallelize_llama`. With only one GPU, nothing is going to be added on top...
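A minimal sketch, assuming a single-GPU run where the FSDP mixed-precision applied in `parallelize_llama` is not in effect, so bf16 autocast is added manually; `model` and `inputs` are placeholders.

```python
import torch

# Run forward/backward under bf16 autocast on a single GPU.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(inputs).mean()
loss.backward()
```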
Currently, TorchTitan only supports rank-0 profiling or all-ranks profiling. But this request is reasonable; do you want to submit a PR for it?
@weifengpy Do you have insights on this?
But how come the loss curve matches (only slightly off) without checkpointing, yet is way off with checkpointing?
How do we know that the nan/inf is not caused by bad training/modeling/hyper-parameters? Would it be better for training to stop when it encounters a bad loss and let the model author...
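A hedged illustration of the alternative behavior being suggested: stop the run when the loss is non-finite so the model author can investigate, rather than silently skipping or retrying the step.

```python
import torch

# Fail fast on a bad loss instead of continuing the run.
if not torch.isfinite(loss):
    raise RuntimeError(f"Encountered non-finite loss ({loss.item()}); stopping training.")
```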