Chien-Chin Huang

Results 119 comments of Chien-Chin Huang

@junjzhang Sounds good. I'm refactoring `train.py`. Let's discuss how we can make it more general. I don't expect that HF models or other models can simply adopt the original `train.py`, even...

Is the question about recovering from an OOM or debugging an OOM? There are two tools that can help you debug an OOM: 1. `--profiling.enable_memory_snapshot`, which will give you the memory...

The `torchelastic` comment is the same as my original comment: "restart the trainer as some computation and communication kernels may be in an undefined state." You can think `TorchElastic`...

I'm not sure if this should be a configurable option. Instead, if a model requires some parts to be frozen, it should be coded in the model. And our trainer...

It is reasonable to remove the FP8 subclass from the checkpointing. I'll submit a PR for this. I may need some help from the AO team to discuss how to remove FP8...

I believe we can support both formats. The issue is how we remove the FP8Tensor.

@vkuzo I can draft a PR to enable saving the state_dict to `.pt`. We don't need a hook for that. We just always convert the FP8 to the dtype users...

@vkuzo, @danielvegamyhre, @andrewor14 Please see the TODO in the code of https://github.com/pytorch/torchtitan/pull/1219. We just need to convert the FP8 tensor to a regular tensor in `_export_weights()`.
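
For illustration only, the conversion described in the comment above might be sketched as follows. This is not the code from the PR: `to_plain_dtype` is a hypothetical helper, plain `float32` tensors stand in for FP8 tensors, and `bfloat16` is an assumed target dtype. A real FP8 tensor subclass may need its own dequantization call, which is not shown here.

```python
import torch


def to_plain_dtype(state_dict, dtype=torch.bfloat16):
    """Hypothetical helper: copy a state dict, casting floating-point tensors to `dtype`."""
    exported = {}
    for name, value in state_dict.items():
        if torch.is_tensor(value) and value.is_floating_point():
            # Cast weights to the regular dtype chosen for export.
            exported[name] = value.to(dtype)
        else:
            # Leave integer tensors and non-tensor entries untouched.
            exported[name] = value
    return exported


# Stand-in weights; float32 is used here in place of an FP8 tensor subclass.
weights = {
    "layer.weight": torch.ones(2, 2, dtype=torch.float32),
    "step": torch.tensor(3),
}
exported = to_plain_dtype(weights)
```

Under these assumptions the exported dict holds only regular tensors, so it could be passed to `torch.save` without pulling in a tensor subclass.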

Taking a 1D mesh as the input has been on the roadmap, but there are two issues. 1. We need DeviceMesh unflatten support, for which the PR is being reviewed but...

Can we have an accuracy verification for this PR? I believe llama3 8B can reproduce the loss issue if the dataset doesn't resume correctly.