Chien-Chin Huang comments

Results: 119 comments of Chien-Chin Huang

[Experimental Feature] Huggingface model training

@junjzhang Sounds good. I'm refactoring `train.py`. Let's discuss how we can make it more general. I don't expect that HF models or other models can just adopt the original `train.py`, even...

OOM recovery under multi-node FSDP/HSDP

Is the question about recovering from an OOM or debugging an OOM? There are two tools that can help you debug an OOM: 1. `--profiling.enable_memory_snapshot` which will give you the memory...

OOM recovery under multi-node FSDP/HSDP

The `torchelastic` comment is the same as my original comment -- "restart the trainer as some computation and communication kernels may be in an undefined state.". You can think of `TorchElastic`...

Configure arbitrary frozen modules via config

I'm not sure if this should be a configurable option. Instead, if a model requires some parts to be frozen, it should be coded in the model. And our trainer...
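As a rough illustration of freezing parts "in the model" rather than via config, here is a minimal PyTorch sketch. `TinyModel`, `encoder`, and `head` are hypothetical names invented for this example; they are not taken from torchtitan.

```python
import torch
from torch import nn


class TinyModel(nn.Module):
    """Hypothetical model that freezes its own encoder at construction time."""

    def __init__(self) -> None:
        super().__init__()
        self.encoder = nn.Linear(8, 8)
        self.head = nn.Linear(8, 2)
        # The freezing decision lives in the model code, not in a trainer config.
        self.encoder.requires_grad_(False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.encoder(x))


model = TinyModel()
# A trainer can stay generic by optimizing only parameters that require grad.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.1)
```

With this layout the trainer needs no knowledge of which modules are frozen; it only filters on `requires_grad`.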

Can we support outputting checkpoints directly in .pt format?

It is reasonable to remove the FP8 subclass from the checkpointing. I'll submit a PR for this. I may need some help from the AO team to discuss how to remove FP8...

Can we support outputting checkpoints directly in .pt format?

I believe we can support both formats. The issue is how we remove the FP8Tensor.

Can we support outputting checkpoints directly in .pt format?

@vkuzo I can draft a PR to enable saving the state_dict to `.pt`. We don't need a hook for that. We just always convert the FP8 to the dtype users...

Can we support outputting checkpoints directly in .pt format?

@vkuzo, @danielvegamyhre, @andrewor14 Please see the TODO in the code of https://github.com/pytorch/torchtitan/pull/1219. We just need to convert the FP8 tensor to a regular tensor in `_export_weights()`.
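The conversion described above might look roughly like the following sketch. `to_regular_tensors` and `export_to_pt` are hypothetical helpers written for illustration; they are not the `_export_weights()` implementation from the PR. The sketch assumes a plain `.to(dtype)` cast is enough, which may not hold for a real FP8 tensor subclass that needs its own dequantization step.

```python
import io

import torch


def to_regular_tensors(state_dict, dtype=torch.bfloat16):
    """Hypothetical helper: cast floating-point tensors to a plain dtype.

    Assumption: `.to(dtype)` yields a regular tensor. A real FP8 tensor
    subclass may require a library-specific dequantize call instead.
    """
    converted = {}
    for name, value in state_dict.items():
        if isinstance(value, torch.Tensor) and value.is_floating_point():
            converted[name] = value.detach().to(dtype)
        else:
            # Non-tensor entries and integer tensors pass through unchanged.
            converted[name] = value
    return converted


def export_to_pt(state_dict, file, dtype=torch.bfloat16):
    """Hypothetical helper: save the converted state_dict with torch.save."""
    torch.save(to_regular_tensors(state_dict, dtype), file)


if __name__ == "__main__":
    buffer = io.BytesIO()
    export_to_pt({"w": torch.ones(2, 2), "step": 3}, buffer)
    buffer.seek(0)
    print(torch.load(buffer)["w"].dtype)
```

Passing a file path instead of a buffer works the same way, since `torch.save` accepts either.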

[torchtitan][replicate] experimenting new replicate integration with torchtitan

The 1D mesh as the input has been on the roadmap but there are two issues. 1. We need DeviceMesh unflatten support, for which the PR is being reviewed, but...

Fast dataset resume

Can we have an accuracy verification for this PR? I believe llama3 8B can reproduce the loss issue if the dataset doesn't resume correctly.