Chien-Chin Huang comments

Results: 119 comments by Chien-Chin Huang

issues on llama3 compile + (async) TP + AC

@tianyu-l Good point, I forget this every time :( Yes, that may be the root cause. I'll verify that. Besides that, I think the other performance numbers are reasonable. I noticed that...

issues on llama3 compile + (async) TP + AC

For the CI issue, the run consistently fails in a CUDA driver API call that sets a virtual address, which makes me think that this may be related to machine...

Add compiled autograd tutorial

@yitingw1 When enabling CompiledAutograd, we should also enable the new CompiledDDP. Right now it is not enabled automatically. As for the overlapping, the answer is yes if the new CompiledDDP...

[WIP] Implement the feature to save unsharded weights at the last step

> This would require HF dependency in torchtitan core, right? Yes, unfortunately, that is the case. PyTorch also optionally depends on HF due to DCP. We can use the same...

Async TP integration test

Async TP test is enabled with H100.

[WIP]Implement llama4 HF format to DCP converter

lol, okay, do we want to keep the one in experiments or actually have the ones in the main scripts?

[WIP]Implement llama4 HF format to DCP converter

Okay, since you already merged them, I'll turn this PR into one that fixes the issues. But I'll keep the PR description since I would like to track the...

Model init with HuggingFace model

Yes, @mori360, as you have implemented this feature, OOM should be avoidable with `set_model_state_dict`. But we will need the state_dict to be loaded with DCP and `set_model_state_dict`.

Model init with HuggingFace model

@mingdianliu We are exploring an offline resharding converter to speed up the loading time; see https://github.com/pytorch/torchtitan/pull/1104.

[Experimental Feature] Huggingface model training

Is there a plan to deduplicate the code from the main TorchTitan? What's the motivation for duplicating `main.py` or `train()`? Is it because of `state_dict` loading? If so, we can...