Chien-Chin Huang
@tianyu-l Good point, I forget this every time :( Yes, that may be the root cause. I'll verify that. Besides that, I think the other performance numbers are reasonable. I noticed that...
For the CI issue, the error consistently occurs in a CUDA driver API call that sets a virtual address, which makes me think that this may be related to machine...
@yitingw1 When enabling CompiledAutograd, we should also enable the new CompiledDDP. Right now it is not automatically enabled. As for the overlapping, the answer is yes if the new CompiledDDP...
> This would require HF dependency in torchtitan core, right? Yes, unfortunately, that is the case. PyTorch also optionally depends on HF due to DCP. We can use the same...
Async TP test is enabled with H100.
lol, okay, do we want to keep the one in experiments or actually have the ones in the main scripts?
okay, since you already merged them, I'll repurpose this PR to fix the issues. But I'll keep the description of the PR since I would like to track the...
Yes, @mori360, as you have implemented this feature, OOM should be avoidable with `set_model_state_dict`. But we will need the state_dict to be loaded with DCP and `set_model_state_dict`.
@mingdianliu We are exploring an offline resharding converter to speed up the loading time, https://github.com/pytorch/torchtitan/pull/1104.
Is there a plan to deduplicate the code from the main TorchTitan? What's the motivation for duplicating `main.py` or `train()`? Is it because of `state_dict` loading? If so, we can...