Jiani Wang
> Hi [@wwwjn](https://github.com/wwwjn), thank you for your support! I’ve prepared a reproduction script based on the [latest main branch](https://github.com/pytorch/torchtitan/commit/a44dff1a41f6c0d8e504919ce4b1b50d05102f01), along with some instructions. Here is the [code](https://github.com/speed1313/torchtitan/commit/08fb43479eedc5016383cd4db628b9d38465a25d#diff-78cc79291e219fdfd73f7b1c7b5d442d1346f821b8add32e0e02d62597fe0ee5). I ran it...
Hi folks, here are more updates and observations:
## Hypothesis:
- Tok_embedding weight uneven sharding (see the shard-shape check sketched below)
- Optimizer state is not saved correctly
- FSDPModule wrap problem

## Code: https://github.com/wwwjn/torchtitan/pull/new/debug_vocab_size
- Use...
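For the uneven-sharding hypothesis, here is a minimal sketch of how one might print the per-rank shard shape of `tok_embeddings.weight`. It assumes FSDP2-style sharding where parameters are DTensors and that `torch.distributed` is already initialized; it is illustrative, not the code in the debug branch above.

```python
import torch
import torch.distributed as dist
from torch.distributed.tensor import DTensor  # torch.distributed._tensor on older PyTorch


def report_shard_shapes(model: torch.nn.Module, suffix: str = "tok_embeddings.weight") -> None:
    """Print global vs. local shard shape of matching DTensor parameters on this rank."""
    rank = dist.get_rank()
    for name, param in model.named_parameters():
        if name.endswith(suffix) and isinstance(param, DTensor):
            local = param.to_local()
            print(
                f"[rank {rank}] {name}: global={tuple(param.shape)} "
                f"local={tuple(local.shape)} placements={param.placements}"
            )
```

If the local shapes differ across ranks in an unexpected way, that would support the uneven-sharding hypothesis.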
Update: Verified it works well! Thanks @weifengpy for helping fix the issue.
> Hey, I've already implemented something WIP here: [janEbert@72c7b4e](https://github.com/janEbert/torchtitan/commit/72c7b4e5521de7c336b51dca22fcd75f50aa8f25) > > The main part is the use of the `_ScheduleForwardOnly` pipeline schedule for evaluation, the rest is just using the...
Generally speaking, yes, we would love to support a generalized validation function. We would like to add functions to train.py, e.g. `eval_step()` and `eval()`. But this work might need more...
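As a rough sketch of the shape such functions could take (names like `eval_step`, `evaluate`, `val_dataloader`, and `max_steps` are illustrative, not a committed torchtitan API, and pipeline-parallel schedules would need extra handling):

```python
import torch


@torch.no_grad()
def eval_step(model, batch, loss_fn):
    """Run one forward-only step and return the loss."""
    inputs, labels = batch
    logits = model(inputs)
    return loss_fn(logits.flatten(0, 1), labels.flatten(0, 1))


def evaluate(model, val_dataloader, loss_fn, max_steps: int = 50):
    """Average the eval loss over up to max_steps batches."""
    model.eval()
    losses = []
    for step, batch in enumerate(val_dataloader):
        if step >= max_steps:
            break
        losses.append(eval_step(model, batch, loss_fn))
    model.train()
    return torch.stack(losses).mean()
```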
> [@wwwjn](https://github.com/wwwjn) Do you think your implementation is generalized enough to put it to the core train.py? Currently we only run the eval on Rank0 (which is not generalized to...
Yes, this is related to #1194
@CarlosGomes98 I did some tests before #1195 got merged. There are several sources of non-determinism in the loss (ideally, we should be able to reproduce the loss with `--training.deterministic` enabled): 1....
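For reference, a deterministic-training setup typically enables knobs like the following; this is a sketch of the standard PyTorch settings, not necessarily torchtitan's exact `set_deterministic()` code.

```python
import os
import torch


def set_deterministic(seed: int = 0) -> None:
    """Seed all RNGs and enable the usual PyTorch determinism knobs."""
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Some deterministic cuBLAS kernels require this workspace setting.
    os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")
```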
I also dug into all the states (optimizer, model weights, dataloader, lr_scheduler) and calculated hashes before saving and after loading at step 6. Here's an example of how I calculated the hash:...
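The exact helper isn't shown in this excerpt, but a minimal version of such a hash could look like the sketch below. It assumes a flat dict of plain tensors (e.g. `model.state_dict()`); DTensor shards would need to be hashed locally or materialized with `full_tensor()` first.

```python
import hashlib
import torch


def state_dict_hash(state_dict: dict) -> str:
    """Deterministically hash a flat state dict of tensors."""
    h = hashlib.sha256()
    for key in sorted(state_dict.keys()):
        value = state_dict[key]
        h.update(key.encode())
        if isinstance(value, torch.Tensor):
            # Reinterpret as raw bytes so dtypes like bfloat16 hash cleanly.
            data = value.detach().cpu().contiguous().reshape(-1).view(torch.uint8)
            h.update(data.numpy().tobytes())
        else:
            h.update(repr(value).encode())
    return h.hexdigest()


# e.g. compare state_dict_hash(model.state_dict()) right before saving at step 6
# and again right after loading the checkpoint.
```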
Adding more comments offline: the RNG state change is deterministic, since we call set_deterministic() every time we initialize a trainer.
Run1: without checkpoint load/save
- [On rank 0] We set...
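To compare RNG states across runs (e.g. on rank 0 at a given step), one could fingerprint them with a hypothetical helper like this:

```python
import hashlib
import torch


def rng_state_fingerprint() -> str:
    """Hash the current CPU (and CUDA, if available) RNG state for cross-run comparison."""
    h = hashlib.sha256()
    h.update(torch.get_rng_state().numpy().tobytes())
    if torch.cuda.is_available():
        h.update(torch.cuda.get_rng_state().numpy().tobytes())
    return h.hexdigest()
```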