Jiani Wang

Results 37 comments of Jiani Wang

> Hi [@wwwjn](https://github.com/wwwjn), thank you for your support! I’ve prepared a reproduction script based on the [latest main branch](https://github.com/pytorch/torchtitan/commit/a44dff1a41f6c0d8e504919ce4b1b50d05102f01), along with some instructions. Here is the [code](https://github.com/speed1313/torchtitan/commit/08fb43479eedc5016383cd4db628b9d38465a25d#diff-78cc79291e219fdfd73f7b1c7b5d442d1346f821b8add32e0e02d62597fe0ee5). I ran it...

Hi folks, here are more updates and observations:

## Hypotheses

- Tok_embedding weight is sharded unevenly
- Optimizer state is not saved correctly
- FSDPModule wrap problem

## Code

https://github.com/wwwjn/torchtitan/pull/new/debug_vocab_size

- Use...
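For the uneven-sharding hypothesis, a minimal sketch of the kind of per-rank check involved (a hypothetical helper; it assumes the model is wrapped with FSDP2's `fully_shard` so parameters are DTensors, and that the embedding is exposed as `model.tok_embeddings` as in torchtitan's Llama model):

```python
# Hypothetical sketch: print how the token-embedding weight is sharded on each rank.
import torch.distributed as dist

def report_embedding_sharding(model) -> None:
    weight = model.tok_embeddings.weight  # a DTensor under FSDP2 (fully_shard)
    local = weight.to_local()             # this rank's local shard
    print(
        f"[rank {dist.get_rank()}] global shape={tuple(weight.shape)} "
        f"local shape={tuple(local.shape)} placements={weight.placements}"
    )
```

If the vocab size does not divide evenly across the data-parallel ranks, the local shapes printed above would differ between ranks, which is the situation the hypothesis is about.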

Update: Verified it works well! Thanks @weifengpy for helping fix the issue.

> Hey, I've already implemented something WIP here: [janEbert@72c7b4e](https://github.com/janEbert/torchtitan/commit/72c7b4e5521de7c336b51dca22fcd75f50aa8f25) > > The main part is the use of the `_ScheduleForwardOnly` pipeline schedule for evaluation, the rest is just using the...

Generally speaking, yes, we would love to support a generalized validation function. We would love to add some functions to train.py, e.g. `eval_step()` and `eval()` (see the sketch below). But this work might need more...
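As a rough illustration of the shape such a function could take (names like `eval_step`, `val_dataloader`, and `loss_fn` are illustrative only, and pipeline-parallel runs using schedules such as `_ScheduleForwardOnly` mentioned above would need extra handling):

```python
import torch
import torch.distributed as dist

@torch.no_grad()
def eval_step(model, val_dataloader, loss_fn, device):
    """Forward-only pass over the validation set, returning the averaged loss."""
    model.eval()
    total = torch.zeros(1, device=device)
    num_batches = 0
    for input_ids, labels in val_dataloader:
        input_ids, labels = input_ids.to(device), labels.to(device)
        logits = model(input_ids)                         # (batch, seq, vocab)
        total += loss_fn(logits.flatten(0, 1), labels.flatten(0, 1))
        num_batches += 1
    total /= max(num_batches, 1)
    # average across data-parallel ranks so every rank reports the same number
    dist.all_reduce(total, op=dist.ReduceOp.AVG)
    model.train()
    return total.item()
```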

> [@wwwjn](https://github.com/wwwjn) Do you think your implementation is generalized enough to put into the core train.py? Currently we only run the eval on Rank0 (which is not generalized to...

@CarlosGomes98 I did some tests before #1195 got merged. There are several sources of non-determinism in the loss (ideally, we should be able to reproduce the loss with `--training.deterministic` enabled): 1....
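For context, a deterministic-mode switch like this typically toggles settings along these lines (an illustrative sketch, not the exact torchtitan implementation):

```python
import os
import random

import numpy as np
import torch

def set_deterministic(seed: int = 0) -> None:
    # seed every RNG the training loop may touch
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # force deterministic kernels (errors out on ops without a deterministic impl)
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # required for deterministic cuBLAS matmuls on CUDA >= 10.2
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
```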

I also dug into all the states (optimizer, model weights, dataloader, lr_scheduler) and calculated hashes before saving and after loading at step 6. Here's an example of how I calculated the hash:...
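The original snippet is truncated above; a minimal sketch of the idea, assuming a flat state dict whose values are plain tensors (DTensors would need `.to_local()` first):

```python
import hashlib
import io

import torch

def hash_state_dict(state_dict: dict) -> str:
    """Hash a flat state dict so it can be compared before saving and after loading."""
    hasher = hashlib.sha256()
    for key in sorted(state_dict.keys()):
        value = state_dict[key]
        hasher.update(key.encode())
        if isinstance(value, torch.Tensor):
            # hash the raw bytes so bf16 tensors work too
            t = value.detach().cpu().contiguous().view(-1)
            hasher.update(t.view(torch.uint8).numpy().tobytes())
        else:
            # fall back to serializing non-tensor entries (e.g. step counters)
            buf = io.BytesIO()
            torch.save(value, buf)
            hasher.update(buf.getvalue())
    return hasher.hexdigest()
```

Comparing the digest right before the checkpoint save and right after the load should give identical values if the round trip is lossless.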

Adding more comments offline: the RNG state change is deterministic, since we call set_deterministic() every time we initialize a trainer.

Run 1: without checkpoint load/save - [On rank 0] We set...
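One way to compare RNG states at matching points in two such runs is to hash and print them on every rank (`log_rng_state` is a hypothetical helper, not an existing torchtitan API):

```python
import hashlib

import torch
import torch.distributed as dist

def log_rng_state(tag: str) -> None:
    """Print a short hash of the CPU (and CUDA, if available) RNG state on this rank."""
    state = bytes(torch.get_rng_state().numpy())
    if torch.cuda.is_available():
        state += bytes(torch.cuda.get_rng_state().numpy())
    digest = hashlib.sha256(state).hexdigest()[:16]
    print(f"[rank {dist.get_rank()}] {tag}: rng state hash = {digest}")
```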