Maanu Grover
Should we add any info in the `package_checker/README.md` about this flag?
I think this is because of #271. You can certainly get rid of these files before creating the submission with something like: `find -type f -name 'scaling.json' -delete`. We...
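For anyone who prefers to preview what would be removed first, here is a rough Python sketch of the same cleanup. The helper name `delete_named_files` and its `dry_run` parameter are hypothetical and only for illustration; they are not part of the package checker.

```python
from pathlib import Path


def delete_named_files(root, name="scaling.json", dry_run=True):
    """Find regular files called `name` anywhere under `root`.

    Returns the sorted list of matching paths. The files are only
    deleted when dry_run is False. (Hypothetical helper, not part of
    any existing tool.)
    """
    matches = sorted(p for p in Path(root).rglob(name) if p.is_file())
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches
```

Calling `delete_named_files(".")` only lists the matches; passing `dry_run=False` deletes them, which mirrors what the `find ... -delete` one-liner does.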
Hi @PurvangL, I see you've closed this issue. Were you able to resolve it? I haven't had time to reproduce this issue with SFT, but I've encountered long init times...
Another note: in my experience, `sync_dist=True` is tricky with parallelism, especially pipeline parallelism. If some metrics are non-existent on some ranks (e.g., val_loss is only on the last pp stage, so...
Rerouted this change to `model_provider.py` and included it in commit e002b5c23c0f8681bbf42df1c3eb07ec27ff5446.
Hi @yangzhipeng1108, looking at the error,
> ```
> 2: File "/usr/local/lib/python3.10/dist-packages/lightning_fabric/plugins/environments/torchelastic.py", line 63, in world_size
> 2: return int(os.environ["WORLD_SIZE"])
> 2: File "/usr/lib/python3.10/os.py", line 680, in __getitem__
> 2:...
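The quoted frames show `int(os.environ["WORLD_SIZE"])` failing inside `os.py`, which usually means the variable is not set in that process (an assumption here, since the traceback is cut off). A minimal sketch of a check that fails with a clearer message; `read_world_size` is a hypothetical helper, not something `lightning_fabric` provides:

```python
import os


def read_world_size(env=None):
    """Read WORLD_SIZE from the environment with an explicit error.

    `env` defaults to os.environ; passing a dict makes this easy to
    try out. (Hypothetical helper for illustration only.)
    """
    env = os.environ if env is None else env
    value = env.get("WORLD_SIZE")
    if value is None:
        raise RuntimeError(
            "WORLD_SIZE is not set; check that the job was started by a "
            "launcher that exports it to every rank."
        )
    return int(value)
```

Running this at the top of the training script on each rank can help confirm whether the launcher is exporting the variable before Lightning tries to read it.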
@jbieniusiewi could you also add the same check for 2.0? The unfinished checkpoint filtering is [here](https://github.com/NVIDIA/NeMo/blob/main/nemo/lightning/resume.py#L112-L113) in 2.0.
~~@ashors1 Do we have/plan to have tests for AutoResume?~~ Looks like they are in `tests/lightning/test_nemo_logger.py`. @jbieniusiewi, one final request: can we include the same test you added for exp manager to...
Should wait to merge until after FP8 is verified.
Seems like we don't have the ability to save `.nemo` in 2.0, right?