Sandeep Subramanian
Hi @victormilewski1994, the beam_search.py code that I have right now is non-functional; I would not recommend using it.
Explicit seeding for NeMo NMT is actually a bad idea because, assuming you don't finish an epoch before your job ends, you will end up seeing the same examples again...
Yes, that is an option. But I think `trainer.global_step` or the `is_being_restored` flag only becomes available to you after `trainer.fit()`, so you'd need to seed inside the model instead...
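For illustration, a minimal sketch of what seeding inside the model could look like, assuming you treat `trainer.global_step == 0` as "fresh run, not restored from a checkpoint". The class name and seed value are placeholders, not NeMo's actual API:

```python
import pytorch_lightning as pl


class MTModelWithConditionalSeeding(pl.LightningModule):
    """Placeholder class, not the real NeMo model."""

    def on_train_start(self):
        # Assumption: global_step is 0 only on a fresh run; a run restored
        # from a checkpoint resumes at the step stored in the checkpoint.
        if self.trainer.global_step == 0:
            pl.seed_everything(1234)
```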
Thanks for fixing that! Can we remove seeding for Megatron NMT? In Megatron NMT, if you use text_memmap or bin_memmap datasets, the dataloader order is already deterministic and so is...
Were you able to figure out how to seed only if the model isn't being restored?
So, the 22.05 container and `r1.10.0` work, but 22.06 and the current main branch do not?
Apologies, I forgot to hit submit on an earlier draft. You're seeing this issue because megatron-based NMT is not properly supported with tarred datasets while it is...
Yeah this should happen for all megatron-based models in the 22.07 container. You need to install apex from the commit specified in our README on top of the 22.07 container.
Specifically,

```
git clone https://github.com/NVIDIA/apex
cd apex
git checkout 3c19f1061879394f28272a99a7ea26d58f72dace
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" --global-option="--fast_layer_norm" ./
```
Do you run out of memory even with batch size 1? If not, the easiest fix is to just reduce the batch size. Another thing you can do is to...
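If it helps, here is a rough sketch of shrinking the batch size by editing the model config with OmegaConf before training. The key names (`micro_batch_size`, `global_batch_size`) and values are assumptions; check the config that ships with your NeMo version:

```python
from omegaconf import OmegaConf

# Stand-in config; in practice you would load the YAML that ships with the
# training script. Key names below are assumptions, not guaranteed to match.
cfg = OmegaConf.create({"model": {"micro_batch_size": 8, "global_batch_size": 64}})

# Reduce the per-GPU (micro) batch size and the effective global batch size.
cfg.model.micro_batch_size = 1
cfg.model.global_batch_size = 8

print(OmegaConf.to_yaml(cfg))
```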