Sandeep Subramanian
Hi @victormilewski1994, the beam_search.py code that I have right now is non-functional; I would not recommend using it.
Explicit seeding for NeMo NMT is actually a bad idea because, assuming you don't finish an epoch before your job ends, you will end up seeing the same examples again...
Yes, that is an option. But I think `trainer.global_step` or the `is_being_restored` flag only becomes available to you after `trainer.fit()`, so you'd need to seed inside the model instead...
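For illustration, a minimal sketch of what seeding inside the model could look like, assuming you treat `trainer.global_step == 0` as "fresh run, not restored from a checkpoint". The class name and seed value are placeholders, not NeMo's actual API:

```python
import pytorch_lightning as pl


class MTModelWithConditionalSeeding(pl.LightningModule):
    """Placeholder class, not the real NeMo model."""

    def on_train_start(self):
        # Assumption: global_step is 0 only on a fresh run; a run restored
        # from a checkpoint resumes at the step stored in the checkpoint.
        if self.trainer.global_step == 0:
            pl.seed_everything(1234)
```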
Thanks for fixing that! Can we remove seeding for Megatron NMT? In Megatron NMT, if you use text_memmap or bin_memmap datasets, the dataloader order is already deterministic and so is...
Were you able to figure out how to seed only if the model isn't being restored?
So, the 22.05 container and `r1.10.0` work, but 22.06 and the current main branch do not?
Apologies, I forgot to hit submit on an earlier draft. You're seeing this issue because megatron-based NMT is not properly supported with tarred datasets while it is...
Yeah this should happen for all megatron-based models in the 22.07 container. You need to install apex from the commit specified in our README on top of the 22.07 container.
Specifically,

```
git clone https://github.com/NVIDIA/apex
cd apex
git checkout 3c19f1061879394f28272a99a7ea26d58f72dace
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" --global-option="--fast_layer_norm" ./
```
Do you run out of memory even with batch size 1? If not, the easiest fix is to just reduce the batch size. Another thing you can do is to...
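If it helps, here is a rough sketch of shrinking the batch size by editing the model config with OmegaConf before training. The key names (`micro_batch_size`, `global_batch_size`) and values are assumptions; check the config that ships with your NeMo version:

```python
from omegaconf import OmegaConf

# Stand-in config; in practice you would load the YAML that ships with the
# training script. Key names below are assumptions, not guaranteed to match.
cfg = OmegaConf.create({"model": {"micro_batch_size": 8, "global_batch_size": 64}})

# Reduce the per-GPU (micro) batch size and the effective global batch size.
cfg.model.micro_batch_size = 1
cfg.model.global_batch_size = 8

print(OmegaConf.to_yaml(cfg))
```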