NMT inference OOM
Is your feature request related to a problem? Please describe.
- Inference does not support automatic batching by sequence length, so very long sentences cause OOM errors and make it difficult to increase the beam size.
- I also found that CUDA memory usage grows gradually as inference proceeds.
Describe the solution you'd like
- Support automatic batching according to sequence length.
- I have tried to work around the growing memory usage by calling `torch.cuda.empty_cache()` between batches, as sketched below.
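For reference, a minimal sketch of what I do now (the `model.translate()` call and the batch size are illustrative and stand in for whatever inference API you actually use):

```python
import torch

def translate_all(model, sentences, batch_size=32):
    """Translate `sentences` in fixed-size batches, freeing cached memory in between."""
    results = []
    with torch.no_grad():
        for i in range(0, len(sentences), batch_size):
            batch = sentences[i:i + batch_size]
            results.extend(model.translate(batch))
            # Release unused cached allocator blocks so memory held after a
            # long batch is not carried over to the next one.
            torch.cuda.empty_cache()
    return results
```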
Do you run out of memory even with batch size 1? If not, the easiest fix is to just reduce the batch size. Another thing you can do is to order your test set so that the longest sequences come first; that way, if you do run out of memory, it happens right at the start rather than mid-way through translating your test set.
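Something along these lines (illustrative only; whitespace token counts are a rough proxy for subword length) sorts the longest sentences first and lets you restore the original order afterwards:

```python
def sort_longest_first(sentences):
    """Return sentences sorted longest-first plus the permutation used."""
    # Whitespace token count is a rough proxy for the tokenized length.
    order = sorted(range(len(sentences)),
                   key=lambda i: len(sentences[i].split()),
                   reverse=True)
    return [sentences[i] for i in order], order

def restore_order(translations, order):
    """Put translations back into the original sentence order."""
    out = [None] * len(translations)
    for pos, idx in enumerate(order):
        out[idx] = translations[pos]
    return out
```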
We don't have plans yet for specifying inference batch sizes based on tokens, but if you're able to implement this, we would welcome a pull request!
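For anyone interested in picking this up, here is a rough sketch of what token-count-based batching could look like (whitespace token counts stand in for subword lengths, and the `max_tokens` budget is arbitrary); this is not an existing NeMo API:

```python
def batch_by_tokens(sentences, max_tokens=4000):
    """Group sentence indices so each padded batch stays under a token budget."""
    # Sort by length so sentences of similar length share a batch (less padding).
    order = sorted(range(len(sentences)), key=lambda i: len(sentences[i].split()))
    batches, current, current_max = [], [], 0
    for idx in order:
        length = len(sentences[idx].split())
        new_max = max(current_max, length)
        # A padded batch costs roughly (batch size x longest sentence) tokens.
        if current and new_max * (len(current) + 1) > max_tokens:
            batches.append(current)
            current, current_max = [], 0
            new_max = length
        current.append(idx)
        current_max = new_max
    if current:
        batches.append(current)
    return batches  # each batch is a list of indices into `sentences`
```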