fairseq-image-captioning
WARNING: attempting to recover from OOM in forward/backward pass
Hi, I encountered an error during the self-critical sequence training (SCST) stage: `WARNING: attempting to recover from OOM in forward/backward pass`. Is this because GPU memory is insufficient? It feels strange, because sometimes training runs normally.
> Is this because GPU memory is insufficient?
Yes, this is the reason. The settings documented in the README are appropriate for 2 GTX 1080 cards (8 GB each).
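For context, the warning text matches fairseq's OOM-recovery path: the trainer catches a CUDA out-of-memory RuntimeError raised during the forward/backward pass, drops the partial gradients, frees cached memory, and skips the batch. A minimal sketch of that pattern (my own paraphrase with a hypothetical `model(**batch)` interface, not the exact fairseq source):

```python
import torch

def train_step(model, batch, optimizer):
    try:
        loss = model(**batch)   # forward (hypothetical model interface)
        loss.backward()         # backward
        optimizer.step()
    except RuntimeError as e:
        if "out of memory" in str(e):
            print("WARNING: attempting to recover from OOM in forward/backward pass")
            optimizer.zero_grad()     # discard the partial gradients
            torch.cuda.empty_cache()  # return cached blocks to the allocator
        else:
            raise
```

Intermittent OOMs fit this picture: whether a given batch fits depends on the sentence lengths it happens to contain, so memory use fluctuates from step to step.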
Actually, I used 3 GPUs with 11 GB each. The strange thing is that sometimes training runs normally, and sometimes it reports that memory is insufficient.
Did you pre-train the model with CE loss before running SCST?
Yes. I passed --max-sentences 2 and it ran normally, but I'm worried this will affect performance. Will it have a significant impact? Also, why not use .checkpoint/checkpoint_best.pt? Isn't that the best weight?
Convergence improves with higher --max-sentences values, but higher values also require more memory. A value of 5 should work fine on 11 GB cards.
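If memory is the only obstacle to a larger batch, gradient accumulation is a common workaround; stock fairseq exposes it as --update-freq, though I have not checked how this repo's training scripts interact with that flag. The idea in generic PyTorch (model and micro-batches are hypothetical placeholders):

```python
def accumulated_step(model, micro_batches, optimizer):
    """Accumulate gradients over k small batches, then update once.

    Peak memory tracks the micro-batch size, while the optimizer
    sees the larger effective batch (k * micro-batch size).
    """
    optimizer.zero_grad()
    k = len(micro_batches)
    for batch in micro_batches:
        loss = model(**batch) / k  # scale so the summed gradients average
        loss.backward()
    optimizer.step()
```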
Regarding checkpoint_best.pt: this is the checkpoint with the best CE validation loss, but not necessarily the best CIDEr score (or any other evaluation metric). Checkpoint selection based on a user-defined metric should be automated, but I had other priorities in the past. I hope to resume work on it soon. Pull requests are welcome too, of course!
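As a stopgap until this is automated, you can score each saved checkpoint offline and pick the winner. A minimal sketch, assuming pycocoevalcap is installed and that you have already generated one tokenized predictions JSON per checkpoint (the file names here are hypothetical, not something the repo produces):

```python
import glob
import json

from pycocoevalcap.cider.cider import Cider

def cider_score(pred_file, refs):
    """Corpus CIDEr for one checkpoint's predictions: {image_id: ["caption"]}."""
    with open(pred_file) as f:
        preds = json.load(f)
    score, _ = Cider().compute_score(refs, preds)
    return score

# refs maps image ids to lists of reference captions (same keys as preds).
with open("valid-refs.json") as f:                # hypothetical file
    refs = json.load(f)

candidates = glob.glob("preds-checkpoint*.json")  # one file per checkpoint
best = max(candidates, key=lambda f: cider_score(f, refs))
print("best checkpoint (by CIDEr):", best)
```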
I see, thank you.
Actually, we found that while SCST is running, the memory load on one of the GPUs suddenly spikes; there is a serious memory imbalance between the GPUs. Do you have a good solution?
Don't worry, I found that memory usage gradually increases during training. This is the running state with --max-sentences 3.

What is the frequency of OOMs when you run with --max-sentences 5 or 8?
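In the meantime, you can log per-GPU memory from inside the training loop with standard torch.cuda introspection to see whether one device grows faster than the others (a small sketch of my own, not repo code):

```python
import torch

def log_gpu_memory(step):
    for d in range(torch.cuda.device_count()):
        alloc = torch.cuda.memory_allocated(d) / 2**30
        peak = torch.cuda.max_memory_allocated(d) / 2**30
        print(f"step {step} cuda:{d} allocated={alloc:.2f} GiB peak={peak:.2f} GiB")
```

Calling torch.cuda.reset_peak_memory_stats(d) after each log gives per-interval peaks instead of a running maximum.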
Almost every time. The strange thing is that it reports the memory error only after SCST has already run for one or two iterations.