fairseq-image-captioning

WARNING: attempting to recover from OOM in forward/backward pass

Open pzhren opened this issue 5 years ago • 10 comments

Hi, I ran into the following warning during the self-critical sequence training (SCST) stage: WARNING: attempting to recover from OOM in forward/backward pass. Is this because there is not enough GPU memory? It seems strange, because sometimes training runs normally.

pzhren avatar Nov 06 '20 02:11 pzhren

Is this because the GPU memory is not enough?

Yes, this is the reason. The settings documented in the README are appropriate for 2 GTX 1080 cards (8 GB each).

krasserm avatar Nov 06 '20 06:11 krasserm
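The warning text suggests that the training loop caught an out-of-memory error and skipped the affected batch instead of aborting, which would also explain why the problem only shows up on some steps. The following is a minimal sketch of that pattern in plain Python, not fairseq's actual code; `run_steps`, `train_step`, and `fake_step` are hypothetical names, and the "batch too large" condition is a simulated stand-in for a real CUDA OOM.

```python
def run_steps(batches, train_step):
    """Run train_step on each batch, skipping batches that raise an OOM-style error."""
    completed = 0
    ooms = 0
    for batch in batches:
        try:
            train_step(batch)
            completed += 1
        except RuntimeError as err:
            # Only out-of-memory errors are treated as recoverable.
            if "out of memory" not in str(err):
                raise
            ooms += 1
            print("WARNING: attempting to recover from OOM in forward/backward pass")
    return completed, ooms


def fake_step(batch):
    # Hypothetical stand-in for a forward/backward pass:
    # batches with more than 3 items simulate exceeding GPU memory.
    if len(batch) > 3:
        raise RuntimeError("CUDA out of memory (simulated)")


completed, ooms = run_steps([[1, 2], [1, 2, 3, 4], [1]], fake_step)
print(completed, ooms)  # 2 1
```

Under this assumption, an occasional warning means individual batches were dropped, while a warning on almost every step means the batch size does not fit in memory at all.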

In fact, I used 3 GPUs with 11 GB each. The strange thing is that sometimes it works normally, and sometimes it reports insufficient memory.

pzhren avatar Nov 06 '20 08:11 pzhren

Did you pre-train the model with CE loss before running SCST?

krasserm avatar Nov 06 '20 08:11 krasserm

Yes. I passed --max-sentences 2 and it ran normally, but I am worried that this will hurt performance. Will it have a significant impact? Also, why not use .checkpoint/checkpoint_best.pt? Is this not the best set of weights?

pzhren avatar Nov 06 '20 08:11 pzhren

Convergence improves with higher --max-sentences values (but also requires more memory). A value of 5 should work fine on 11 GB cards.
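To make the trade-off concrete, here is a small arithmetic sketch of how many sentences contribute to one parameter update. It assumes that --max-sentences is the per-GPU batch size and that fairseq's --update-freq option (gradient accumulation) is used as the multiplier; `sentences_per_update` is a hypothetical helper, and whether accumulation gives the same convergence benefit for SCST in this repository is not confirmed in this thread.

```python
def sentences_per_update(max_sentences, num_gpus, update_freq=1):
    """Sentences per parameter update: per-GPU batch size x GPUs x accumulation steps."""
    return max_sentences * num_gpus * update_freq


# Setting that ran without OOM in this thread: --max-sentences 2 on 3 GPUs
print(sentences_per_update(2, 3))                 # 6
# Suggested value for 11 GB cards: --max-sentences 5 on 3 GPUs
print(sentences_per_update(5, 3))                 # 15
# Assumption: same memory footprint as the first case, with --update-freq 3
print(sentences_per_update(2, 3, update_freq=3))  # 18
```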

Regarding checkpoint_best.pt, this is the checkpoint with the best CE validation loss, but not necessarily the one with the best CIDEr score (or any other evaluation metric). Checkpoint selection based on a user-defined metric should be automated, but I had other priorities in the past. I hope I can resume work on it sometime soon. Pull requests are welcome too, of course!

krasserm avatar Nov 06 '20 08:11 krasserm
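Until such automation exists, selecting a checkpoint by a user-defined metric can be done by hand once each checkpoint has been scored. The sketch below assumes the scores have already been computed separately; `select_best_checkpoint` is a hypothetical helper, not part of this repository, and the file names and numbers are made-up placeholders, not real results.

```python
def select_best_checkpoint(scores, higher_is_better=True):
    """Return the checkpoint name with the best score in a {name: score} mapping."""
    if not scores:
        raise ValueError("no checkpoint scores given")
    pick = max if higher_is_better else min
    return pick(scores, key=scores.get)


# Placeholder CIDEr scores, for illustration only (higher is better)
cider_scores = {
    "checkpoint20.pt": 1.10,
    "checkpoint21.pt": 1.14,
    "checkpoint_best.pt": 1.12,
}
print(select_best_checkpoint(cider_scores))  # checkpoint21.pt

# Placeholder validation losses, for illustration only (lower is better)
ce_losses = {"checkpoint20.pt": 2.31, "checkpoint21.pt": 2.35}
print(select_best_checkpoint(ce_losses, higher_is_better=False))  # checkpoint20.pt
```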

I see, thank you.

pzhren avatar Nov 06 '20 08:11 pzhren

[image] In fact, we found that while SCST was running, the memory usage of one of the GPUs suddenly became very high. There is a serious load imbalance between the GPUs. Do you have a good solution?

pzhren avatar Nov 06 '20 09:11 pzhren

Never mind, I found that memory usage gradually increases while training is running. This is the running state at --max-sentences 3. [image]

pzhren avatar Nov 06 '20 09:11 pzhren
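To track this kind of growth and imbalance over time rather than from screenshots, per-GPU memory can be logged with `nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader,nounits` and parsed. The sketch below assumes that output format (one `index, used, total` line per GPU, in MiB); `parse_gpu_memory` and `imbalance` are hypothetical helpers, and the sample text is made up for illustration, not taken from the images in this thread.

```python
def parse_gpu_memory(csv_text):
    """Parse 'index, memory.used, memory.total' lines (MiB) into a list of dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, used, total = (field.strip() for field in line.split(","))
        rows.append({"index": int(index), "used": int(used), "total": int(total)})
    return rows


def imbalance(rows):
    """Difference in used memory (MiB) between the most and least loaded GPU."""
    used = [row["used"] for row in rows]
    return max(used) - min(used)


# Made-up sample output for three 11 GB cards, for illustration only
sample = """0, 10500, 11178
1, 6200, 11178
2, 6100, 11178"""

rows = parse_gpu_memory(sample)
for row in rows:
    print(row["index"], "free MiB:", row["total"] - row["used"])
print("imbalance MiB:", imbalance(rows))  # imbalance MiB: 4400
```

Running the query at a fixed interval and comparing successive readings would show whether usage keeps climbing or levels off below the card's capacity.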

What is the frequency of OOMs when you run with --max-sentences 5 or 8?

krasserm avatar Nov 07 '20 05:11 krasserm

Almost every time. The strange thing is that it reports a memory error after SCST runs one or two.

pzhren avatar Nov 07 '20 06:11 pzhren