MOTRv2 Questions about Multi-gpu training


Open hahapt opened this issue 2 years ago • 2 comments

When I try to train on 4 V100 GPUs, I get a CUDA out-of-memory error during epoch 0. I set nproc_per_node=4 in train.sh. Is there anything wrong?

Dec 28 '22 12:12 hahapt

We tried training on 8 V100 GPUs without checkpointing and did not encounter a CUDA OOM issue.

If the OOM occurs in the middle of an epoch, MOTR may be accumulating too many false-positive track queries, which results in large decoder memory consumption. You may try the --use_checkpoint argument on V100 GPUs as well.
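To give a rough feel for why extra track queries matter, here is a back-of-the-envelope sketch. It assumes standard multi-head self-attention among the decoder queries, 8 heads, and float32 values; the function name `self_attention_bytes` and all the numbers are illustrative assumptions, not measurements from MOTRv2.

```python
def self_attention_bytes(num_queries, num_heads=8, batch_size=1, bytes_per_value=4):
    """Bytes needed to hold one layer's query-to-query attention weights.

    Assumption: standard multi-head self-attention, where each head stores
    a (num_queries x num_queries) weight matrix per sample.
    """
    return batch_size * num_heads * num_queries * num_queries * bytes_per_value


if __name__ == "__main__":
    # Doubling the number of queries quadruples this particular term.
    for n in (300, 600, 1200):
        mib = self_attention_bytes(n) / 2**20
        print(f"{n:5d} queries -> {mib:7.2f} MiB per decoder layer")
```

This covers only one term, but it is a quadratic one, and during training it is kept for the backward pass in every decoder layer. Activation checkpointing generally trades that stored memory for recomputation, which is presumably why --use_checkpoint is suggested here.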

Dec 29 '22 16:12 zyayoung

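This problem occurs right at the start, so we cannot begin training at all. How can we scale things down to fit within the available GPU memory?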

Jul 09 '23 03:07 Paige-Norton
