
Questions about Multi-gpu training

Open hahapt opened this issue 2 years ago • 2 comments

When I try to train on 4 V100 GPUs, a CUDA out-of-memory error occurs during epoch 0. I set nproc_per_node=4 in train.sh. Is there anything wrong?

hahapt avatar Dec 28 '22 12:12 hahapt

We trained on 8 V100 GPUs without checkpointing and did not encounter a CUDA OOM issue.

If the OOM occurs in the middle of an epoch, MOTR may be accumulating too many false-positive track queries, which makes the decoder consume a large amount of memory. You may try the --use_checkpoint argument on V100 GPUs as well.
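
To get a feel for why extra track queries can drive decoder memory up, here is a rough back-of-the-envelope sketch. It counts only the decoder self-attention weight maps that are kept for the backward pass, and the head count, layer count, 300 detect queries, and 5-frame clip are illustrative assumptions, not values read from the MOTRv2 configs. Real usage also includes cross-attention, feed-forward activations, and the backbone, so treat the numbers as a trend rather than a measurement.

```python
# Rough estimate of decoder self-attention map memory as track queries grow.
# All default sizes are illustrative assumptions, not MOTRv2 settings.


def self_attention_map_bytes(num_queries, num_heads=8, num_layers=6,
                             bytes_per_elem=4):
    """Bytes for the query-to-query attention weights of one frame.

    Each layer holds a (num_heads, num_queries, num_queries) tensor,
    so the cost grows with the square of the query count.
    """
    return num_layers * num_heads * num_queries * num_queries * bytes_per_elem


def clip_bytes(num_detect_queries, num_track_queries, frames_per_clip):
    """Attention-map bytes for a whole clip of frames."""
    total_queries = num_detect_queries + num_track_queries
    return frames_per_clip * self_attention_map_bytes(total_queries)


if __name__ == "__main__":
    # Hypothetical setup: 300 detect queries, 5 frames per clip.
    for tracks in (0, 100, 400, 1600):
        mib = clip_bytes(300, tracks, 5) / 2**20
        print(f"{tracks:5d} track queries -> {mib:8.1f} MiB of attention maps")
```

Because the cost is quadratic in the number of queries, doubling the total query count quadruples this term. Gradient checkpointing trades extra compute for memory by recomputing such intermediate tensors during the backward pass instead of storing them.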

zyayoung avatar Dec 29 '22 16:12 zyayoung

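This problem occurs right at the start, which prevents us from beginning training at all. How can we reduce memory usage so that it fits within the available GPU memory?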

Paige-Norton avatar Jul 09 '23 03:07 Paige-Norton