
How to train on two cards?

Open hongsheng-Z opened this issue 3 years ago • 1 comments

Hi, I used two cards to train my model following your instructions, and the following error occurred:

File "/home/UserDirectory/hongshengz/anaconda3/lib/python3.9/site-packages/torch/autograd/__init__.py", line 130, in backward
    Variable._execution_engine.run_backward(
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [256]] is at version 7; expected version 6 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

Restarting training from last epoch ...

But training works normally when I use a single card.

hongsheng-Z avatar Dec 18 '21 05:12 hongsheng-Z

@hongsheng-Z Hi, the training command on N GPUs should look like "python tracking/train.py --script stark_st1 --config baseline --save_dir . --mode multiple --nproc_per_node N". However, judging from your error message, there may be another problem. Please track it down with the help of "with torch.autograd.set_detect_anomaly(True)".
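For reference, here is a minimal sketch of how anomaly detection is typically used to locate this kind of error. It is a toy example, not code from this repository: the function name backward_error and the tiny tensor are made up for illustration, and it does not claim to reproduce the cause of the two-card failure. It only shows that an in-place edit of a tensor saved for the backward pass raises the same RuntimeError, and that wrapping both the forward and backward pass in set_detect_anomaly(True) makes PyTorch also print the traceback of the forward operation involved.

```python
import torch


def backward_error(inplace: bool):
    """Return the autograd error message, or None if backward succeeds.

    Hypothetical helper for illustration only.
    """
    # Anomaly mode must cover the forward pass too, so that autograd can
    # record where each operation was created.
    with torch.autograd.set_detect_anomaly(True):
        w = torch.ones(3, requires_grad=True)
        y = torch.exp(w)      # exp keeps its output for the backward pass
        if inplace:
            y.add_(1.0)       # in-place edit bumps the saved tensor's version
        else:
            y = y + 1.0       # out-of-place: the saved tensor is untouched
        try:
            y.sum().backward()
        except RuntimeError as err:
            return str(err)
    return None


if __name__ == "__main__":
    print(backward_error(inplace=False))
    print(backward_error(inplace=True))
```

In a real training script, the same context manager would wrap the forward and backward calls of one iteration. Anomaly mode slows training down noticeably, so it is best enabled only while debugging.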

MasterBin-IIAU avatar Dec 22 '21 10:12 MasterBin-IIAU