Stark
How to train on two cards?
Hi, I used two cards to train my model following your instructions, and the following error occurred:
File "/home/UserDirectory/hongshengz/anaconda3/lib/python3.9/site-packages/torch/autograd/__init__.py", line 130, in backward
    Variable._execution_engine.run_backward(
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [256]] is at version 7; expected version 6 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
Restarting training from last epoch ...
But it works normally when I use a single card.
@hongsheng-Z Hi, the training command on N GPUs should look like "python tracking/train.py --script stark_st1 --config baseline --save_dir . --mode multiple --nproc_per_node N". However, judging from your error message, there may be another problem. Please track it down with the help of "with torch.autograd.set_detect_anomaly(True)", which reports the forward operation whose gradient could not be computed.
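As a rough illustration of how anomaly detection surfaces this kind of error, here is a minimal sketch on a toy CPU tensor. It is not taken from the STARK code; the function names `inplace_bug_demo` and `out_of_place_demo` are hypothetical, and the in-place `add_` is only an assumed stand-in for whatever operation modifies the [256] tensor in the real model.

```python
import torch


def inplace_bug_demo():
    """Trigger the in-place error on a toy CPU tensor and return its message."""
    w = torch.ones(3, requires_grad=True)
    # Anomaly mode should wrap the forward pass too, so autograd can record
    # where each operation was created.
    with torch.autograd.set_detect_anomaly(True):
        y = w.exp()      # exp() saves its output for the backward pass
        y.add_(1.0)      # in-place edit bumps the saved tensor's version
        try:
            y.sum().backward()
        except RuntimeError as err:
            return str(err)
    return None


def out_of_place_demo():
    """Same computation with an out-of-place add; backward succeeds."""
    w = torch.ones(3, requires_grad=True)
    y = w.exp()
    y = y + 1.0          # new tensor, the saved exp() output is untouched
    y.sum().backward()
    return w.grad


if __name__ == "__main__":
    print(inplace_bug_demo())
    print(out_of_place_demo())
```

With anomaly detection enabled, PyTorch also emits a warning containing the traceback of the forward call that produced the modified tensor, which is usually the quickest way to locate the offending line in the training script. Anomaly mode slows training noticeably, so it is best turned off again once the problem is found.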