Hi,
Thanks for your great work.
I would like to train the model using multiple GPUs, but I receive this error:
"RuntimeError: CUDA error: invalid device ordinal. CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect."
by running this code:
CUDA_VISIBLE_DEVICES=0,1 singularity exec --nv --writable-tmpfs -B /work/myname/ /work/myname/pointr.sif bash ./scripts/dist_train.sh 2 13232 --config ./cfgs/PCN_models/PoinTr.yaml --exp_name example
Note that I do not have any problem when using a single GPU.
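For what it's worth, "invalid device ordinal" generally means the process asked for a GPU index that is outside the range of devices it can see. A minimal stdlib-only sketch of that condition (the helper names here are hypothetical, not PoinTr code; `torch.cuda.device_count()` inside the real process plays the role of the count computed below):

```python
import os

def visible_device_count(env=os.environ):
    """Number of GPUs exposed via CUDA_VISIBLE_DEVICES (hypothetical helper,
    mirroring what torch.cuda.device_count() would report in-process)."""
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return None  # unrestricted: all physical GPUs are visible
    value = value.strip()
    if not value:
        return 0
    return len(value.split(","))

def check_local_rank(local_rank, env=os.environ):
    """True if `model.to(local_rank)` could succeed; a rank >= the visible
    count is exactly what raises 'CUDA error: invalid device ordinal'."""
    count = visible_device_count(env)
    return count is None or local_rank < count

# With CUDA_VISIBLE_DEVICES=0,1, ranks 0 and 1 are valid but rank 2 is not:
env = {"CUDA_VISIBLE_DEVICES": "0,1"}
print(check_local_rank(1, env))  # True
print(check_local_rank(2, env))  # False
```

So if `args.local_rank` ends up >= the number of GPUs the container actually exposes, you get exactly this crash at `base_model.to(args.local_rank)`.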
Hi, can you provide more details about your issue, such as logs, CUDA version, and the number of GPUs on your server?
This is the complete error. The CUDA version is 10.2, and I have 4 Tesla V100 GPUs:
  File "main.py", line 68, in
    main()
  File "main.py", line 64, in main
    run_net(args, config, train_writer, val_writer)
  File "/work/semohammadi/PoinTr/tools/runner.py", line 26, in run_net
    base_model.to(args.local_rank)
  File "/home/semohammadi/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 899, in to
    return self._apply(convert)
  File "/home/semohammadi/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 570, in _apply
    module._apply(fn)
  File "/home/semohammadi/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 570, in _apply
    module._apply(fn)
  File "/home/semohammadi/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 570, in _apply
    module._apply(fn)
  File "/home/semohammadi/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 593, in _apply
    param_applied = fn(param)
  File "/home/semohammadi/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 897, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
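Since the single-GPU run works, one thing worth confirming is whether CUDA_VISIBLE_DEVICES actually reaches the processes inside the Singularity container (some setups require passing variables with the SINGULARITYENV_ prefix, e.g. SINGULARITYENV_CUDA_VISIBLE_DEVICES=0,1). A minimal sketch of that check, using a plain `sh -c` child process as a stand-in for the container:

```shell
# Set the restriction in the parent shell, then confirm a child process sees it.
export CUDA_VISIBLE_DEVICES=0,1
sh -c 'echo "child sees: CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"'

# Inside the real container, the analogous check would be something like:
#   singularity exec --nv /work/myname/pointr.sif \
#       bash -c 'echo $CUDA_VISIBLE_DEVICES; nvidia-smi -L'
# If the variable is empty or nvidia-smi lists a different number of GPUs
# than dist_train.sh launches processes for, that would explain the error.
```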