pytorch-ssd
Update train_ssd.py to support multiple GPUs
Hello @dusty-nv,
According to your suggestion in Support multiple GPU and the issue referenced there, @Mystique-orca and I have enabled multi-GPU support for training SSD-based object detection models with the PyTorch framework.
We have tested the modified train_ssd.py in our environment for object detection using 3 NVIDIA Tesla T4 GPUs. The GPUs to use can be passed via the --gpu-devices argument.
For example:
python train_ssd.py --dataset-type=voc --data=<path-to-dataset-dir> --model-dir=<path-to-model-dir> --batch-size=12 --epochs=400 --workers=0 --use-cuda=True --gpu-devices 0 1 2
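For reference, a flag that accepts a space-separated list of indices like --gpu-devices 0 1 2 is typically declared with argparse's nargs='+'. The following is only a minimal sketch of how such a flag could be parsed, not the exact code from this PR:

import argparse

parser = argparse.ArgumentParser()
# Accept one or more GPU indices, e.g. "--gpu-devices 0 1 2"
parser.add_argument('--gpu-devices', nargs='+', type=int, default=[0],
                    help='indices of the GPUs to train on')

args = parser.parse_args(['--gpu-devices', '0', '1', '2'])
print(args.gpu_devices)  # [0, 1, 2]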
Please let us know if we could provide more information.
Hope this will help the community!
Thanks
Thanks @NISHANTSHRIVASTAV - can you make this work on a single GPU (i.e. Jetson) just the same as it did previously? If it requires no changes to the CLI arguments/etc. for the single-GPU use case, I would merge it.
@dusty-nv Yes, it will work on a single GPU using the same CLI argument, i.e. --gpu-devices, where we just need to pass the index of the GPU.
For example:
For a single GPU:
python train_ssd.py --dataset-type=voc --data=<path-to-dataset-dir> --model-dir=<path-to-model-dir> --batch-size=4 --epochs=400 --workers=0 --use-cuda=True --gpu-devices 0
For 2 GPUs:
python train_ssd.py --dataset-type=voc --data=<path-to-dataset-dir> --model-dir=<path-to-model-dir> --batch-size=4 --epochs=400 --workers=0 --use-cuda=True --gpu-devices 0 1
For n GPUs:
python train_ssd.py --dataset-type=voc --data=<path-to-dataset-dir> --model-dir=<path-to-model-dir> --batch-size=4 --epochs=400 --workers=0 --use-cuda=True --gpu-devices 0 1 .. n
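As a side note (not from the PR itself), torch.nn.DataParallel splits each batch across the listed GPUs, and an index that does not exist on the machine will only fail at runtime, so a small sanity check like the sketch below can be useful. The variable names here are assumptions for illustration:

import torch

requested = [0, 1]                     # value parsed from --gpu-devices
available = torch.cuda.device_count()  # GPUs visible to PyTorch
missing = [d for d in requested if d >= available]
if missing:
    raise ValueError(f"GPU indices {missing} not available ({available} device(s) found)")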
The default should be --gpu-devices 0. I also meant that I would prefer it not to use net.DataParallel() if only 1 GPU is being used, as I don't want there to be any unintended side effects when running on Jetson systems (especially the memory-limited Nano 2GB device).
Hi @dusty-nv,
We have modified the multi-GPU SSD object detection training implementation to also work on the default single GPU (i.e. Jetson), per your suggestions, in the latest commit. For training with multiple GPUs it uses the net.DataParallel model, and for training with a single GPU (specifically on Jetson) it uses the default net model, without any change in the CLI arguments.
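For readers following along, the conditional wrapping described above could look roughly like the sketch below; the names net and gpu_devices are assumptions based on this thread, not the exact code in the commit:

import torch
from torch import nn

def wrap_model(net: nn.Module, gpu_devices):
    # Move the model to the first requested GPU (or CPU if CUDA is unavailable).
    if torch.cuda.is_available() and gpu_devices:
        device = torch.device(f'cuda:{gpu_devices[0]}')
    else:
        device = torch.device('cpu')
    net = net.to(device)
    # Wrap in DataParallel only when more than one GPU is requested,
    # so the single-GPU (Jetson) path behaves exactly as before.
    if torch.cuda.is_available() and len(gpu_devices) > 1:
        net = nn.DataParallel(net, device_ids=gpu_devices)
    return net, device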
Please let us know if we could provide more information.
Thanks
Hi @dusty-nv
As @NISHANTSHRIVASTAV mentioned, the code will work as it did before when the gpu-devices CLI argument is not provided or the default command is used. The net.DataParallel model will be used only when more than one GPU device is provided.
Can you let us know if this request can be merged? If there are any suggestions or changes required, we are open to incorporating those as well.
Many thanks!
Hello, I've been trying to apply these changes to my 1_train_ssd as I also want to run multi-GPU training, but I keep running into this error: RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_batch_norm)
Did you have a similar issue, or do you know where I'm making a mistake?
This is my first computer vision project and I would really appreciate your input! Thanks
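Not an official answer, but this particular error usually means the model's weights and the input batch are not all on device_ids[0] before DataParallel scatters the batch; DataParallel expects both the wrapped model and the inputs to live on the first device in the list. A toy sketch of the working pattern, with a stand-in model instead of the SSD net and assuming two visible GPUs:

import torch
from torch import nn

gpu_devices = [0, 1]                             # e.g. parsed from --gpu-devices
device = torch.device(f'cuda:{gpu_devices[0]}')  # DataParallel's primary device

# Stand-in model with a BatchNorm layer, moved to device_ids[0] before wrapping.
net = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8)).to(device)
net = nn.DataParallel(net, device_ids=gpu_devices)

# Inputs (and targets) also go to device_ids[0]; DataParallel scatters them itself.
images = torch.randn(4, 3, 32, 32, device=device)
out = net(images)
print(out.shape)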