pytorch-ssd
Update train_ssd.py to support multiple GPUs
Hello @dusty-nv,
According to your suggestion in Support multiple GPU and the issue referenced there, @Mystique-orca and I have enabled multi-GPU support for training SSD-based object detection models with the PyTorch framework.
We have tested the modified train_ssd.py in our environment for object detection using 3 NVIDIA Tesla T4 GPUs. The GPUs to use can be passed via the --gpu-devices argument.
For example:
python train_ssd.py --dataset-type=voc --data=<path-to-dataset-dir> --model-dir=<path-to-model-dir> --batch-size=12 --epochs=400 --workers=0 --use-cuda=True --gpu-devices 0 1 2
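For reference, a flag that accepts a space-separated list of indices like --gpu-devices 0 1 2 is typically declared with argparse's nargs='+'. The following is only a minimal sketch of how such a flag could be parsed, not the exact code from this PR:

import argparse

parser = argparse.ArgumentParser()
# Accept one or more GPU indices, e.g. "--gpu-devices 0 1 2"
parser.add_argument('--gpu-devices', nargs='+', type=int, default=[0],
                    help='indices of the GPUs to train on')

args = parser.parse_args(['--gpu-devices', '0', '1', '2'])
print(args.gpu_devices)  # [0, 1, 2]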
Please let us know if we could provide more information.
Hope this will help the community!
Thanks
Thanks @NISHANTSHRIVASTAV - can you make this work on a single GPU (i.e. Jetson) just the same as it did previously? If it requires no changes to the CLI arguments/etc. for the single-GPU use case, I would merge it.
@dusty-nv Yes, it will work on a single GPU using the same CLI argument, i.e. --gpu-devices, where we just need to pass the index of the GPU.
For example:
For a single GPU:
python train_ssd.py --dataset-type=voc --data=<path-to-dataset-dir> --model-dir=<path-to-model-dir> --batch-size=4 --epochs=400 --workers=0 --use-cuda=True --gpu-devices 0
For 2 GPUs:
python train_ssd.py --dataset-type=voc --data=<path-to-dataset-dir> --model-dir=<path-to-model-dir> --batch-size=4 --epochs=400 --workers=0 --use-cuda=True --gpu-devices 0 1
For n GPUs:
python train_ssd.py --dataset-type=voc --data=<path-to-dataset-dir> --model-dir=<path-to-model-dir> --batch-size=4 --epochs=400 --workers=0 --use-cuda=True --gpu-devices 0 1 .. n
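As a side note (not from the PR itself), torch.nn.DataParallel splits each batch across the listed GPUs, and an index that does not exist on the machine will only fail at runtime, so a small sanity check like the sketch below can be useful. The variable names here are assumptions for illustration:

import torch

requested = [0, 1]                     # value parsed from --gpu-devices
available = torch.cuda.device_count()  # GPUs visible to PyTorch
missing = [d for d in requested if d >= available]
if missing:
    raise ValueError(f"GPU indices {missing} not available ({available} device(s) found)")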
The default should be --gpu-devices 0. I also meant that I would prefer it not to use net.DataParallel() if only 1 GPU is being used, as I don't want there to be any unintended side effects when running on Jetson systems (especially the memory-limited Nano 2GB device).
Hi @dusty-nv,
We have modified the multi-GPU SSD object detection training implementation to also work on the default single GPU (i.e. Jetson), per your suggestions, in the latest commit. For training with multiple GPUs it uses the net.DataParallel model, and for training with a single GPU (specifically on Jetson) it uses the default net model, without any change in the CLI arguments.
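For readers following along, the conditional wrapping described above could look roughly like the sketch below; the names net and gpu_devices are assumptions based on this thread, not the exact code in the commit:

import torch
from torch import nn

def wrap_model(net: nn.Module, gpu_devices):
    # Move the model to the first requested GPU (or CPU if CUDA is unavailable).
    if torch.cuda.is_available() and gpu_devices:
        device = torch.device(f'cuda:{gpu_devices[0]}')
    else:
        device = torch.device('cpu')
    net = net.to(device)
    # Wrap in DataParallel only when more than one GPU is requested,
    # so the single-GPU (Jetson) path behaves exactly as before.
    if torch.cuda.is_available() and len(gpu_devices) > 1:
        net = nn.DataParallel(net, device_ids=gpu_devices)
    return net, device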
Please let us know if we could provide more information.
Thanks
Hi @dusty-nv
As @NISHANTSHRIVASTAV mentioned, the code will work as it did before when the gpu-devices CLI argument is not provided or the default command is used. The net.DataParallel model will be used only when more than one GPU device is provided.
Can you let us know if this request can be merged? If there are any suggestions or changes required, we are open to incorporating those as well.
Many thanks!
Hello, I've been trying to apply these changes to my 1_train_ssd as I also want to run multi-GPU training, but I keep running into this error: RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_batch_norm)
Did you have a similar issue, or do you know where I'm making a mistake?
This is my first computer vision project and I would really appreciate your input! Thanks
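Not an official answer, but this particular error usually means the model's weights and the input batch are not all on device_ids[0] before DataParallel scatters the batch; DataParallel expects both the wrapped model and the inputs to live on the first device in the list. A toy sketch of the working pattern, with a stand-in model instead of the SSD net and assuming two visible GPUs:

import torch
from torch import nn

gpu_devices = [0, 1]                             # e.g. parsed from --gpu-devices
device = torch.device(f'cuda:{gpu_devices[0]}')  # DataParallel's primary device

# Stand-in model with a BatchNorm layer, moved to device_ids[0] before wrapping.
net = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8)).to(device)
net = nn.DataParallel(net, device_ids=gpu_devices)

# Inputs (and targets) also go to device_ids[0]; DataParallel scatters them itself.
images = torch.randn(4, 3, 32, 32, device=device)
out = net(images)
print(out.shape)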