
How to start multi-gpu training to accelerate the training process?

Open billplzhang opened this issue 5 years ago • 7 comments

I started training with train.py, but the process is lengthy since only one GPU is involved. So how do I start multi-GPU training? I have 8 GPUs in my machine. Thank you.

billplzhang avatar Aug 05 '19 01:08 billplzhang

Check here: https://github.com/pdoublerainbow/bisenet-tensorflow/blob/master/train.py#L94. If you want to make use of all the free GPUs, comment out that line. But are you sure all the GPUs are free for you? If not, you can change the method auto_select_gpu() a little (here: https://github.com/pdoublerainbow/bisenet-tensorflow/blob/master/utils/misc_utils.py#L32) to make use of the three GPUs with the most free memory.

pdoublerainbow avatar Aug 05 '19 05:08 pdoublerainbow
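For reference, the "three GPUs with the most free memory" change could be sketched like this: query the per-device free memory with `nvidia-smi` and keep the top-N indices. This is an illustrative standalone helper, not the repo's actual `auto_select_gpu()` code; the function name and signature are assumptions:

```python
import subprocess

def select_gpus(num_wanted, query_output=None):
    """Return the indices of the `num_wanted` GPUs with the most free memory.

    `query_output` is text of the form produced by
    `nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits`
    (one free-MiB figure per line); if None, nvidia-smi is invoked directly.
    """
    if query_output is None:
        query_output = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=memory.free",
             "--format=csv,noheader,nounits"]).decode()
    free = [int(line) for line in query_output.strip().splitlines()]
    # Rank GPU indices by free memory, largest first, then keep the top N.
    ranked = sorted(range(len(free)), key=lambda i: free[i], reverse=True)
    return sorted(ranked[:num_wanted])
```

A caller would then export the result before building the graph, e.g. `os.environ['CUDA_VISIBLE_DEVICES'] = ','.join(map(str, select_gpus(3)))`.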

> Check here: https://github.com/pdoublerainbow/bisenet-tensorflow/blob/master/train.py#L94. If you want to make use of all the free GPUs, comment out that line. But are you sure all the GPUs are free for you? If not, you can change the method auto_select_gpu() a little (here: https://github.com/pdoublerainbow/bisenet-tensorflow/blob/master/utils/misc_utils.py#L32) to make use of the three GPUs with the most free memory.

I simply changed line 94 to `os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3'` in order to make use of four GPUs, but it seems the training program only computes on GPU 0, not all four:

```
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     29528      C   python                                      8847MiB |
|    1     29528      C   python                                       209MiB |
|    2     29528      C   python                                       209MiB |
|    3     29528      C   python                                       209MiB |
```

billplzhang avatar Aug 05 '19 07:08 billplzhang
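This memory pattern is the expected symptom: `CUDA_VISIBLE_DEVICES` only controls which devices TensorFlow can see (and reserve memory on); it does not split the computation. Since train.py builds a single-tower graph, actual multi-GPU training needs explicit replication, e.g. with `tf.distribute.MirroredStrategy`. A minimal sketch using the Keras API (not this repo's training loop; the layer sizes here are illustrative):

```python
import tensorflow as tf

# MirroredStrategy replicates the model on every visible GPU (or the CPU
# if none are visible) and all-reduces gradients across the replicas.
strategy = tf.distribute.MirroredStrategy()
print("replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Variables created inside the scope are mirrored across all replicas.
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(8, 3, activation="relu",
                               input_shape=(64, 64, 3)),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(19),  # e.g. one logit per Cityscapes class
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
```

With this in place, `model.fit` automatically shards each batch across the replicas, so the global batch size should grow with the GPU count.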

You should run nvidia-smi in the terminal to check your GPUs' free memory. @billplzhang

pdoublerainbow avatar Aug 05 '19 07:08 pdoublerainbow

> You should run nvidia-smi in the terminal to check your GPUs' free memory. @billplzhang

Before I start train.py:

```
| NVIDIA-SMI 390.87                 Driver Version: 390.87                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:04:00.0 Off |                  N/A |
| 23%   28C    P8    15W / 250W |     10MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:05:00.0 Off |                  N/A |
| 23%   29C    P8    16W / 250W |     10MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108...  Off  | 00000000:08:00.0 Off |                  N/A |
| 46%   78C    P2    88W / 250W |   4433MiB / 11178MiB |     70%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 108...  Off  | 00000000:09:00.0 Off |                  N/A |
| 23%   35C    P8    15W / 250W |     10MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  GeForce GTX 108...  Off  | 00000000:85:00.0 Off |                  N/A |
| 23%   38C    P8    15W / 250W |   8876MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  GeForce GTX 108...  Off  | 00000000:86:00.0 Off |                  N/A |
| 23%   31C    P8    16W / 250W |   6848MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  GeForce GTX 108...  Off  | 00000000:89:00.0 Off |                  N/A |
| 23%   26C    P8    16W / 250W |     10MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  GeForce GTX 108...  Off  | 00000000:8A:00.0 Off |                  N/A |
| 26%   42C    P2    56W / 250W |  11056MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
```

After I start train.py (specifying 0, 1, 3 as visible):

```
| NVIDIA-SMI 390.87                 Driver Version: 390.87                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:04:00.0 Off |                  N/A |
| 23%   38C    P2    71W / 250W |   8859MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:05:00.0 Off |                  N/A |
| 23%   29C    P8    16W / 250W |    221MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108...  Off  | 00000000:08:00.0 Off |                  N/A |
| 43%   73C    P2    88W / 250W |   4435MiB / 11178MiB |     38%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 108...  Off  | 00000000:09:00.0 Off |                  N/A |
| 23%   35C    P8    15W / 250W |    221MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  GeForce GTX 108...  Off  | 00000000:85:00.0 Off |                  N/A |
| 23%   38C    P2    55W / 250W |   8876MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  GeForce GTX 108...  Off  | 00000000:86:00.0 Off |                  N/A |
| 23%   31C    P8    16W / 250W |   6848MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  GeForce GTX 108...  Off  | 00000000:89:00.0 Off |                  N/A |
| 23%   26C    P8    16W / 250W |     10MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  GeForce GTX 108...  Off  | 00000000:8A:00.0 Off |                  N/A |
| 23%   41C    P2    56W / 250W |  11056MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
```

Apparently, GPUs 1 and 3 are not used in the training process. So how do I utilize the three GPUs together?

billplzhang avatar Aug 05 '19 08:08 billplzhang

You just need to modify the batch size in configuration.py.

pdoublerainbow avatar Aug 05 '19 09:08 pdoublerainbow

> You just need to modify the batch size in configuration.py.

How should I modify it? A larger batch size?

billplzhang avatar Aug 05 '19 11:08 billplzhang

Yes, scale it according to your free memory size. @billplzhang

pdoublerainbow avatar Aug 05 '19 12:08 pdoublerainbow
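The usual rule of thumb when replicating across devices is that the global batch in configuration.py should be the per-GPU batch times the number of GPUs, bounded by what fits in each card's free memory. A tiny sketch (the names and values here are illustrative, not the repo's actual config keys):

```python
# Illustrative values, not the repo's defaults: each replica processes
# PER_GPU_BATCH samples per step, so the global batch grows with GPU count.
NUM_GPUS = 3
PER_GPU_BATCH = 4
BATCH_SIZE = NUM_GPUS * PER_GPU_BATCH  # the value to set in configuration.py
```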