Question about low GPU utilization in async training in ASR task

Open iamxiaoyubei opened this issue 5 years ago • 1 comments

I have two GPU servers and one CPU server, and I run 4 trainers on each GPU server. I tried placing the ps and controller jobs on different servers, but in every case I could not get good GPU utilization on both GPU servers: at least one of them always shows low GPU utilization (often dropping to 0%).

Please see the following two pictures (screenshots taken at different times during the same training run). The bottom part of each picture shows the server whose GPU utilization is unstable and often drops to 0%. The top part shows the server whose GPU utilization is normal (30%~70%). image2019-7-22_11-51-53 image2019-7-22_11-51-13

Here are the layouts I tried. In each one, at least one server has low GPU utilization. (192.168.68.51 and 192.168.68.69 are the GPU servers; 192.168.68.71 is the CPU server.)

192.168.68.51 runs 1 ps job and 4 trainer jobs; 192.168.68.69 runs 1 controller job and 4 trainer jobs. Result: 192.168.68.69 has low GPU utilization (often drops to 0%); speed is 1.9 s/step.

192.168.68.69 runs 1 ps job and 4 trainer jobs; 192.168.68.51 runs 1 controller job and 4 trainer jobs. Result: 192.168.68.51 has low GPU utilization (often drops to 0%).

192.168.68.69 runs 1 ps job and 4 trainer jobs; 192.168.68.51 runs 4 trainer jobs; 192.168.68.71 runs 1 controller job. Result: 192.168.68.51 has low GPU utilization (often drops to 0%).

192.168.68.51 runs 1 ps job and 4 trainer jobs; 192.168.68.69 runs 4 trainer jobs; 192.168.68.71 runs 1 controller job. Result: 192.168.68.69 has low GPU utilization (often drops to 0%).

192.168.68.51 runs 1 ps job and 4 trainer jobs; 192.168.68.69 runs 1 ps job and 4 trainer jobs; 192.168.68.71 runs 1 controller job. Result: both 192.168.68.51 and 192.168.68.69 have low GPU utilization (often drops to 0%), and training is also slower (6 s/step).

192.168.68.51 runs 4 trainer jobs; 192.168.68.69 runs 4 trainer jobs; 192.168.68.71 runs 2 ps jobs and 1 controller job. Result: 192.168.68.51 has low GPU utilization (often drops to 0%), and training is also slower (3 s/step).

192.168.68.51 runs 1 ps job and 4 trainer jobs; 192.168.68.69 runs 4 trainer jobs; 192.168.68.71 runs 1 ps job and 1 controller job. Result: both 192.168.68.51 and 192.168.68.69 have low GPU utilization (often drops to 0%), and training is also slower (6 s/step).
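
To make these layouts easier to reproduce, here is a minimal sketch of how the first one (1 ps and 4 trainers on 192.168.68.51, 1 controller and 4 trainers on 192.168.68.69) could be written as a cluster-spec string. It assumes the `job=host:port,...@job2=host:port,...` format that, as far as I understand, Lingvo's `--cluster_spec` flag accepts. The ports, the job names as dictionary keys, and the helper name `build_cluster_spec` are hypothetical, since only the IP addresses are given above.

```python
# Hypothetical starting port: the post above lists only IP addresses.
BASE_PORT = 43000


def build_cluster_spec(layout):
    """Build a 'job=host:port,...@job2=host:port,...' string.

    `layout` maps a job name to a list of (host, task_count) pairs.
    Every task placed on the same host gets its own port, counting up
    from BASE_PORT, so no two tasks on one machine collide.
    """
    next_port = {}  # host -> next unused port on that host
    parts = []
    for job, placements in layout.items():
        addrs = []
        for host, count in placements:
            for _ in range(count):
                port = next_port.get(host, BASE_PORT)
                next_port[host] = port + 1
                addrs.append(f"{host}:{port}")
        parts.append(f"{job}={','.join(addrs)}")
    return "@".join(parts)


# First layout from the list above (job names are assumptions).
layout_1 = {
    "ps": [("192.168.68.51", 1)],
    "controller": [("192.168.68.69", 1)],
    "trainer": [("192.168.68.51", 4), ("192.168.68.69", 4)],
}

spec = build_cluster_spec(layout_1)
print(spec)
```

The other layouts differ only in the `(host, task_count)` pairs, so writing them as data like this keeps the comparison between runs explicit.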

I would like to make full use of the GPUs on both GPU servers. Could you help me?

iamxiaoyubei, Jul 22 '19 12:07

Can you share your command? Thanks.

shengzhang0222, Aug 19 '19 08:08