lingvo
Question about low GPU utilization in async training for an ASR task
I have two GPU servers and one CPU server. I ran 4 trainers on each GPU server and tried placing the ps and controller jobs on different servers. But in every arrangement I could not get good GPU utilization on both GPU servers: one or both of them always show low GPU utilization (often dropping to 0%).
Please see the following two pictures (screenshots taken at different times during the same training run). The bottom part of each picture shows the server whose GPU utilization is unstable and often drops to 0%. The top part shows the server whose GPU utilization is normal (30%~70%).
Here are the placements I tried; in each one, at least one server has low GPU utilization. (192.168.68.51 and 192.168.68.69 are the GPU servers, 192.168.68.71 is the CPU server.)
192.168.68.51 runs 1 ps job and 4 trainer jobs; 192.168.68.69 runs 1 controller job and 4 trainer jobs. Result: 192.168.68.69 has low GPU utilization (often dropping to 0%); speed is 1.9s/step.
192.168.68.69 runs 1 ps job and 4 trainer jobs; 192.168.68.51 runs 1 controller job and 4 trainer jobs. Result: 192.168.68.51 has low GPU utilization (often dropping to 0%).
192.168.68.69 runs 1 ps job and 4 trainer jobs; 192.168.68.51 runs 4 trainer jobs; 192.168.68.71 runs 1 controller job. Result: 192.168.68.51 has low GPU utilization (often dropping to 0%).
192.168.68.51 runs 1 ps job and 4 trainer jobs; 192.168.68.69 runs 4 trainer jobs; 192.168.68.71 runs 1 controller job. Result: 192.168.68.69 has low GPU utilization (often dropping to 0%).
192.168.68.51 runs 1 ps job and 4 trainer jobs; 192.168.68.69 runs 1 ps job and 4 trainer jobs; 192.168.68.71 runs 1 controller job. Result: both 192.168.68.51 and 192.168.68.69 have low GPU utilization (often dropping to 0%), and training is also slower (6s/step).
192.168.68.51 runs 4 trainer jobs; 192.168.68.69 runs 4 trainer jobs; 192.168.68.71 runs 2 ps jobs and 1 controller job. Result: 192.168.68.51 has low GPU utilization (often dropping to 0%), and training is also slower (3s/step).
192.168.68.51 runs 1 ps job and 4 trainer jobs; 192.168.68.69 runs 4 trainer jobs; 192.168.68.71 runs 1 ps job and 1 controller job. Result: both 192.168.68.51 and 192.168.68.69 have low GPU utilization (often dropping to 0%), and training is also slower (6s/step).
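To make the placements above easier to compare, here is a minimal sketch that writes the first one down as a job-to-address mapping, the same dictionary shape that `tf.train.ClusterSpec` accepts. The helper name `build_cluster` and the port numbers are assumptions for illustration only; the actual ports and launch commands are not stated in this post.

```python
from collections import defaultdict

# Assumption: ports are not given in the post; 2220 is an arbitrary starting port.
BASE_PORT = 2220


def build_cluster(placement):
    """Turn {host: {job: count}} into {job: ["host:port", ...]}.

    Each task on a host gets the next free port on that host, so
    several jobs on the same machine never share an address.
    """
    next_port = defaultdict(lambda: BASE_PORT)
    cluster = defaultdict(list)
    for host, jobs in placement.items():
        for job, count in jobs.items():
            for _ in range(count):
                cluster[job].append(f"{host}:{next_port[host]}")
                next_port[host] += 1
    return dict(cluster)


# First placement from the list: ps on .51, controller on .69,
# and 4 trainers on each GPU server.
config_1 = build_cluster({
    "192.168.68.51": {"ps": 1, "trainer": 4},
    "192.168.68.69": {"controller": 1, "trainer": 4},
})

for job, addresses in config_1.items():
    print(job, addresses)
```

The other placements can be described the same way by changing the counts per host, for example `{"192.168.68.71": {"ps": 2, "controller": 1}}` for the CPU server in the placement with two ps jobs.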
I want to make full use of the GPUs on both GPU servers. Could you help me?
Can you share your command? Thanks.