When training on a single 4090, GPU utilization fluctuates a lot and the estimated training time is very long.

Open isksjsksk opened this issue 2 years ago • 4 comments

I find that training uses shared GPU memory. The larger num_worker is set, the more it uses. Is this the reason?

With num_worker=8 (screenshot): retrain_s0 - It 100 [TRAIN] [time ]: 1.0258257

With num_worker=16 (screenshot): retrain_s0 - It 100 [TRAIN] [time ]: 1.1624295

isksjsksk avatar Aug 24 '23 13:08 isksjsksk

I'm afraid that I cannot debug your training setup. The general advice is to check for CPU/GPU/IO bottlenecks.
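One way to act on this advice is to time, per iteration, how long the loop waits for data versus how long the training step itself takes. The following is a minimal sketch, not code from this repository: `profile_loop`, `loader`, and `step` are hypothetical names, where `loader` is any iterable of batches and `step` is a callable that runs one training iteration.

```python
import time

def profile_loop(loader, step, max_iters=100):
    """Split average time per iteration into data-wait and step parts.

    `loader` and `step` are placeholders (assumptions), not names from
    the training code discussed in this thread.
    """
    data_time = 0.0
    step_time = 0.0
    n = 0
    it = iter(loader)
    while n < max_iters:
        t0 = time.perf_counter()
        try:
            batch = next(it)   # time spent waiting for the next batch
        except StopIteration:
            break
        t1 = time.perf_counter()
        step(batch)            # one training iteration on this batch
        t2 = time.perf_counter()
        data_time += t1 - t0
        step_time += t2 - t1
        n += 1
    denom = max(n, 1)
    return {"iters": n, "data_s": data_time / denom, "step_s": step_time / denom}
```

If `data_s` dominates, the GPU is probably waiting on data loading (CPU or IO); if `step_s` dominates, the model computation is the limit. With PyTorch on a GPU, CUDA kernels run asynchronously, so `step` would need to end with `torch.cuda.synchronize()` for the split to be meaningful.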

hkchengrex avatar Aug 24 '23 20:08 hkchengrex

Thanks for your prompt reply. In greater detail, my CPU is "i9-13900KF", my GPU is "NVIDIA GeForce RTX 4090", my hard disk is "Kingston KC3000 PCIe 4.0 NVMe M.2 SSD", and my memory is "ADATA DDR5 6000MHz 32GB x 2". When I use iotop to monitor IO performance, I find that the Total DISK READ is never more than 1000 K/s. So I think an IO bottleneck is the problem, and I am really confused about that.

isksjsksk avatar Aug 25 '23 03:08 isksjsksk

When I train on stage 3, the speed is quite fast and the estimated time is about 0.11. I wonder why there is such a huge difference between stage 0 and stage 3. Is the reason that my CPU is not powerful enough? And do you have any advice on speeding up training in stage 0 without losing performance? Thanks for your excellent work!

isksjsksk avatar Aug 25 '23 15:08 isksjsksk

Sounds like a CPU bottleneck, but your CPU is not at full load. Perhaps try increasing the number of data loaders?
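Since the timings earlier in this thread were slightly worse with num_worker=16 than with num_worker=8, it may be worth measuring several settings rather than only going up. Below is a rough sketch of such a sweep. `sweep_workers` and `make_loader` are hypothetical names: `make_loader(n)` is assumed to return an iterable of batches built with `n` workers (with PyTorch, it would typically wrap `torch.utils.data.DataLoader` with `num_workers=n`).

```python
import time

def sweep_workers(make_loader, worker_counts, iters=50):
    """Return average seconds per batch for each worker count.

    `make_loader` is a hypothetical factory (an assumption, not part of
    the training code discussed here): it takes a worker count and
    returns an iterable of batches.
    """
    results = {}
    for n in worker_counts:
        it = iter(make_loader(n))
        seen = 0
        start = time.perf_counter()
        for _ in range(iters):
            try:
                next(it)       # only data loading is timed, no training step
            except StopIteration:
                break
            seen += 1
        elapsed = time.perf_counter() - start
        results[n] = elapsed / seen if seen else float("nan")
    return results
```

The first batch of a real data loader also includes worker startup cost, so a larger `iters` value gives a fairer comparison between settings.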

hkchengrex avatar Aug 25 '23 16:08 hkchengrex

It's quite strange that without making any modifications to the code or the experimental environment, the training process has significantly sped up now, with fluctuations ranging from 0.4+ to 0.6+.

isksjsksk avatar Mar 22 '24 03:03 isksjsksk

I reconfigured the environment, and now the speed has improved. The S0 is around 0.3.

isksjsksk avatar Mar 25 '24 11:03 isksjsksk