STCN
When training on a single 4090, GPU utilization fluctuates a lot and the estimated training time is very long.
I noticed that training consumes shared GPU memory, and the larger num_workers is set, the more it uses. Could this be the reason?
When setting num_worker=8
retrain_s0 - It 100 [TRAIN] [time ]: 1.0258257
When setting num_worker=16
retrain_s0 - It 100 [TRAIN] [time ]: 1.1624295
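The shared-memory growth with num_workers is expected: each PyTorch DataLoader worker is a separate process that hands prefetched batches back to the main process through shared memory, so the footprint scales roughly with the worker count. A minimal, pure-stdlib sketch of that effect (the batch shape and worker count here are made-up illustration values, not STCN's real configuration):

```python
from multiprocessing import shared_memory

# Hypothetical batch: float32, 8 RGB frames at 384x384
# (assumed numbers for illustration only).
batch_bytes = 8 * 3 * 384 * 384 * 4

# Pretend num_workers=4, each worker holding one prefetched batch
# in a shared-memory block visible to the main process.
blocks = [
    shared_memory.SharedMemory(create=True, size=batch_bytes)
    for _ in range(4)
]

total_mib = sum(b.size for b in blocks) / 2**20
print(f"~{total_mib:.0f} MiB of shared memory held by 4 workers")

# Release the blocks so nothing leaks after the demo.
for b in blocks:
    b.close()
    b.unlink()
```

So a higher num_workers trades shared memory (and CPU) for prefetch depth; it is normal for the footprint to rise, and not by itself a sign of a problem.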
I'm afraid that I cannot debug your training setup. The general advice is to check for CPU/GPU/IO bottlenecks.
Thanks for your prompt reply. In greater detail: my CPU is an i9-13900KF, my GPU is an NVIDIA GeForce RTX 4090, my hard disk is a Kingston KC3000 PCIe 4.0 NVMe M.2 SSD, and my memory is ADATA DDR5 6000 MHz 32 GB x 2. When I monitor IO performance with iotop, the total disk read never exceeds 1000 K/s. So I think an IO bottleneck is the problem, and I am really confused about that.
When I train on stage 3, the speed is quite fast and the estimated time per iteration is about 0.11. I wonder why there is such a huge difference between stage 0 and stage 3. Is the reason that my CPU is not powerful enough? And do you have any advice for speeding up stage 0 training without losing performance? Thanks for your excellent work!
Sounds like a CPU bottleneck, but your CPU is not at full load. Perhaps try increasing the number of data loader workers?
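One way to act on that advice is to sweep worker counts and watch where throughput stops improving. A pure-stdlib sketch of the idea, assuming a CPU-bound preprocessing step (`ProcessPoolExecutor` stands in for the DataLoader's worker pool, and `decode_sample` is a made-up CPU-heavy stand-in for real decoding/augmentation):

```python
import time
from concurrent.futures import ProcessPoolExecutor


def decode_sample(i):
    # Fake CPU-heavy decode/augmentation work (stand-in for the real loader).
    s = 0
    for k in range(50_000):
        s += (i * k) % 7
    return s


def measure(num_workers, n_samples=64):
    """Time how long n_samples take to 'load' with a given worker count."""
    start = time.perf_counter()
    with ProcessPoolExecutor(max_workers=num_workers) as pool:
        results = list(pool.map(decode_sample, range(n_samples)))
    return time.perf_counter() - start, results


if __name__ == "__main__":
    for w in (2, 4, 8):
        elapsed, _ = measure(w)
        print(f"workers={w}: {elapsed:.3f}s")
```

If the elapsed time keeps dropping as workers increase, the loader was CPU-bound; once it plateaus, adding more workers only costs memory.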
It's quite strange: without making any modifications to the code or the experimental environment, training has now significantly sped up, with the per-iteration time fluctuating between 0.4+ and 0.6+.
I reconfigured the environment, and now the speed has improved; stage 0 is around 0.3.