Slow training speed with A100

Open DEVECLOVER opened this issue 1 year ago • 2 comments

I configured the environment according to the tutorial and tested the training example dtu-scan65, but the training speed is very slow. The tutorial, however, says that a 3090 can complete 20K iterations in 15 minutes: image

My training command, info, and progress are as follows:

ns-train neus-facto --pipeline.model.sdf-field.inside-outside False --vis viewer --experiment-name neus-facto-dtu65 sdfstudio-data --data data/sdfstudio-demo-data/dtu-scan65

image image image

As we can see, the GPU utilization is very low. I haven't made any changes to the source code, and I don't know what is causing this. Criticism and correction are welcome, thank you!

DEVECLOVER, Nov 03 '23 07:11

Hi, this is strange. On our cluster, we found that the code tries to use all available CPUs, and there might be some conflicts in the data loader, so training is much slower. You could try adding OMP_NUM_THREADS=4 to the training command. This solved our issue, but I am not sure whether it will help in your case.
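For reference, the suggestion above could be applied roughly as follows. This is only a sketch, assuming a POSIX shell and that ns-train is on PATH; the training command is copied from the original post, and the value 4 is just the one that worked on our cluster, so it may need tuning for other machines.

```shell
# Cap the number of OpenMP worker threads for everything started from this shell.
# 4 is the value suggested in this thread, not a universally good setting.
export OMP_NUM_THREADS=4

# Run the same training command as in the original post; it inherits the cap.
if command -v ns-train >/dev/null 2>&1; then
  ns-train neus-facto --pipeline.model.sdf-field.inside-outside False \
    --vis viewer --experiment-name neus-facto-dtu65 \
    sdfstudio-data --data data/sdfstudio-demo-data/dtu-scan65
else
  echo "ns-train not found on PATH; activate the sdfstudio environment first" >&2
fi
```

To limit the setting to a single run instead of the whole shell session, the variable can also be written as a prefix on the same line as the command, i.e. OMP_NUM_THREADS=4 ns-train ...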

niujinshuchong, Nov 05 '23 11:11

@niujinshuchong Thanks a lot. Adding OMP_NUM_THREADS=4 did improve the training speed, but GPU-Util is still only about 40%. Is there any other way to further improve GPU-Util? Thanks!

DEVECLOVER, Nov 07 '23 02:11