[Discussion] Can't seem to get the GPU working for training; maybe I'm too early in the pipeline to see it?

Open dustyny opened this issue 3 years ago • 3 comments

I've been trying for the past week to get GPU acceleration working and I don't think I'm getting it. I've tried a dedicated VM, the container, and versions 2.3.0 and 2.2.2, and I've lost count of all the different CUDA/TensorFlow configurations I've tried. They all pass when I test TensorFlow directly for GPU access (I test PyTorch as well), so the GPU seems reachable. I thought I got it to work once: I saw GPU memory go up to 20 GB (half of the 40 GB available), but over 6 hours the most I ever saw was GPU utilization spiking to 15% for about a second. The next time I tried the same job on a different config, memory never went above 300 MB and GPU usage never went above 0%. When I run nvidia-smi I do see the Python job listed.
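
For reference, the check I've been running is along these lines (a minimal sketch, assuming TensorFlow 2.x; device names may differ per setup):

```python
# Minimal GPU visibility check, assuming TensorFlow 2.x.
import tensorflow as tf

# List the GPUs TensorFlow can see; an empty list means CPU-only execution.
gpus = tf.config.list_physical_devices("GPU")
print("Visible GPUs:", gpus)

if gpus:
    # Run a tiny op and confirm it was actually placed on the GPU.
    with tf.device("/GPU:0"):
        x = tf.random.normal((1024, 1024))
        y = tf.matmul(x, x)
    print("Test op ran on:", y.device)
```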

Now I'm not sure if this is because I haven't figured out how to get the GPU set up properly, or if there is a very long and costly loading phase (I have about 450k WAVs). I see the WAVs being loaded in the STDOUT logging, and they are getting cached to disk. But should the GPU be in use during this phase, or do all the files need to be converted before training really starts?

Can someone please provide a brief explanation of what to expect when training is working properly? I can't tell whether I'm doing something wrong or not.

dustyny avatar Jan 24 '22 14:01 dustyny

Hi @dustyny, the spleeter training pipeline caches audio features (spectrograms), and the spectrogram computation is done on the CPU. With a fairly large database, spectrogram computation can take a long time, and during that phase the GPU won't be used much. To check that the GPU is being used correctly, I would recommend trying a fake test config with a very small dataset: with such a config, spectrogram computation and caching should be fast, and the GPU should start being heavily used quite soon (if correctly set up).
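
Something along these lines could work for carving out that tiny subset (a rough sketch; it assumes your dataset is described by train/validation CSVs as in the stock spleeter configs, and the file names here are placeholders):

```python
# Build a tiny training subset for a GPU smoke test.
# Assumes the dataset is described by CSV files, as in the stock
# spleeter configs; the paths below are hypothetical placeholders.
import pandas as pd

full_train = pd.read_csv("configs/train.csv")

# Keep a handful of rows so spectrogram caching finishes quickly.
full_train.sample(n=8, random_state=0).to_csv(
    "configs/train_tiny.csv", index=False
)

full_val = pd.read_csv("configs/validation.csv")
full_val.sample(n=4, random_state=0).to_csv(
    "configs/validation_tiny.csv", index=False
)
# Point a copy of the training config's train_csv / validation_csv
# at these files, then watch nvidia-smi once caching completes.
```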

romi1502 avatar Jan 28 '22 09:01 romi1502

@romi1502 thanks for confirming my suspicion. Is there a way to split the spectrogram computation out into its own stage? Right now I'm limited to 12 CPUs when using a GPU, but if I could split the stages, I could use 96 CPUs to create the spectrograms and then switch to a GPU instance to run the training.

Ideally I could distribute the CPU processing to a Spark/Beam/Ray cluster and scale up beyond 1,000 CPUs; a rough sketch of the idea is below.
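
To illustrate the split I have in mind, here's a rough sketch of the CPU-only half (using librosa and plain .npy output purely for illustration; this is not spleeter's actual cache format, and the directories are placeholders):

```python
# Illustrative CPU-only spectrogram precompute, parallelized across cores.
# NOT spleeter's cache format; it only shows the stage-split idea.
from multiprocessing import Pool
from pathlib import Path

import librosa
import numpy as np

SRC = Path("wavs")          # hypothetical input directory of .wav files
DST = Path("spectrograms")  # hypothetical output directory
DST.mkdir(exist_ok=True)

def compute(path: Path) -> None:
    # Load audio and compute a magnitude spectrogram on the CPU.
    y, _ = librosa.load(path, sr=44100, mono=True)
    mag = np.abs(librosa.stft(y, n_fft=4096, hop_length=1024))
    np.save(DST / (path.stem + ".npy"), mag)

if __name__ == "__main__":
    files = sorted(SRC.glob("*.wav"))
    with Pool(96) as pool:  # scale workers to the CPU-only instance
        pool.map(compute, files)
```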

dustyny avatar Jan 29 '22 13:01 dustyny

@romi1502

I ran the test on a smaller set as you suggested and confirmed the GPU was being used. I didn't see it fully utilized (60% was the peak, I think); I'm not sure if that was due to the size of the data I used, as it was only around 50 MB.

One more related question: I'm running training now and trying to get a sense of how long caching is going to take. I see the files logged to STDOUT, but as you know, that doesn't say where you are in the overall progress.

I'm thinking I can estimate that based on the size of the cache? I have a 500 GB training set, and right now the cache is taking about 113 GB. But I don't know whether there is a 1:1 relationship between the size of a WAV file and the data generated in the cache, or whether it's less or more.

EDITED: After a quick test, it looks like the cache is roughly 1.5x the dataset size.
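
For anyone who wants a quick progress estimate from that ratio, something like this works (a rough sketch; the paths are placeholders and the 1.5x factor is just what I observed above):

```python
# Rough caching-progress estimate from on-disk sizes,
# using the ~1.5x cache-to-dataset ratio observed above.
from pathlib import Path

def dir_size_bytes(root: str) -> int:
    # Sum the sizes of all regular files under root.
    return sum(p.stat().st_size for p in Path(root).rglob("*") if p.is_file())

dataset_bytes = dir_size_bytes("/data/wavs")   # hypothetical dataset path
cache_bytes = dir_size_bytes("/data/cache")    # hypothetical cache path

expected_cache = dataset_bytes * 1.5
print(f"Estimated caching progress: {100 * cache_bytes / expected_cache:.1f}%")
```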

dustyny avatar Jan 30 '22 18:01 dustyny