[Discussion] Can't seem to get the GPU working for training; maybe I'm too early in the pipeline to see it?
I've been trying for the past week to get GPU acceleration working and I don't think I'm getting it. I've tried a dedicated VM, the container, and versions 2.3.0 and 2.2.2. I've lost count of all the different CUDA/TensorFlow configurations I've tried. They all pass when I test TensorFlow directly for GPU access (I test PyTorch as well), so the GPU seems to be visible. I thought I got it to work once: I saw GPU memory go up to 20GB (half of the 40GB available), but over 6 hours the most I ever saw was a GPU spike to 15% for about a second. The next time I ran the same job on a different config, memory never went above 300MB and GPU usage never went above 0%. When I run nvidia-smi I do see the Python job listed.
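For reference, the TensorFlow check I'm running is essentially this (a minimal sketch, nothing spleeter-specific):

```python
import tensorflow as tf

# List the GPUs TensorFlow can see; an empty list means no GPU access at all.
print(tf.config.list_physical_devices("GPU"))

# Pin a small op to the GPU; this fails outright if no GPU device exists.
with tf.device("/GPU:0"):
    a = tf.random.normal((1024, 1024))
    b = tf.random.normal((1024, 1024))
    print(tf.reduce_sum(tf.matmul(a, b)).numpy())
```

Both pass on every configuration I've tried.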
Now I'm not sure whether this is because I haven't figured out how to set up the GPU properly, or whether there is a very long and costly loading phase (I have about 450k wavs). I see the wavs being loaded in the STDOUT logging, and they are being cached to disk. Should the GPU be in use during this phase, or do all the files need to be converted before training really starts?
Can someone please give a brief explanation of what to expect when training is working properly? I can't tell whether I'm doing something wrong.
Hi @dustyny, the spleeter training pipeline caches audio features (spectrograms), and the spectrogram computation is done on the CPU. If you have a fairly big dataset, spectrogram computation can take a long time, and during this phase the GPU won't be used much. To check that the GPU is being used correctly, I would recommend trying a test config with a very small dataset: with such a config, spectrogram computation and caching should be fast, and the GPU should start to be heavily used fairly soon (if it is set up correctly).
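For example, something like this can carve a small CSV out of your full training CSV so a throwaway config can point at it (a sketch only; it assumes your training data is listed in a CSV as referenced by the train_csv key of the training config, and the paths and row count here are placeholders):

```python
import csv

SRC = "config/train.csv"       # placeholder: your full training CSV
DST = "config/train_tiny.csv"  # placeholder: CSV for the throwaway config
N = 20                         # a handful of tracks is enough to exercise the GPU

with open(SRC) as src, open(DST, "w", newline="") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    writer.writerow(next(reader))  # keep the header row
    for i, row in enumerate(reader):
        if i >= N:
            break
        writer.writerow(row)
```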
@romi1502 thanks for confirming my suspicion. Is there a way to split out the spectrogram computation stage? Right now I'm limited to 12 CPUs when using a GPU, but if I could split the stages, I could use 96 CPUs to create the spectrograms and then switch to a GPU instance to run the training.
Ideally I could distribute the CPU processing to a Spark/Beam/Ray cluster and scale beyond 1,000 CPUs.
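There's nothing official I've found for this, but the idea would be something like the sketch below: run the CPU-heavy spectrogram pass separately with a process pool (this is not spleeter's actual cache format; librosa, the directories, and the STFT parameters are placeholders for illustration):

```python
import os
from multiprocessing import Pool

import numpy as np
import librosa  # assumption: librosa is available for audio loading

WAV_DIR = "wavs"    # placeholder: input wav directory
SPEC_DIR = "specs"  # placeholder: output spectrogram directory

def compute_spectrogram(name):
    # Load audio (downmixed to mono here for simplicity) and compute
    # a magnitude STFT on the CPU.
    y, _ = librosa.load(os.path.join(WAV_DIR, name), sr=44100, mono=True)
    spec = np.abs(librosa.stft(y, n_fft=4096, hop_length=1024))
    np.save(os.path.join(SPEC_DIR, name + ".npy"), spec)

if __name__ == "__main__":
    os.makedirs(SPEC_DIR, exist_ok=True)
    names = [f for f in os.listdir(WAV_DIR) if f.endswith(".wav")]
    # One worker per core: on a 96-CPU instance the whole pass parallelizes.
    with Pool() as pool:
        pool.map(compute_spectrogram, names)
```

The catch is producing the exact format spleeter's cache expects, which is why I'm asking whether the stage can be split officially.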
@romi1502
I ran the test on a smaller set as you suggested and confirmed the GPU was being used. I didn't see it fully utilized (60% was the peak, I think); I'm not sure if that's down to the size of the dataset, which was only around 50MB.
One more related question: I'm running training now and trying to get a sense of how long caching will take. I see the files logged to STDOUT, but as you know, that doesn't tell you where you are in the overall progress.
I'm thinking I can estimate it from the size of the cache? I have a 500GB training set, and right now the cache is about 113GB. But I don't know whether there is a 1:1 relationship between the size of a wav file and the data generated in the cache, or whether it's less or more.
EDIT: after a quick test, it looks like the cache is roughly 1.5x the dataset size.
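In case it's useful to anyone else, that ratio gives a rough progress estimate (a sketch; the directory paths are placeholders and the 1.5x factor is just what I measured above):

```python
import os

def dir_size(path):
    # Total size in bytes of every file under path.
    total = 0
    for root, _, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total

DATASET_DIR = "wavs"          # placeholder: the 500GB training set
CACHE_DIR = "training_cache"  # placeholder: spleeter's on-disk cache
EXPANSION = 1.5               # observed cache size / dataset size ratio

expected = dir_size(DATASET_DIR) * EXPANSION
done = dir_size(CACHE_DIR)
print(f"caching ~{100 * done / expected:.1f}% complete")
```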