Music-Source-Separation-Training

Accelerate version not working properly

Open rimb05 opened this issue 1 year ago • 26 comments

I tried training a model using:

accelerate launch train_accelerate.py

I get this output:

/usr/local/lib/python3.10/dist-packages/transformers/utils/generic.py:441: FutureWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
The following values were not passed to accelerate launch and had defaults used instead:
  --num_processes was set to a value of 6
    More than one GPU was found, enabling multi-GPU training. If this was unintended please pass in --num_processes=1.
  --num_machines was set to a value of 1
  --dynamo_backend was set to a value of 'no'
To avoid this warning pass in values for each of the problematic parameters or run accelerate config.

It continues to train after the warning, but the loss value is always 'nan' and the validation results in 0.0dB for all stems.

rimb05 avatar Aug 20 '24 03:08 rimb05

Did you run accelerate config?

ZFTurbo avatar Aug 20 '24 06:08 ZFTurbo

Yes, I did and I chose all the default options.

rimb05 avatar Aug 20 '24 09:08 rimb05

I see you have some problem with nans. Try choosing float32 for training. Do you have the same problem with the standard train.py script?

ZFTurbo avatar Aug 20 '24 09:08 ZFTurbo

The same training works fine without accelerate (train.py). How would I enable float32? Do you mean in accelerate config? I am training with the mdx23c model.

rimb05 avatar Aug 20 '24 10:08 rimb05

I tried it again with fp32 ("use_amp" set to false), but I still get nans after a while. I also tried htdemucs instead of mdx23c. It made no difference. When I run the same training without accelerate, it runs fine for hours. See my output below. You can see that after a couple of epochs, I get nans for the loss, then the validation returns all zeros. I would really appreciate any insight.

/usr/local/lib/python3.10/dist-packages/transformers/utils/generic.py:441: FutureWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
The following values were not passed to accelerate launch and had defaults used instead:
  --num_processes was set to a value of 6
    More than one GPU was found, enabling multi-GPU training. If this was unintended please pass in --num_processes=1.
  --num_machines was set to a value of 1
  --mixed_precision was set to a value of 'no'
  --dynamo_backend was set to a value of 'no'
To avoid this warning pass in values for each of the problematic parameters or run accelerate config.
Instruments: ['vocal', 'drums', 'guitar', 'bass', 'piano', 'synth']
Old metadata was used for 24603 tracks.
Old metadata was used for 24603 tracks.
Old metadata was used for 24603 tracks.
Old metadata was used for 24603 tracks.
Old metadata was used for 24603 tracks.

0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
Use augmentation for training
Dataset type: 1
Processes to use: 64
Collecting metadata for ['training_output']
Found metadata cache file: results3/metadata_1.pkl
Old metadata was used for 24603 tracks.
0it [00:00, ?it/s]
Found tracks in dataset: 24603
Processes GPU: 6
Patience: 2
Reduce factor: 0.95
Batch size: 4
Grad accum steps: 1
Effective batch size: 4
Optimizer: adam
100%|█| 10/10 [00:24<00:00, 2.41s/it, sdr_vocal=-0.0673, sdr_drums=-0.0126, sdr_
Valid length: 59
Instr SDR vocal: -0.0731 Debug: 60
Instr SDR vocal: -0.0733 Debug: 60
Valid length: 59
Instr SDR drums: -0.0027 Debug: 60
Instr SDR drums: -0.0020 Debug: 60
Valid length: 59
Instr SDR guitar: -6.5638 Debug: 60
Instr SDR guitar: -6.6667 Debug: 60
Valid length: 59
Instr SDR bass: -3.7258 Debug: 60
Instr SDR bass: -3.7697 Debug: 60
Valid length: 59
Instr SDR piano: -8.7960 Debug: 60
Instr SDR piano: -8.7994 Debug: 60
Valid length: 59
Instr SDR synth: -2.4083 Debug: 60
Instr SDR synth: -2.4107 Debug: 60
SDR Avg: -3.6203
Train for: 1000
Train epoch: 0
Learning rate: 9e-05
100%|████████| 1000/1000 [10:16<00:00, 1.62it/s, loss=0.0779, avg_loss=6.87e+3]
Training loss: 68.670630
100%|█| 10/10 [00:23<00:00, 2.33s/it, sdr_vocal=2.84, sdr_drums=1.79, sdr_guitar=-1.
Instr SDR vocal: 2.5106 Debug: 60
Instr SDR drums: 1.3487 Debug: 60
Instr SDR guitar: -7.0081 Debug: 60
Instr SDR bass: -5.7739 Debug: 60
Instr SDR piano: -6.0657 Debug: 60
Instr SDR synth: -2.3428 Debug: 60
SDR Avg: -2.8885
Store weights: results3/model_htdemucs_ep_0_sdr_-2.8885.ckpt
Train epoch: 1
Learning rate: 9e-05
100%|███████████████| 1000/1000 [10:18<00:00, 1.62it/s, loss=nan, avg_loss=nan]
Training loss: nan
100%|█| 10/10 [00:24<00:00, 2.42s/it, sdr_vocal=0, sdr_drums=0, sdr_guitar=0, sdr_ba
Instr SDR vocal: 0.0000 Debug: 60
Instr SDR drums: 0.0000 Debug: 60
Instr SDR guitar: 0.0000 Debug: 60
Instr SDR bass: 0.0000 Debug: 60
Instr SDR piano: 0.0000 Debug: 60
Instr SDR synth: 0.0000 Debug: 60
SDR Avg: 0.0000
Store weights: results3/model_htdemucs_ep_1_sdr_0.0000.ckpt
Train epoch: 2
Learning rate: 8.55e-05

rimb05 avatar Aug 21 '24 17:08 rimb05
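(For anyone chasing the same nan behaviour, a minimal hedged debugging sketch, not code from this repo and with illustrative names: it flags the first step where the loss or the gradients stop being finite and skips that update instead of letting it poison the weights.)

```python
import torch
import torch.nn.functional as F
from accelerate import Accelerator

# Hedged sketch, not the repo's code: skip any update where the loss or the
# gradients are no longer finite, and log where it first happens.
accelerator = Accelerator()

def safe_training_step(model, optimizer, mixture, targets):
    optimizer.zero_grad()
    loss = F.l1_loss(model(mixture), targets)
    if not torch.isfinite(loss):
        accelerator.print("non-finite loss, skipping this step")
        return None
    accelerator.backward(loss)
    grads_finite = all(
        torch.isfinite(p.grad).all()
        for p in model.parameters() if p.grad is not None
    )
    if not grads_finite:
        accelerator.print("non-finite gradients, skipping this step")
        optimizer.zero_grad()
        return None
    optimizer.step()
    return loss.item()
```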

And here is my command line:

!accelerate launch Music-Source-Separation-Training/train_accelerate.py \
    --model_type htdemucs \
    --config_path config.yaml \
    --results_path results3 \
    --data_path training_output \
    --valid_path training_output_eval \
    --dataset_type 1 \
    --num_workers 4 \
    --device_ids 0 1 2 3 4 5

rimb05 avatar Aug 21 '24 17:08 rimb05

Any ideas? When it does run, it is much faster than train.py, so it would be great if this could work properly.

rimb05 avatar Aug 23 '24 15:08 rimb05

Sorry, I have problems with this script myself, but with validation rather than training. I haven't had time to fix it yet. I will try next week.

ZFTurbo avatar Aug 24 '24 07:08 ZFTurbo

Thanks. I can confirm the problem happens with many different models.

rimb05 avatar Aug 27 '24 19:08 rimb05

I made some fixes. The main issue was probably that I had lost optimizer.zero_grad(). I have no machine to test the new code on right now. Can you please check it if possible?

UPD: I tested a bit. Looks like it works fine now.

ZFTurbo avatar Aug 28 '24 09:08 ZFTurbo
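(For reference, a minimal hedged sketch of the usual Accelerate step order, with optimizer.zero_grad() called once per iteration. The dummy model and data below are illustrative stand-ins, not the repo's actual code.)

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()
model = torch.nn.Conv1d(2, 12, kernel_size=3, padding=1)   # stand-in for the separation model
optimizer = torch.optim.Adam(model.parameters(), lr=9e-5)
dataset = TensorDataset(torch.randn(64, 2, 4096), torch.randn(64, 12, 4096))
loader = DataLoader(dataset, batch_size=4, shuffle=True)

model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for mixture, targets in loader:
    optimizer.zero_grad()                  # the call that was missing
    loss = F.l1_loss(model(mixture), targets)
    accelerator.backward(loss)             # used instead of loss.backward()
    optimizer.step()
```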

Thanks, looks good so far!

I had a general question: when training a model with lots of stems, I notice two things:

  • The GPU is only utilized in bursts; utilization swings between 20-30% and 100%.
  • While it's training, it pauses for a few seconds and then resumes. It does this throughout the training.

For the pausing, increasing the data workers helps, but doesn't completely solve the problem.

As for the GPU usage, it would be great if there were a way to keep the GPUs at 100% all the time. So my question is: what causes this lack of GPU efficiency? Is it an SSD speed issue or a processor issue? I'm training with six 3090 GPUs with P2P enabled, so the GPU-to-GPU bandwidth is 50 GB/s bidirectional, and I am using a RAID 0 array that reads at 11 GB/s. Would improving the CPU or SSD speed help with this?

Thanks!

rimb05 avatar Aug 29 '24 00:08 rimb05

I just upgraded my SSD but didn't see much improvement. I would love to get your insight on where the inefficiencies are occurring in these stem separation models.

rimb05 avatar Aug 31 '24 23:08 rimb05

Some augmentations can also cause slowdowns during training when enabled (in particular pitch-shifting, time-stretching and MP3 encoding), and at least some of them, if not all, run on the CPU.

If you disable them all, is it significantly faster?

jarredou avatar Aug 31 '24 23:08 jarredou

  1. During training, check the IO load (with the iotop command if you are on Linux). If your data is on an SSD and you don't have a very big batch size, I'm sure it's not the problem (a quick timing sketch below shows how to split data-loading time from GPU time).
  2. Check that your batch size is not too big. Sometimes, if there is not enough memory, you will see a big slowdown. Reduce the batch size a bit and try again.
  3. As @jarredou said, try disabling the augmentations and check whether it's faster.

ZFTurbo avatar Sep 01 '24 06:09 ZFTurbo
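(The timing sketch referenced in point 1, hedged and with illustrative names rather than the repo's code: it splits wall-clock time between waiting on the dataloader workers and the actual GPU work, which is usually enough to tell whether IO/CPU or the GPU is the bottleneck.)

```python
import time
import torch
import torch.nn.functional as F

def profile_steps(model, loader, optimizer, device, steps=100):
    """Rough split of wall time between data loading and GPU compute."""
    data_time, gpu_time = 0.0, 0.0
    it = iter(loader)
    t0 = time.time()
    for _ in range(steps):
        mixture, targets = next(it)              # waits on the CPU-side workers
        data_time += time.time() - t0
        mixture, targets = mixture.to(device), targets.to(device)
        t1 = time.time()
        optimizer.zero_grad()
        loss = F.l1_loss(model(mixture), targets)
        loss.backward()
        optimizer.step()
        torch.cuda.synchronize(device)           # wait for the GPU so the timing is honest
        gpu_time += time.time() - t1
        t0 = time.time()
    print(f"data wait: {data_time:.1f}s, GPU compute: {gpu_time:.1f}s over {steps} steps")
```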

Thanks for the help. I tried disabling all augmentations, and it didn't make much difference. My CPU must be fast enough to keep up.

However, I did notice an interesting thing - this issue only happens when I use more than one GPU. If I only train with a single GPU, the utilization is nearly 100% all the time. As soon as I add a second GPU, the utilization goes down. By the time I add 6 GPUs, it's about 50% utilization on average (swings from 0% to 100% periodically). What could be going on?

I checked the IO load with iotop and all the worker threads are using about 1-2% IO. I also upgraded my SSD RAID 0 array and now I get 23 GB/s, so I don't think that's the bottleneck.

The number of workers is currently at 24 (4 per GPU). I tried larger and it didn't make any difference.

I am testing with the MDX23c model.

rimb05 avatar Sep 01 '24 18:09 rimb05
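(One thing worth checking in a multi-GPU run, hedged and illustrative rather than taken from the repo: the per-process DataLoader settings. With six processes, each one runs its own loader, so small per-process worker counts with no prefetching can starve the GPUs in exactly this bursty way. The dummy tensors below only stand in for the chunked audio dataset.)

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy data stands in for the real chunked audio dataset; only the loader settings matter here.
dataset = TensorDataset(torch.randn(256, 2, 44100), torch.randn(256, 6, 2, 44100))
loader = DataLoader(
    dataset,
    batch_size=4,
    num_workers=4,            # per process, so 6 GPU processes -> 24 workers total
    pin_memory=True,          # faster host-to-device copies
    persistent_workers=True,  # keep workers alive between epochs
    prefetch_factor=4,        # each worker keeps a few batches ready
)
```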

Try reducing the number of workers. Make it lower, say 2 or 4.

ZFTurbo avatar Sep 01 '24 18:09 ZFTurbo

When I try this, it only processes 1 or 2 steps, then pauses for a few seconds, then does another 1 or 2, then pauses again. In order to avoid the pauses I need to increase the workers to at least 16.

rimb05 avatar Sep 01 '24 19:09 rimb05

Does it happen with both versions of the training script (train.py and train_accelerate.py)?

ZFTurbo avatar Sep 02 '24 06:09 ZFTurbo

Yes, it's the same thing on both, but the accelerate version is a little faster.

rimb05 avatar Sep 02 '24 10:09 rimb05

Looks like the problem was the augmentations after all, particularly the pitch shift and distortion. I didn't realize I had these turned on. Now I see a ~20% improvement when running with accelerate, and there are no more pauses. I'm also able to get my data workers down to 2 with no problem. Thanks for the help!

rimb05 avatar Sep 04 '24 15:09 rimb05

I do see one problem with the accelerate version though. For some reason, the learning rate decreases after every epoch. Patience is set to 3, but it still decreases the lr every time (even the first time). Might be the way the SDR is being averaged across all processes?

rimb05 avatar Sep 04 '24 15:09 rimb05

I do see one problem with the accelerate version though. For some reason, the learning rate decreases after every epoch. Patience is set to 3, but it still decreases the lr every time (even the first time). Might be the way the SDR is being averaged across all processes?

I couldn't fix this issue yet. It's a problem with the scheduler. I need to understand how to call it correctly.

ZFTurbo avatar Sep 04 '24 15:09 ZFTurbo

It must be that the scheduler is being called multiple times per epoch (one time for every GPU?). It's the only way I can think of that the LR gets decreased even after one epoch...

rimb05 avatar Sep 04 '24 15:09 rimb05

It must be that the scheduler is being called multiple times per epoch (one time for every GPU?). It's the only way I can think of that the LR gets decreased even after one epoch...

Yes, but when I call it only once, on the main thread, the LR becomes different on different GPUs... I need to understand the problem.

ZFTurbo avatar Sep 04 '24 15:09 ZFTurbo
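(A hedged sketch of one way to keep ReduceLROnPlateau in sync under Accelerate, with illustrative names rather than the repo's code: average the validation metric across processes first, then let every rank step the scheduler exactly once per epoch with that same value, so the learning rates never diverge.)

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()
model = torch.nn.Linear(8, 8).to(accelerator.device)        # stand-in model
optimizer = torch.optim.Adam(model.parameters(), lr=9e-5)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", patience=3, factor=0.95)          # SDR: higher is better

def step_scheduler_in_sync(local_sdr: float):
    sdr = torch.tensor(local_sdr, device=accelerator.device)
    sdr = accelerator.reduce(sdr, reduction="mean")   # same averaged value on every rank
    scheduler.step(sdr.item())                        # one call per epoch per rank, all in sync
```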

In case you are using audiomentations 0.24.0 for data augmentation and you are observing bottleneck issues: I have improved the speed in audiomentations 0.27.0, 0.31.0, 0.34.1, 0.36.0, 0.36.1 and 0.37.0 (see changelog). Upgrading may help a little bit.

iver56 avatar Sep 05 '24 06:09 iver56

If it's pedalboard's distortion that was slow, I would recommend fully removing that augmentation, as it also creates huge gain changes, while audiomentations has a better alternative (tanh distortion) that is gain-balanced, sounds more musical, and is faster. The most useful pedalboard augmentation is the reverb; for anything else I would go with audiomentations first.

jarredou avatar Sep 05 '24 22:09 jarredou
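(For illustration, a hedged example of the kind of swap being suggested, using audiomentations' gain-balanced tanh distortion in place of pedalboard's Distortion. The parameter values and the extra pitch shift are arbitrary placeholders, not recommendations from this thread.)

```python
import numpy as np
from audiomentations import Compose, TanhDistortion, PitchShift

augment = Compose([
    TanhDistortion(min_distortion=0.01, max_distortion=0.7, p=0.3),  # gain-balanced distortion
    PitchShift(min_semitones=-2, max_semitones=2, p=0.1),
])

audio = np.random.uniform(-0.5, 0.5, size=44100).astype(np.float32)  # 1 s of noise as a stand-in
processed = augment(samples=audio, sample_rate=44100)
```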