DESED_task

Errors when training with multiple GPUs

Open mmuguang opened this issue 2 years ago • 17 comments

It worked well when training with a single GPU, but an error about the batch sampler occurs when using multiple GPUs (screenshots attached).

mmuguang avatar Apr 07 '22 11:04 mmuguang

AttributeError: 'ConcatDatasetBatchSampler' object has no attribute 'batch_size'

mmuguang avatar Apr 07 '22 11:04 mmuguang

Hi there, if you don't use the batch_sampler, what happens? (Of course your results won't be as good, but it is just to get more info about the issue.)

turpaultn avatar Apr 08 '22 08:04 turpaultn

It is due to Lightning defaulting to Distributed Data Parallel (DDP). You have to do some workarounds to make the custom sampler work with DDP. Can you try with plain DataParallel? It works on my side with DataParallel.

Set backend: dp in the YAML file.
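
For reference, a minimal sketch of what backend: dp usually maps onto in the Lightning Trainer (argument names changed across Lightning 1.x versions, so the exact wiring in the baseline may differ):

```python
import pytorch_lightning as pl

# Minimal sketch, assuming a Lightning 1.x Trainer: "dp" selects torch.nn.DataParallel
# instead of the default DDP strategy. In Lightning <= 1.4 this was typically passed as
# `accelerator="dp"`; from 1.5 onwards it is `strategy="dp"`.
trainer = pl.Trainer(gpus=2, strategy="dp")
```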

popcornell avatar Apr 08 '22 12:04 popcornell

Thanks! @popcornell, it works after setting backend: dp.

mmuguang avatar Apr 08 '22 12:04 mmuguang

There is another problem during validation when using 2 GPUs (screenshots attached).

mmuguang avatar Apr 08 '22 13:04 mmuguang

It seems that Lightning does not split the filenames list between the GPUs (it does, however, split the torch tensors).
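
For the record, the same split/no-split behaviour can be reproduced with torch's own scatter utility (the one DataParallel relies on); a minimal sketch, assuming at least two visible GPUs:

```python
import torch
from torch.nn.parallel.scatter_gather import scatter

# Minimal sketch (needs >= 2 GPUs): DataParallel splits tensors along the batch
# dimension but simply replicates Python lists of strings, so the filenames no
# longer line up with each GPU's tensor slice.
batch = {
    "audio": torch.randn(4, 16000),
    "filenames": ["a.wav", "b.wav", "c.wav", "d.wav"],  # hypothetical names
}
per_gpu = scatter(batch, [0, 1])
print(per_gpu[0]["audio"].shape)     # torch.Size([2, 16000])
print(len(per_gpu[0]["filenames"]))  # 4 -> mismatch with the tensor slice
```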

popcornell avatar Apr 08 '22 13:04 popcornell

It might be a bug in Lightning; I don't know how to fix it easily.

popcornell avatar Apr 08 '22 13:04 popcornell

See https://github.com/PyTorchLightning/pytorch-lightning/issues/1508. It seems we have to rewrite the collate_fn then.
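
One possible direction for such a rewrite (a sketch only, not the baseline's actual code; it assumes each dataset item is an (audio, labels, filepath) tuple and that the full list of filepaths is known up-front):

```python
import torch

def make_collate_fn(all_filepaths):
    # Hypothetical rewrite of the collate_fn: turn filepath strings into indices
    # into a fixed list known up-front, so the "filenames" entry becomes a
    # LongTensor that DataParallel splits together with the audio tensors.
    path_to_idx = {p: i for i, p in enumerate(all_filepaths)}

    def collate(batch):
        audio = torch.stack([item[0] for item in batch])
        labels = torch.stack([item[1] for item in batch])
        name_idx = torch.tensor([path_to_idx[item[2]] for item in batch], dtype=torch.long)
        return audio, labels, name_idx

    return collate
```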

popcornell avatar Apr 08 '22 14:04 popcornell

@mmuguang an easy fix is to use batch_size = 1 for validation. But then you would probably want to run evaluation only every X epochs.
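
Roughly, that workaround could look as follows (a sketch only; the loader arguments and the interval are placeholders, not the baseline's actual settings):

```python
import pytorch_lightning as pl
from torch.utils.data import DataLoader, Dataset

def make_val_loader(val_dataset: Dataset) -> DataLoader:
    # batch_size=1 means DataParallel has nothing to mis-split during validation.
    return DataLoader(val_dataset, batch_size=1, num_workers=4)

# Run validation only every 10 epochs to limit the slowdown caused by batch_size=1.
trainer = pl.Trainer(gpus=2, strategy="dp", check_val_every_n_epoch=10)
```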

popcornell avatar Apr 12 '22 17:04 popcornell

It is not very easy to fix this with Lightning. I tried to use the SpeechBrain dataio.batch.PaddedBatch collate_fn, but it did not work with DP and Lightning. Also, we can't encode the paths as strings since they have different lengths; using a Tokenizer and then decoding them is too complicated IMO. I'd say we go for the batch-size-1 approach with DP and issue a warning.

popcornell avatar Apr 12 '22 17:04 popcornell

@turpaultn, opinions on this?

popcornell avatar Apr 12 '22 17:04 popcornell

@popcornell I have given up on using Lightning and rewritten the baseline in plain PyTorch. It can split the filenames and works well with DP. The old version of Lightning is very difficult to use.

mmuguang avatar Apr 13 '22 07:04 mmuguang

Lightning also has problems during training with DP. When using 2 GPUs, the loss on the second GPU becomes NaN. Perhaps there is only unlabeled audio on it, so the supervised loss is undefined. The final loss is then also NaN and the model can't be trained normally.
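
That would match how DP splits the batch: each replica only sees a slice, and averaging a supervised loss over a slice with zero labeled clips yields NaN. A hedged sketch of how such a term could be guarded (the criterion and mask names are hypothetical, not the baseline's):

```python
import torch

def safe_supervised_loss(pred: torch.Tensor,
                         target: torch.Tensor,
                         labeled_mask: torch.Tensor,
                         criterion) -> torch.Tensor:
    # Hypothetical guard: under DataParallel a per-GPU slice may contain no labeled
    # clips at all; averaging a loss over zero elements gives NaN, which then
    # propagates into the total loss.
    if labeled_mask.any():
        return criterion(pred[labeled_mask], target[labeled_mask])
    # Return a graph-connected zero so backward() still works on this replica.
    return (pred.sum() * 0.0)
```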

mmuguang avatar Apr 13 '22 07:04 mmuguang

I think that is expected, since the batch is divided among the GPUs. Yeah, unfortunately multi-GPU is broken currently. We probably need to dump Lightning from the baseline code, but maybe we can do that for next year's baseline.

popcornell avatar Apr 13 '22 10:04 popcornell

I think rather than ditching Lightning altogether, a much simpler solution would be to index all audio files in the dataset and return the corresponding ID/index in all __getitem__() methods of the datasets instead of the filepath strings. This way they would be treated as tensors and there won't be any problem with multi-GPU runs.
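
A minimal sketch of that idea (class and argument names here are hypothetical, just to illustrate the pattern):

```python
import torch
from torch.utils.data import Dataset

class IndexedAudioDataset(Dataset):
    """Hypothetical wrapper: return an integer clip index instead of the filepath,
    so DataParallel/DDP can split it like any other tensor."""

    def __init__(self, filepaths, load_audio):
        self.filepaths = list(filepaths)   # global index -> path lookup
        self.load_audio = load_audio       # callable: path -> waveform tensor

    def __len__(self):
        return len(self.filepaths)

    def __getitem__(self, idx):
        audio = self.load_audio(self.filepaths[idx])
        # The index travels with the batch as a tensor and is split across GPUs;
        # map it back to self.filepaths only after gathering the outputs.
        return audio, torch.tensor(idx, dtype=torch.long)
```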

Moadab-AI avatar May 03 '22 12:05 Moadab-AI

Thanks @Moadab-AI for the help.

I've thought about that, and there could be a problem because we use ConcatDataset. I am sure it can be made to work, but I think it will be hacky. We would have to propagate the original individual datasets to the pl.LightningModule.

Also, IDK, the code is maybe already quite hacky IMO and not very readable by newcomers, especially ones who have never had hands-on experience with Lightning.

What are your thoughts on this? I would like to hear your feedback.

popcornell avatar May 03 '22 18:05 popcornell

My opinion on this:

  • We have small models; running them on multiple GPUs does not seem to make much sense anyway when you know the training time of one model 🤔 If you have multiple GPUs, you can run multiple experiments in parallel.

If there is a need to run on multiple GPUs for whatever reason:

  • A new, bigger dataset
  • A more complex approach
  • Something else...

Then I would say let's create a new recipe, even with "hacks", for more advanced users 😁

What do you think?

turpaultn avatar Sep 20 '22 15:09 turpaultn