DESED_task
Errors when training with multiple GPUs
It worked well when training with a single GPU, but an error about the batch sampler occurs when using multiple GPUs:
`AttributeError: 'ConcatDatasetBatchSampler' object has no attribute 'batch_size'`
Hi there, if you don't use the batch_sampler, what happens? (Of course your results won't be satisfactory, but it is just to get more info about the issue.)
It is due to Lightning defaulting to Distributed Data Parallel (DDP). You have to do some workarounds to make the custom sampler work with DDP. Can you try with plain DataParallel? It works on my side with DataParallel.
Set `backend: dp` in the YAML file.
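For reference, this is a one-line change in the training config (the exact placement of the key within the YAML file may differ depending on your version of the recipe):

```yaml
# training config (excerpt)
backend: dp   # DataParallel; "ddp" breaks the custom ConcatDatasetBatchSampler
```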
Thanks! @popcornell it works after setting backend=dp
There is another problem during validation using 2 gpus.
It seems that Lightning does not split the filenames list (it does, however, split the torch tensors between the GPUs).
Might be a bug in Lightning; I don't know how to fix that easily.
https://github.com/PyTorchLightning/pytorch-lightning/issues/1508 — seems we have to rewrite the collate_fn then.
@mmuguang an easy fix is to use `batch_size = 1` for validation. But then you would probably want to run evaluation only every X epochs.
It is not very easy to fix this with Lightning. I tried to use the SpeechBrain `dataio.batch.PaddedBatch` collate_fn, but it did not work with DP and Lightning. Also, we can't simply encode the paths, as the strings have different lengths; using a tokenizer and then decoding them is too complicated IMO. I'd say we go for the batch-size-1 approach with DP and issue a warning.
@turpaultn opinions on this ?
@popcornell I have given up on Lightning and rewrote the baseline in plain PyTorch. It can split the filenames and works well with DP. The old version of Lightning is so difficult to use.
Lightning also has bugs during training with DP. When using 2 GPUs, the loss on the second GPU becomes NaN. Perhaps there is only unlabeled audio on it, so the supervised loss is None; the final loss is then also NaN and the model can't be trained normally.
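One way to guard against this, sketched below. This is only an illustration of the idea, not code from the baseline; the function and argument names (`combine_losses`, `w_unsup`) are made up for the example:

```python
def combine_losses(supervised_loss, unsupervised_loss, w_unsup=2.0):
    """Combine the semi-supervised loss terms for one GPU replica.

    If this replica's shard of the batch happens to contain no labeled
    clips, supervised_loss is None; treat it as 0 instead of letting a
    None/NaN propagate into the total loss.
    """
    sup = 0.0 if supervised_loss is None else supervised_loss
    return sup + w_unsup * unsupervised_loss
```

With this guard, a replica that received only unlabeled audio still returns a finite loss (e.g. `combine_losses(None, 0.5)` gives `1.0` with the default weight), so training can proceed.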
I think that is expected, since the batch is divided among GPUs. Yeah, unfortunately multi-GPU is broken currently. We probably do need to drop Lightning from the baseline code, but maybe we can do it for next year's baseline.
I think rather than ditching Lightning altogether, a much simpler solution would be to index all audio files in the dataset and return the corresponding ID/index in all `__getitem__()` methods of the datasets instead of the filepath strings. This way they would be treated as tensors and there won't be any problem with multi-GPU runs.
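The idea could look something like this. A minimal sketch only; the class and function names are illustrative and not from the DESED_task code, and the real dataset would of course load audio rather than return a placeholder string:

```python
class IndexedAudioDataset:
    """Wraps a list of audio files so __getitem__ returns an integer id
    instead of a path string.

    Integer ids can be collated into a tensor, which DataParallel knows
    how to split across GPUs; plain Python string lists are not split.
    """

    def __init__(self, filepaths):
        self.filepaths = list(filepaths)

    def __len__(self):
        return len(self.filepaths)

    def __getitem__(self, idx):
        # the real dataset would load the waveform here
        waveform = f"<audio for {self.filepaths[idx]}>"
        return waveform, idx  # id instead of the path string


def ids_to_paths(ids, dataset):
    """After gathering outputs from all GPUs, map ids back to filenames."""
    return [dataset.filepaths[i] for i in ids]
```

For example, `ids_to_paths([2, 0], dataset)` recovers the original filenames after validation, regardless of how DP sharded the batch.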
Thanks @Moadab-AI for the help.
I've thought about that, and there could be a problem because we use ConcatDataset. I am sure it can be made to work, but I think it will be hacky: we would have to propagate the original individual datasets to the pl.LightningModule.
Also, IMO the code is maybe already very hacky and not very readable for newcomers, especially ones who have never had hands-on experience with Lightning.
What are your thoughts on this? I would like to hear your feedback.
My opinion on this:
- We have small models; running them on multiple GPUs does not seem to make that much sense anyway when you know the training time of one model 🤔 If you have multiple GPUs, you can run multiple experiments in parallel.
- If there is a need to run on multiple GPUs for whatever reason:
  - A new, bigger dataset
  - A more complex approach
  - Something else...

  Then I would say let's create a new recipe, even with "hacks", for more advanced users 😁

What do you think?