Reimplementing DDP
Is your feature request related to a problem? Please describe.
In Haystack, there are remnants of FARM's DDP (DistributedDataParallel) implementation (e.g. WrappedDDP). However, at the higher abstraction layers this feature is mostly disabled (e.g. distributed=False).
Haystack also uses DataParallel, which is similar to DistributedDataParallel to some degree but has some disadvantages (see comparison here). Especially with small batch sizes, these can be quite severe. For example, when fine-tuning a deberta-v3-large model, using 4 GPUs is currently slightly slower than using 1 GPU, just because of the overhead.
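For context, DataParallel replicates the model inside a single process and scatters every batch across the GPUs on each forward pass, so with small per-GPU batches the scatter/gather and replication overhead can outweigh the actual compute, which matches the deberta-v3-large observation above. A minimal illustration of what this looks like at the PyTorch level (the toy model is a placeholder, not Haystack's actual wrapping code):

```python
import torch

# Placeholder model standing in for e.g. a transformer reader.
model = torch.nn.Linear(1024, 1024).cuda()

# A single process drives all GPUs: each forward call replicates the model,
# scatters the batch across devices, and gathers the outputs on GPU 0.
# With small per-GPU batches this overhead can make 4 GPUs slower than 1.
dp_model = torch.nn.DataParallel(model)
out = dp_model(torch.randn(8, 1024).cuda())
```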
Describe the solution you'd like
Reimplement (or re-enable) proper DDP support so that multi-GPU training actually scales, instead of relying on DataParallel.
Hi @MichelBartels, could you provide some more details on what is currently missing in Haystack's DDP implementation compared to FARM's? For example, some code snippet comparisons would be helpful.
Hi, I don't have that much experience with DDP, but as far as I know most of it is already implemented. It should mostly be a matter of passing down a distributed parameter to the initialize_optimizer method and the DataSilo, roughly as sketched below.
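In plain PyTorch terms, that flag would essentially have to toggle two things: a DistributedSampler for the train DataLoader built by the DataSilo, and a DistributedDataParallel wrapper applied in initialize_optimizer. A minimal sketch of that behaviour (generic PyTorch, not Haystack's actual code; the function names are just for illustration):

```python
import torch
from torch.nn.parallel import DistributedDataParallel
from torch.utils.data import DataLoader, DistributedSampler

def build_train_loader(dataset, batch_size, distributed=False):
    # Roughly what the DataSilo would need to do: shard the dataset per process
    # instead of splitting every batch inside a single process.
    sampler = DistributedSampler(dataset) if distributed else None
    return DataLoader(dataset, batch_size=batch_size,
                      sampler=sampler, shuffle=(sampler is None))

def wrap_model(model, local_rank, distributed=False):
    # Roughly what initialize_optimizer would need to do: wrap the model so that
    # gradients are all-reduced across processes after each backward pass.
    model = model.cuda(local_rank)
    if distributed:
        model = DistributedDataParallel(model, device_ids=[local_rank])
    return model
```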
Also, it would then probably make sense to create an example script, similar to this one, showing how to make it work with something like torch.distributed.launch.
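For reference, a minimal self-contained script of the kind such an example could build on (pure PyTorch with a toy model and dataset, not Haystack-specific; the file name train_ddp_example.py is hypothetical). It would be started with something like python -m torch.distributed.launch --nproc_per_node=4 --use_env train_ddp_example.py, or torchrun --nproc_per_node=4 train_ddp_example.py on newer PyTorch versions:

```python
# train_ddp_example.py -- hypothetical name; run via torch.distributed.launch / torchrun
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

def main():
    # The launcher sets LOCAL_RANK (with --use_env / torchrun) as well as
    # MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE for init_process_group.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    # Placeholder model and data standing in for a Haystack model + DataSilo.
    model = torch.nn.Linear(1024, 2).cuda(local_rank)
    model = DistributedDataParallel(model, device_ids=[local_rank])
    dataset = TensorDataset(torch.randn(512, 1024), torch.randint(0, 2, (512,)))
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=8, sampler=sampler)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle the per-process shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()  # DDP all-reduces gradients here
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```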