
[Feature]: Multiple GPU Training for MultiTask Learning

Open zrjohnnyl opened this issue 5 months ago • 3 comments

Problem statement

I don't believe Flair currently supports training a single task on multiple GPUs. Would it be easier to support multi-GPU training for multi-task learning?

Solution

For example, if you have two tasks and are doing multi-task learning, it would be super handy to assign each task its own GPU, e.g. sending task_1 to gpu_1 and task_2 to gpu_2. Ideally, you could just tell your trainer which task goes on which GPU when setting everything up, as in the hypothetical sketch below.
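To make the idea concrete, here is a purely hypothetical sketch of what that could look like. The `task_devices` argument does not exist in Flair; `model_1`, `model_2`, and `multi_corpus` are placeholders, and only `MultitaskModel` and `ModelTrainer` follow Flair's actual multitask setup:

```python
# Purely illustrative -- the `task_devices` argument is a proposed API, not real Flair.
from flair.models import MultitaskModel
from flair.trainers import ModelTrainer

# model_1 / model_2 / multi_corpus are placeholders for two single-task
# models and their combined corpus, set up as in Flair's multitask tutorial
multitask_model = MultitaskModel([model_1, model_2], task_ids=["task_1", "task_2"])

trainer = ModelTrainer(multitask_model, multi_corpus)
trainer.train(
    "resources/taggers/multitask",
    task_devices={"task_1": "cuda:0", "task_2": "cuda:1"},  # hypothetical mapping
)
```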

Additional Context

No response

zrjohnnyl avatar Mar 20 '24 05:03 zrjohnnyl

We are also trying to do this; it would really speed up our training, up to 10-fold.

mattb-zip avatar Mar 25 '24 21:03 mattb-zip

Hi @zrjohnnyl Multitask learning is no different from other models in the training loop, so separating the tasks across GPUs would be even more complex than the already very complex data-parallel approach. Flair has made some architectural decisions that make multi-GPU training hard to apply, and we haven't arrived at any solution that everyone agrees on. I don't think we will find a good solution in the near future.

helpmefindaname avatar Mar 29 '24 12:03 helpmefindaname

Hi @helpmefindaname

I'm interested in contributing to this effort if feasible. Could you offer some guidance on how to begin tackling this issue? I noticed there have been discussions about adding multi-GPU training capabilities for language model training, including some pull requests.

A straightforward first step might be to explore implementing federated averaging (FedAvg) for multi-GPU training, as outlined here: https://www.educative.io/answers/what-is-federated-averaging-fedavg

The basic idea is to partition the data into n shards and train a copy of the Flair model on each shard with a different GPU. At the end of each epoch, we gather the model weights from each GPU on a central machine, average them, and pass the averaged weights back to each GPU. Averaging gradients at every step might yield better results, but this method is much easier to implement. A rough sketch of the averaging step is below.
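A minimal sketch of the per-epoch weight averaging in plain PyTorch, not Flair API; it assumes `models` is a list of identically-structured replicas, one per GPU, and `train_one_epoch` is a placeholder for training a replica on its own data partition:

```python
# FedAvg-style weight averaging sketch (plain PyTorch, not Flair API).
import copy
import torch

def average_state_dicts(models):
    """Average the parameters of n identical-architecture replicas."""
    avg_state = copy.deepcopy(models[0].state_dict())
    for key in avg_state:
        # stack each replica's tensor for this parameter on CPU and average,
        # casting back to the original dtype (e.g. for integer buffers)
        stacked = torch.stack([m.state_dict()[key].float().cpu() for m in models])
        avg_state[key] = stacked.mean(dim=0).to(avg_state[key].dtype)
    return avg_state

def broadcast_weights(models, avg_state):
    """Load the averaged weights back into every replica on its own device."""
    for m in models:
        device = next(m.parameters()).device
        m.load_state_dict({k: v.to(device) for k, v in avg_state.items()})

# per-epoch loop (per-replica training omitted):
# for epoch in range(num_epochs):
#     for model, loader in zip(models, partition_loaders):
#         train_one_epoch(model, loader)  # placeholder training function
#     broadcast_weights(models, average_state_dicts(models))
```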

zrjohnnyl avatar Apr 03 '24 22:04 zrjohnnyl