
Ray Tune sweep does not support multi GPU

Open LouisCastricato opened this issue 2 years ago • 4 comments

🐛 Describe the bug

There is no way to use multiple GPUs when using Ray Tune; it appears we need to wrap ray.train.torch.TorchTrainer for it to work.

It appears this is what PyTorch Lightning does.

Which trlX version are you using?

trlx==0.3.0

Additional system and package information

No response

LouisCastricato avatar Nov 21 '22 16:11 LouisCastricato

@xwjiang2010 @amogkam thought you might be interested in this.

avnishn avatar Dec 05 '22 17:12 avnishn

Sorry, I'm not very familiar with this library and how it can work with Tune. A bit more detail would help me understand what the gap is here. Thank you!

xwjiang2010 avatar Dec 05 '22 18:12 xwjiang2010

Hi @xwjiang2010, thanks for dropping in! In short, we're lacking knowledge of how to do hyperparameter optimization with Tune in the way that is least invasive to our codebase. Currently we have a basic script contributed by @ayulockin which can optimize arbitrary single-threaded training functions: https://github.com/CarperAI/trlx/blob/803f8cf69899737fb34ab7762fa3942791207f08/trlx/sweep.py#L21-L31 https://github.com/CarperAI/trlx/blob/803f8cf69899737fb34ab7762fa3942791207f08/trlx/sweep.py#L106-L110 but it will break if any distributed training is attempted within them. So far we use accelerate for wrapping distributed training, but we're also planning to add a few other frameworks in the near future, so we'd have to stay lean on adding new abstractions. However, from Ray's documentation, to enable distributed hyperparameter optimization we'd have to use either the Ray Lightning plugin or a raw TorchTrainer, or maybe there are some lighter options?
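For illustration, here's a minimal sketch of the kind of single-process trainable this pattern covers (not the actual `sweep.py`; the toy loss is made up). It also shows why launching distributed training from inside it breaks: each trial is just one Ray actor process.

```python
from ray import tune
from ray.air import session


def train_fn(config):
    # Each trial runs as a single Ray actor process with whatever resources
    # Tune assigned to it (by default 1 CPU, 0 GPUs). Anything that tries to
    # spawn torch.distributed workers from in here is what breaks.
    for step in range(10):
        # Stand-in for one optimization step of a real trainer.
        loss = 1.0 / (step + 1) * config["lr"] * 1e4
        session.report({"loss": loss})


tuner = tune.Tuner(
    train_fn,
    param_space={"lr": tune.loguniform(1e-6, 1e-4)},
)
results = tuner.fit()
```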

maxreciprocate avatar Dec 06 '22 03:12 maxreciprocate

Yes, the sweep doesn't support multi-GPU; that was supposed to be added in a separate PR. However, we first need to decide on the framework/tool of choice to do so.

While developing the sweep I tried using accelerate to provide multi-GPU training, but it didn't work. I even raised an issue in the accelerate repo (#804).

Since the sweep is handled entirely by Ray Tune, we should give Ray Train a shot IMO. This is the recommended way per Ray's docs:

[screenshot of Ray's documentation recommending Ray Train for distributed training alongside Tune]

Using Ray's tune.with_resources we can allocate more than one GPU to each experiment/trial. A separate process will have to launch distributed training using the allocated resources.
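For example, a minimal sketch of what that allocation looks like (resource numbers are arbitrary; `train_fn` is a placeholder trainable, not trlX code):

```python
from ray import tune


def train_fn(config):
    # Placeholder single-process trainable. Tune reserves the GPUs below for
    # the trial, but nothing in here actually spawns the per-GPU worker
    # processes -- that separate launch mechanism is the missing piece.
    ...


trainable = tune.with_resources(train_fn, {"cpu": 8, "gpu": 2})

tuner = tune.Tuner(
    trainable,
    param_space={"lr": tune.loguniform(1e-6, 1e-4)},
)
tuner.fit()
```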

cc: @LouisCastricato @reciprocated

ayulockin avatar Dec 06 '22 04:12 ayulockin

circling back around on this.

LouisCastricato avatar Jan 05 '23 17:01 LouisCastricato

What's the plan of action?

ayulockin avatar Jan 05 '23 18:01 ayulockin

Well, if we can't get it working we'll need to abandon Ray, haha. We can probably implement something with similar features to Ray Tune for our use case tbh; it doesn't seem that difficult.

LouisCastricato avatar Jan 05 '23 18:01 LouisCastricato

It's a hard stop though; if we can't get multi-GPU/multi-node working it's an absolute no-go.

LouisCastricato avatar Jan 05 '23 18:01 LouisCastricato

Hey @reciprocated, @LouisCastricato, @ayulockin-- circling back on this thread.

As @ayulockin mentions, you can use tune.with_resources to allocate multiple GPUs per trial, but the challenge is that you need a mechanism to launch multiple processes to actually do the distributed training with the allocated resources.

Libraries like Hugging Face accelerate don't provide an easy way to do this. This is exactly where Ray Train comes in: it makes it easy to launch multiple training processes in a Pythonic fashion and it integrates natively with Tune.

You can see an example of using Huggingface accelerate with Ray Train TorchTrainer here: https://docs.ray.io/en/latest/train/examples/transformers/transformers_example.html.

Using Ray Train is probably the lightest weight and easiest way to do distributed training.

Happy to chat more with you guys over a call too.

amogkam avatar Jan 05 '23 20:01 amogkam

Hey, thanks for giving us some pointers! It seems the most that could be achieved with Ray Train would be DDP handled internally by Ray. @ayulockin Maybe a more lightweight option without restrictions would be to use wandb's own hyperparameter tuning, seeing that it supports early stopping/Hyperband (I have a PoC which works with distributed resources); I'm not sure which features from Ray Tune would be missed in that case.
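For reference, this is roughly what that approach looks like with wandb sweeps (a hedged sketch only, not the actual PoC; the project name, toy metric, and parameter ranges are made up):

```python
import wandb


def train_fn():
    run = wandb.init()
    for step in range(100):
        # Stand-in for one training step; a real run would launch the
        # distributed training itself (e.g. via accelerate) and log real metrics.
        loss = 1.0 / (step + 1) * run.config.lr * 1e4
        wandb.log({"loss": loss})


sweep_config = {
    "method": "bayes",
    "metric": {"name": "loss", "goal": "minimize"},
    "early_terminate": {"type": "hyperband", "min_iter": 3},
    "parameters": {
        "lr": {"distribution": "log_uniform_values", "min": 1e-6, "max": 1e-4},
    },
}

sweep_id = wandb.sweep(sweep_config, project="trlx-sweeps")  # project name is hypothetical
wandb.agent(sweep_id, function=train_fn, count=10)
```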

maxreciprocate avatar Jan 06 '23 12:01 maxreciprocate

Can you share the POC so that I can take a look? @reciprocated

ayulockin avatar Jan 06 '23 14:01 ayulockin

Hey @reciprocated @ayulockin, is the main gap that you want to use multi-GPU accelerate with Ray Tune, but currently that's not easy to do?

richardliaw avatar Jan 06 '23 19:01 richardliaw

Hey folks, I've created a super quick POC of Ray Train integration which you can see here: https://github.com/Yard1/trlx/tree/ray_train_integration

This POC simply wraps the training function in TorchTrainer and sets necessary environment variables for Accelerate to configure distributed training with PyTorch DDP. Please note that the Accelerate config is not yet supported (in other words, Accelerate will only do DDP right now).

I had to fix some issues with the configs and disable wandb due to some issues with it in my local environment, but there isn't anything stopping it from working with Ray. One remaining issue is that the config file loading that normally runs when the main function is imported from the example script will also be executed on remote workers, which may not have access to the config file. The solution is to load it before starting the Tune run and pass the config data as part of the configuration.
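Not the PoC code itself, but a rough sketch of the kind of wrapping it describes, assuming Ray 2.x AIR APIs (`run_training` is a hypothetical stand-in for the existing trlX training function):

```python
import os

from ray.air import session
from ray.air.config import ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker(config):
    # Expose the Ray Train worker topology through environment variables that
    # Accelerate reads, so a plain `Accelerator()` sets up PyTorch DDP across
    # the Ray workers (no Accelerate config file support yet).
    os.environ["WORLD_SIZE"] = str(session.get_world_size())
    os.environ["RANK"] = str(session.get_world_rank())
    os.environ["LOCAL_RANK"] = str(session.get_local_rank())

    run_training(config)  # hypothetical stand-in for the trlX training function


trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    train_loop_config={"lr": 1e-5},
    scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
)
result = trainer.fit()
```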

Yard1 avatar Jan 07 '23 19:01 Yard1

This is quite weird: I'm getting notifications that people are replying to this thread, but nothing shows up when I open it.

LouisCastricato avatar Jan 08 '23 14:01 LouisCastricato

Regardless, the main issue is that we need a system that does not rely on adding a new trainer. trlX uses multiple backends: NeMo, accelerate, and pure PyTorch, with plans for t5x as well.

I think for hyperparameter optimization we need something super lightweight that can sit on top of our existing trainers, regardless of which backend we're using. I'm not sure whether this is something Ray Tune can provide. We need to be able to run all of our own parallelism components on various backends in various environments.

LouisCastricato avatar Jan 08 '23 14:01 LouisCastricato

Ray Tune and Ray Train are lightweight, and they support pure PyTorch and most of accelerate's options out of the box. Support for more frameworks can be added as well. They are both framework agnostic and can run arbitrary functions. Ray Train essentially boils down to spawning a worker group on a Ray cluster with reporting and fault-tolerance logic; all of the communication and training is handled by the training framework.
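For completeness, a rough sketch (same Ray 2.x assumptions as the earlier TorchTrainer sketch) of how such a trainer plugs into Tune, so that each trial gets its own distributed worker group:

```python
from ray import tune
from ray.air.config import ScalingConfig
from ray.train.torch import TorchTrainer

# `train_loop_per_worker` is the wrapped training function from the earlier
# sketch. Tune treats the whole trainer as one trial, so every trial gets its
# own 2-GPU worker group, and sampled hyperparameters are routed to the
# workers via `train_loop_config`.
trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
)

tuner = tune.Tuner(
    trainer,
    param_space={"train_loop_config": {"lr": tune.loguniform(1e-6, 1e-4)}},
)
results = tuner.fit()
```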

Yard1 avatar Jan 08 '23 15:01 Yard1

Can we schedule a call for sometime next week? I think there have been some large miscommunications; we would love to get Ray Tune working if it satisfies our needs. cc @ayulockin @reciprocated

LouisCastricato avatar Jan 08 '23 16:01 LouisCastricato

cc @richardliaw

Yard1 avatar Jan 08 '23 16:01 Yard1

I sent you an email, let me know if you got it.

LouisCastricato avatar Jan 08 '23 16:01 LouisCastricato

Looking forward to it!


richardliaw avatar Jan 08 '23 16:01 richardliaw