trlx
Ray Tune sweep does not support multi-GPU
🐛 Describe the bug
There is no way to use multiple GPUs if you're using Ray Tune; it seems we need to wrap ray.train.torch.TorchTrainer for it to work.
This appears to be what PyTorch Lightning does.
Which trlX version are you using?
trlx==0.3.0
Additional system and package information
No response
@xwjiang2010 @amogkam thought you might be interested in this.
Sorry, I'm not very familiar with this library and how it can work with Tune. A bit more detail would help me understand what the gap is here. Thank you!
Hi @xwjiang2010, thanks for dropping by! In short, we lack knowledge of how to do hyperparameter optimization with Tune in the way that's least invasive to our codebase. Currently we have a basic script contributed by @ayulockin which can optimize arbitrary single-threaded training functions: https://github.com/CarperAI/trlx/blob/803f8cf69899737fb34ab7762fa3942791207f08/trlx/sweep.py#L21-L31 https://github.com/CarperAI/trlx/blob/803f8cf69899737fb34ab7762fa3942791207f08/trlx/sweep.py#L106-L110 but it will break if any distributed training is attempted inside those functions. So far we use accelerate for wrapping distributed training, but we're also planning to add a few other frameworks in the near future, so we'd have to stay lean on adding new abstractions. However, from Ray's documentation it seems that to enable distributed hyperparameter optimization we'd have to use either the Ray Lightning plugin or a raw TorchTrainer; or maybe there are some lighter options?
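For context, the current script boils down to the usual single-process Tune pattern; a minimal sketch (not the actual sweep.py code, and `run_single_process_training` is just a hypothetical helper) looks roughly like this:

```python
from ray import tune
from ray.air import session


def train_fn(config):
    # Run trlX training in a single process with the sampled hyperparameters
    # and report the resulting metric back to Tune.
    reward = run_single_process_training(config)  # hypothetical helper
    session.report({"reward": reward})


tuner = tune.Tuner(
    train_fn,
    param_space={"lr": tune.loguniform(1e-6, 1e-4)},
    tune_config=tune.TuneConfig(metric="reward", mode="max", num_samples=8),
)
tuner.fit()
```

Anything that tries to spawn distributed workers inside `train_fn` is where it currently breaks.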
Yes, the sweep doesn't support multi-GPU; that support was supposed to be added in a separate PR. However, we still need to decide on the framework/tool of choice to do so.
While developing the sweep I tried using accelerate to provide multi-GPU training, but it didn't work. I even raised an issue in the accelerate repo (#804).
Since the sweep is taken care of entirely by Ray Tune, we should try to give Ray Train a shot IMO. This is the recommended way as per Ray's docs:

Using Ray's tune.with_resources we can allocate more than one GPU to each experiment/trial. A separate process will then have to launch distributed training using the allocated resources.
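Something like the sketch below (illustrative only, not trlX code): Tune reserves the GPUs per trial, but the trainable still runs as a single process and is responsible for spawning the distributed workers itself.

```python
from ray import tune


def trainable(config):
    # Tune reserves 4 GPUs for this trial, but this function runs in a single
    # process; launching DDP across those GPUs is left to the trainable.
    ...


tuner = tune.Tuner(
    tune.with_resources(trainable, {"cpu": 8, "gpu": 4}),
    param_space={"lr": tune.loguniform(1e-6, 1e-4)},
)
tuner.fit()
```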
cc: @LouisCastricato @reciprocated
circling back around on this.
What's the plan of action?
Well, if we can't get it working we'll need to abandon using Ray haha. We could probably implement something with similar features to Ray Tune for our use case tbh; it doesn't seem that difficult.
It's a hard stop though: if we can't get multi-GPU/multi-node working, it's an absolute no-go.
Hey @reciprocated, @LouisCastricato, @ayulockin, circling back on this thread.
As @ayulockin mentions, you can use tune.with_resources to allocate multiple GPUs per trial, but the challenge is that you need a mechanism to launch multiple processes to actually do the distributed training with the allocated resources.
Libraries like Hugging Face accelerate don't provide an easy way to do this. This is exactly where Ray Train comes in: it makes it easy to launch multiple training processes in a Pythonic fashion and it integrates natively with Tune.
You can see an example of using Huggingface accelerate with Ray Train TorchTrainer here: https://docs.ray.io/en/latest/train/examples/transformers/transformers_example.html.
Using Ray Train is probably the lightest weight and easiest way to do distributed training.
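A rough sketch of the combination being suggested here (the training function body is a placeholder, not the actual trlX entry point): the TorchTrainer handles spawning the workers, and the trainer itself is passed to the Tuner for hyperparameter search.

```python
from ray import tune
from ray.air.config import ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker(config):
    # Each Ray Train worker runs this function; Ray sets up the torch process
    # group, so DDP (or Accelerate on top of it) can initialize here.
    ...


trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
)

tuner = tune.Tuner(
    trainer,
    param_space={"train_loop_config": {"lr": tune.loguniform(1e-6, 1e-4)}},
)
tuner.fit()
```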
Happy to chat more with you guys over a call too.
Hey, thanks for giving us some pointers! It seems like the most that could be achieved with Ray Train would be DDP handled internally by Ray. @ayulockin Maybe a more lightweight option, without those restrictions, would be to use wandb's own hyperparameter tuning, seeing that it supports early stopping/Hyperband (I have a PoC which works with distributed resources). I'm not sure which features from Ray Tune would be missed in that case.
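Not the PoC itself, but roughly what a wandb sweep with Hyperband-style early termination looks like (project name, metric, and search space below are made up):

```python
import wandb

sweep_config = {
    "method": "bayes",
    "metric": {"name": "reward", "goal": "maximize"},
    "parameters": {
        "lr": {"distribution": "log_uniform_values", "min": 1e-6, "max": 1e-4},
    },
    "early_terminate": {"type": "hyperband", "min_iter": 3},
}


def train():
    run = wandb.init()
    # Launch (possibly distributed) trlX training with run.config here and
    # log the metric the sweep optimizes, e.g. wandb.log({"reward": ...}).


sweep_id = wandb.sweep(sweep_config, project="trlx-sweeps")
wandb.agent(sweep_id, function=train, count=20)
```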
Can you share the POC so that I can take a look? @reciprocated
Hey @reciprocated @ayulockin, is the main gap that you want to use multi-GPU accelerate with Ray Tune, but currently that's not easy to do?
Hey folks, I've created a super quick POC of Ray Train integration which you can see here: https://github.com/Yard1/trlx/tree/ray_train_integration
This POC simply wraps the training function in TorchTrainer and sets the necessary environment variables for Accelerate to configure distributed training with PyTorch DDP. Please note that the Accelerate config is not yet supported (in other words, Accelerate will only do DDP right now).
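Roughly what that looks like per worker, simplified (the function below would be wrapped in a TorchTrainer as in the earlier sketch; the exact variable handling in the POC may differ):

```python
import os

from ray.air import session


def train_loop_per_worker(config):
    # Ray Train has already set up torch.distributed for this worker; exposing the
    # rank/world-size information through the env vars Accelerate reads lets a
    # plain `Accelerator()` pick up multi-GPU DDP without an Accelerate config file.
    os.environ.setdefault("RANK", str(session.get_world_rank()))
    os.environ.setdefault("LOCAL_RANK", str(session.get_local_rank()))
    os.environ.setdefault("WORLD_SIZE", str(session.get_world_size()))
    # ... call the existing trlX training entry point here ...
```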
I had to fix some issues with configs and disable wandb due to some issues with it in my local environment, but there isn't anything stopping it from working with Ray. Furthermore, one issue is that the config file loading that normally happens when the main function is imported from the example script will also be executed on remote workers, which may not have access to the config file. The solution is to load it before starting the Tune run and pass the config data as part of the configuration.
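A small illustration of that workaround (the file path and `run_training` entry point are placeholders): parse the YAML once on the driver, where the file exists, and ship the resulting dict to the workers through the Tune config instead of re-reading it there.

```python
import yaml
from ray import tune

# Read the config once on the driver, where the file is available...
with open("configs/ppo_config.yml") as f:  # placeholder path
    trlx_config = yaml.safe_load(f)


def trainable(config):
    # ...so remote workers receive the already-parsed dict and never touch the file.
    run_training(config["trlx_config"], lr=config["lr"])  # hypothetical entry point


tuner = tune.Tuner(
    tune.with_resources(trainable, {"gpu": 1}),
    param_space={"trlx_config": trlx_config, "lr": tune.loguniform(1e-6, 1e-4)},
)
tuner.fit()
```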
This is quite weird: I'm getting notifications that people are replying to this thread, but nothing shows up when I open it.
Regardless, the main issue is that we need a system that does not rely on adding a new trainer. trlX uses multiple backends: NeMo, accelerate, and pure PyTorch, with plans for t5x as well.
I think for hyperparameter optimization we need something super lightweight that can sit on top of our existing trainers, regardless of which backend we're using. I'm not sure this is something Ray Tune can provide? We need to be able to run all of our own parallelism components on various backends in various environments.
Ray Tune and Ray Train are lightweight, and they support pure PyTorch and most accelerate options out of the box. Support for more frameworks can be added as well. They are both framework-agnostic and can run arbitrary functions. Ray Train essentially boils down to spawning a worker group on a Ray cluster with reporting and fault-tolerance logic; all of the communication and training is handled by the training framework.
Can we schedule a call for sometime next week? I think there have been some large miscommunications, we would love to get Ray Tune working if it satisfies our needs. cc @ayulockin @reciprocated
cc @richardliaw
I sent you an email, let me know if you got it.
Looking forward to it!