[AIR] [Train] train multiple instances simultaneously on machines with specified tags
What happened + What you expected to happen
-
I want to train multiple instances simultaneously and place each trainer on machines that carry a specified tag (custom resource). However, a trainer that requests the "machine_for_GPU" resource can end up running on a machine tagged "machine_for_CPU" in the Ray cluster, and likewise a trainer that requests "machine_for_CPU" can end up on a machine tagged "machine_for_GPU".
-
The expectation is that every trainer requesting the "machine_for_GPU" tag trains only on machines with the "machine_for_GPU" resource, and every trainer requesting the "machine_for_CPU" tag trains only on machines with the "machine_for_CPU" resource.
The ray status output for the cluster was attached here.
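For context, each node in the cluster advertises its tag as a Ray custom resource. The snippet below is a minimal sketch (the ray start commands and the hostname check are assumptions about the setup, not the actual cluster configuration) showing how such a tag normally constrains the placement of a plain Ray task; the problem reported here is that the equivalent resources_per_worker request for a Trainer's workers is not always honored.

# Minimal placement sketch (assumed node setup):
#   ray start --address=<head-address> --resources='{"machine_for_GPU": 1}'
#   ray start --address=<head-address> --resources='{"machine_for_CPU": 1}'
import socket

import ray

ray.init(address="auto")

@ray.remote(num_cpus=1, resources={"machine_for_GPU": 1})
def where_am_i():
    # This task can only be scheduled on a node that advertises "machine_for_GPU".
    return socket.gethostname()

print(ray.get(where_am_i.remote()))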
Versions / Dependencies
ray==2.3.1 python==3.8.16
Reproduction script
import ray
from ray import train
from ray.train.torch import TorchTrainer
from ray.air.config import ScalingConfig
from ray.air.config import RunConfig
from ray.air.config import CheckpointConfig
import sys


@ray.remote(num_cpus=0)
def ray_remote_training(use_gpu, resources_per_worker):
    task = {}
    task['use_gpu'] = use_gpu
    scaling_config = ScalingConfig(
        num_workers=1,
        use_gpu=use_gpu,
        _max_cpu_fraction_per_node=0.8,
        resources_per_worker=resources_per_worker,
    )
    run_config = RunConfig(
        checkpoint_config=CheckpointConfig(num_to_keep=1),
    )
    # trainer
    trainer = TorchTrainer(
        train_loop_per_worker=train_func,
        train_loop_config=task,
        scaling_config=scaling_config,
        run_config=run_config,
    )
    # Redirect stdout away during fit(); point it at stderr afterwards.
    sys.stdout = None
    trainer.fit()
    sys.stdout = sys.stderr


def train_func(config):
    device = train.torch.get_device()
    print("use GPU: {} device: {}".format(config['use_gpu'], device))


def main():
    # Launch three GPU-tagged trainers and seven CPU-tagged trainers concurrently.
    for i in range(10):
        if i <= 2:
            use_gpu = True
            resources_per_worker = {"machine_for_GPU": 1}
        else:
            use_gpu = False
            resources_per_worker = {"machine_for_CPU": 1}
        ray_remote_training.remote(use_gpu, resources_per_worker)
    # Keep the driver alive so the remote training tasks can run.
    input()


if __name__ == '__main__':
    main()
Issue Severity
High: It blocks me from completing my task.
I traced through the code. There is a limitation in the current Ray Tune implementation: tune.with_parameters() uses a cluster-wide, per-job registry for storing and retrieving parameters.
This causes the additional_resources_per_worker parameters from different Tuners to overwrite each other.
Can I ask about your use case? Why do you want to launch multiple Trainer runs yourself? Why not simply use Ray Tune to tune the ScalingConfig?
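For reference, here is a hedged sketch of that alternative, assuming the Ray 2.3 Tuner API and assuming that whole ScalingConfig objects can be grid-searched through param_space (the train_func below stands in for the training loop from the reproduction script):

from ray import train, tune
from ray.air.config import ScalingConfig
from ray.train.torch import TorchTrainer
from ray.tune import Tuner

def train_func(config):
    # Stand-in for the training loop from the reproduction script above.
    device = train.torch.get_device()
    print("device: {}".format(device))

# Default scaling_config; each Tune trial overrides it via param_space.
trainer = TorchTrainer(
    train_loop_per_worker=train_func,
    scaling_config=ScalingConfig(num_workers=1),
)

tuner = Tuner(
    trainer,
    param_space={
        "scaling_config": tune.grid_search([
            ScalingConfig(num_workers=1, use_gpu=True,
                          resources_per_worker={"machine_for_GPU": 1}),
            ScalingConfig(num_workers=1, use_gpu=False,
                          resources_per_worker={"machine_for_CPU": 1}),
        ]),
    },
)
results = tuner.fit()

With this pattern a single Tune driver schedules all trials, rather than launching separate Trainer runs from detached Ray tasks.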
This P2 issue has seen no activity in the past 2 years. It will be closed in 2 weeks as part of ongoing cleanup efforts.
Please comment and remove the pending-cleanup label if you believe this issue should remain open.
Thanks for contributing to Ray!