
[AIR] [Train] train multiple instances simultaneously on machines with specified tags

Open a123tony39 opened this issue 2 years ago • 2 comments

What happened + What you expected to happen

  1. I want to train multiple instances simultaneously and place each trainer on machines carrying a specific custom resource tag. However, a trainer that requests the "machine_for_GPU" resource sometimes runs on a machine tagged "machine_for_CPU" in the Ray cluster, and likewise a trainer that requests "machine_for_CPU" sometimes runs on a machine tagged "machine_for_GPU".

  2. The expectation is that trainers requesting the "machine_for_GPU" resource only train on machines tagged "machine_for_GPU", and trainers requesting "machine_for_CPU" only train on machines tagged "machine_for_CPU" (a sketch of the assumed cluster setup follows below).
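
For reference, here is a minimal sketch of the kind of cluster setup assumed here. The node address and resource quantities are illustrative, not the actual cluster shown in the screenshots; each node simply advertises the custom resource tag when it joins the cluster:

# Illustrative only: each node advertises a custom resource tag at startup, e.g.
#
#   ray start --head --resources='{"machine_for_CPU": 100}'
#   ray start --address=<head-node-ip>:6379 --resources='{"machine_for_GPU": 100}'
#
# For a quick single-node test, the same tags can be declared via ray.init().
# The quantity controls how many workers requesting 1 unit can share a node.
import ray

ray.init(resources={"machine_for_GPU": 100, "machine_for_CPU": 100})
print(ray.cluster_resources())  # the custom tags appear alongside CPU/GPU/memory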

[Screenshot: ray status output]

[Screenshot: result]

Versions / Dependencies

ray==2.3.1 python==3.8.16

Reproduction script

import sys

import ray
from ray import train
from ray.air.config import CheckpointConfig, RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer


def train_func(config):
    # Report which device each worker actually landed on.
    device = train.torch.get_device()
    print("use GPU: {} device: {}".format(config["use_gpu"], device))


@ray.remote(num_cpus=0)
def ray_remote_training(use_gpu, resources_per_worker):
    # Wrapper task that launches one TorchTrainer; num_cpus=0 so the wrapper
    # itself does not consume cluster resources.
    task = {"use_gpu": use_gpu}
    scaling_config = ScalingConfig(
        num_workers=1,
        use_gpu=use_gpu,
        _max_cpu_fraction_per_node=0.8,
        resources_per_worker=resources_per_worker,
    )
    run_config = RunConfig(
        checkpoint_config=CheckpointConfig(num_to_keep=1),
    )
    trainer = TorchTrainer(
        train_loop_per_worker=train_func,
        train_loop_config=task,
        scaling_config=scaling_config,
        run_config=run_config,
    )
    # Silence the trainer's console output while it runs, then restore stdout.
    original_stdout = sys.stdout
    sys.stdout = None
    trainer.fit()
    sys.stdout = original_stdout


def main():
    # Launch 10 concurrent trainer runs: the first three request the
    # "machine_for_GPU" custom resource, the rest request "machine_for_CPU".
    for i in range(10):
        if i <= 2:
            use_gpu = True
            resources_per_worker = {"machine_for_GPU": 1}
        else:
            use_gpu = False
            resources_per_worker = {"machine_for_CPU": 1}
        ray_remote_training.remote(use_gpu, resources_per_worker)
    # Keep the driver alive so the background training tasks can finish.
    input()


if __name__ == "__main__":
    main()

Issue Severity

High: It blocks me from completing my task.

a123tony39 · May 15 '23 13:05

I traced through the code. There is a limitation in the current Ray Tune implementation: tune.with_parameters() uses a cluster-wide, per-job registry for storing and retrieving parameters, which causes the additional_resources_per_worker values registered by different Tuners to overwrite each other.
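
To illustrate the failure mode with a toy sketch (this is not Ray's actual registry code): when concurrent runs write their parameters into a single job-wide store under the same key, the last writer wins.

# Toy illustration only, not Ray's implementation: a single job-wide registry
# keyed by parameter name lets concurrent runs overwrite each other's values.
_registry = {}

def put_params(key, value):
    _registry[key] = value

def get_params(key):
    return _registry[key]

# Run A (GPU trainer) registers its per-worker resources...
put_params("additional_resources_per_worker", {"machine_for_GPU": 1})
# ...then run B (CPU trainer), launched concurrently, writes to the same slot.
put_params("additional_resources_per_worker", {"machine_for_CPU": 1})

# When run A's workers are finally placed, they read back run B's value.
print(get_params("additional_resources_per_worker"))  # {'machine_for_CPU': 1}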

Can I ask about your use case? Why do you want to launch multiple Trainer runs yourself? Why not simply use Ray Tune to tune the ScalingConfig?
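
Roughly along these lines; this is only a sketch, and whether ScalingConfig overrides can be swept through param_space like this may depend on the Ray version you are running:

from ray import tune
from ray.air.config import ScalingConfig
from ray.train.torch import TorchTrainer
from ray.tune import TuneConfig, Tuner


def train_func(config):
    print("running one trial", config)


trainer = TorchTrainer(train_loop_per_worker=train_func)

# One Tuner, many trials: each trial overrides the trainer's scaling_config,
# so the GPU- and CPU-tagged runs are scheduled by Tune rather than by
# hand-launched remote tasks.
tuner = Tuner(
    trainer,
    param_space={
        "scaling_config": tune.grid_search([
            ScalingConfig(num_workers=1, use_gpu=True,
                          resources_per_worker={"machine_for_GPU": 1}),
            ScalingConfig(num_workers=1, use_gpu=False,
                          resources_per_worker={"machine_for_CPU": 1}),
        ]),
    },
    tune_config=TuneConfig(max_concurrent_trials=10),
)
results = tuner.fit()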

gjoliver · May 19 '23 14:05

This P2 issue has seen no activity in the past 2 years. It will be closed in 2 weeks as part of ongoing cleanup efforts.

Please comment and remove the pending-cleanup label if you believe this issue should remain open.

Thanks for contributing to Ray!

cszhu · Jun 17 '25 00:06