[AIR] [Train] train multiple instances simultaneously on machines with specified tags
What happened + What you expected to happen
-
I want to train multiple instances simultaneously and place each trainer on machines that carry a specified tag (custom resource). However, a trainer that requests the "machine_for_GPU" resource can end up running on a machine tagged "machine_for_CPU" in the Ray cluster, and likewise a trainer that requests "machine_for_CPU" can end up on a machine tagged "machine_for_GPU".
-
The expectation is that every trainer requesting the "machine_for_GPU" tag trains only on machines with the "machine_for_GPU" resource, and every trainer requesting the "machine_for_CPU" tag trains only on machines with the "machine_for_CPU" resource.
The ray status output for the cluster was attached here.
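For context, each node in the cluster advertises its tag as a Ray custom resource. The snippet below is a minimal sketch (the ray start commands and the hostname check are assumptions about the setup, not the actual cluster configuration) showing how such a tag normally constrains the placement of a plain Ray task; the problem reported here is that the equivalent resources_per_worker request for a Trainer's workers is not always honored.

# Minimal placement sketch (assumed node setup):
#   ray start --address=<head-address> --resources='{"machine_for_GPU": 1}'
#   ray start --address=<head-address> --resources='{"machine_for_CPU": 1}'
import socket

import ray

ray.init(address="auto")

@ray.remote(num_cpus=1, resources={"machine_for_GPU": 1})
def where_am_i():
    # This task can only be scheduled on a node that advertises "machine_for_GPU".
    return socket.gethostname()

print(ray.get(where_am_i.remote()))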
Versions / Dependencies
ray==2.3.1 python==3.8.16
Reproduction script
import ray
from ray import train
from ray.train.torch import TorchTrainer
from ray.air.config import ScalingConfig
from ray.air.config import RunConfig
from ray.air.config import CheckpointConfig
import sys


@ray.remote(num_cpus=0)
def ray_remote_training(use_gpu, resources_per_worker):
    task = {}
    task['use_gpu'] = use_gpu
    scaling_config = ScalingConfig(
        num_workers=1,
        use_gpu=use_gpu,
        _max_cpu_fraction_per_node=0.8,
        resources_per_worker=resources_per_worker,
    )
    run_config = RunConfig(
        checkpoint_config=CheckpointConfig(num_to_keep=1),
    )
    # trainer
    trainer = TorchTrainer(
        train_loop_per_worker=train_func,
        train_loop_config=task,
        scaling_config=scaling_config,
        run_config=run_config,
    )
    # Redirect stdout away during fit(); point it at stderr afterwards.
    sys.stdout = None
    trainer.fit()
    sys.stdout = sys.stderr


def train_func(config):
    device = train.torch.get_device()
    print("use GPU: {} device: {}".format(config['use_gpu'], device))


def main():
    # Launch three GPU-tagged trainers and seven CPU-tagged trainers concurrently.
    for i in range(10):
        if i <= 2:
            use_gpu = True
            resources_per_worker = {"machine_for_GPU": 1}
        else:
            use_gpu = False
            resources_per_worker = {"machine_for_CPU": 1}
        ray_remote_training.remote(use_gpu, resources_per_worker)
    # Keep the driver alive so the remote training tasks can run.
    input()


if __name__ == '__main__':
    main()
Issue Severity
High: It blocks me from completing my task.
I traced through the code. There is a limitation in the current Ray Tune implementation: tune.with_parameters() uses a cluster-wide, per-job registry for storing and retrieving parameters.
This causes the additional_resources_per_worker parameters from different Tuners to overwrite each other.
Can I ask about your use case? Why do you want to launch multiple Trainer runs yourself? Why not simply use Ray Tune to tune the ScalingConfig?
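For reference, here is a hedged sketch of that alternative, assuming the Ray 2.3 Tuner API and assuming that whole ScalingConfig objects can be grid-searched through param_space (the train_func below stands in for the training loop from the reproduction script):

from ray import train, tune
from ray.air.config import ScalingConfig
from ray.train.torch import TorchTrainer
from ray.tune import Tuner

def train_func(config):
    # Stand-in for the training loop from the reproduction script above.
    device = train.torch.get_device()
    print("device: {}".format(device))

# Default scaling_config; each Tune trial overrides it via param_space.
trainer = TorchTrainer(
    train_loop_per_worker=train_func,
    scaling_config=ScalingConfig(num_workers=1),
)

tuner = Tuner(
    trainer,
    param_space={
        "scaling_config": tune.grid_search([
            ScalingConfig(num_workers=1, use_gpu=True,
                          resources_per_worker={"machine_for_GPU": 1}),
            ScalingConfig(num_workers=1, use_gpu=False,
                          resources_per_worker={"machine_for_CPU": 1}),
        ]),
    },
)
results = tuner.fit()

With this pattern a single Tune driver schedules all trials, rather than launching separate Trainer runs from detached Ray tasks.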
This P2 issue has seen no activity in the past 2 years. It will be closed in 2 weeks as part of ongoing cleanup efforts.
Please comment and remove the pending-cleanup label if you believe this issue should remain open.
Thanks for contributing to Ray!