Use ACME with Ray Tune ASHAScheduler
Hi,
I have run into an issue when running Ray Tune experiments together with ACME distributed SAC training and the ASHA scheduler. The idea behind the ASHA scheduler is that it terminates trials early in order to find good hyperparameters. When the single-process experiment is used, the job does not leave any hanging processes after Ray terminates a trial. Here is an example of how the job is started:
experiments.run_experiment(
    experiment=experiment,
    eval_every=1_000,
    num_eval_episodes=1)
By contrast, when the job is started in multi-processing mode, the termination of the trial by Ray does not affect the processes that Launchpad has spawned. So what happens is that the ACME training job keeps running instead of being terminated. Here is how the job function and the Ray Tune config are defined:
import multiprocessing as mp

import launchpad as lp
from acme.jax import experiments  # imports shown for completeness


def create_and_run_program(config):
    # build_experiment_config is my own helper that builds the SAC experiment.
    experiment = build_experiment_config(config)
    program = experiments.make_distributed_experiment(
        experiment=experiment,
        num_actors=1,
    )
    lp.launch(program, lp.LaunchType.LOCAL_MULTI_PROCESSING)


def train_function(config):
    # Run the Launchpad program in a separate process.
    p = mp.Process(
        target=create_and_run_program,
        args=(config,))
    p.start()
    p.join()  # this blocks until the process terminates
from ray import tune
from ray.tune.schedulers import ASHAScheduler

trainable_with_cpu_gpu = tune.with_resources(train_function, {"cpu": 16, "gpu": 1})

asha_scheduler = ASHAScheduler(
    time_attr="training_iteration",
    max_t=100_000,
    grace_period=20_000,
    reduction_factor=4,
    brackets=1,
)
tuner = tune.Tuner(
    trainable_with_cpu_gpu,
    tune_config=tune.TuneConfig(
        scheduler=asha_scheduler,
        metric="rewards_episode",
        mode="max",
        num_samples=100,
        reuse_actors=False,
    ),
    param_space=config_space,
)
results = tuner.fit()
My question: is there a way to forward the termination signal from Ray (when it terminates a trial) to all of the node processes?
Thank you in advance.