ray_lightning
`ray_ddp` GPU issue
ray::ImplicitFunc.train() (pid=27359, ip=172.31.59.24, repr=_inner_train)
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/trainable.py", line 360, in train
result = self.step()
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/function_runner.py", line 404, in step
self._report_thread_runner_error(block=True)
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/function_runner.py", line 574, in _report_thread_runner_error
raise e
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/function_runner.py", line 277, in run
self._entrypoint()
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/function_runner.py", line 349, in entrypoint
return self._trainable_func(
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/function_runner.py", line 645, in _trainable_func
output = fn()
File "test_tune.py", line 37, in _inner_train
trainer.fit(model)
File "/home/ray/anaconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in fit
self._call_and_handle_interrupt(
File "/home/ray/anaconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
File "/home/ray/default/ray_lightning/ray_lightning/launchers/ray_launcher.py", line 62, in launch
ray_output = self.run_function_on_workers(
File "/home/ray/default/ray_lightning/ray_lightning/launchers/ray_launcher.py", line 224, in run_function_on_workers
results = process_results(self._futures, self.tune_queue)
File "/home/ray/default/ray_lightning/ray_lightning/util.py", line 62, in process_results
ray.get(ready)
ray.exceptions.RayTaskError(RuntimeError): ray::RayExecutor.execute() (pid=27475, ip=172.31.59.24, repr=<ray_lightning.launchers.ray_launcher.RayExecutor object at 0x7f2c3c105610>)
File "/home/ray/default/ray_lightning/ray_lightning/launchers/ray_launcher.py", line 356, in execute
return fn(*args, **kwargs)
File "/home/ray/default/ray_lightning/ray_lightning/launchers/ray_launcher.py", line 256, in _wrapping_function
self._strategy.set_cuda_device_if_used()
File "/home/ray/default/ray_lightning/ray_lightning/ray_ddp.py", line 233, in set_cuda_device_if_used
torch.cuda.set_device(self.root_device)
File "/home/ray/anaconda3/lib/python3.8/site-packages/torch/cuda/__init__.py", line 264, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
The run fails with `CUDA error: invalid device ordinal`.
from ray import tune
from ray_lightning import RayStrategy
from ray_lightning.tune import TuneReportCallback, get_tune_resources


# train_func is defined elsewhere in test_tune.py (not shown here); it
# presumably returns the _inner_train function seen in the traceback.
def tune_test(dir, strategy):
    callbacks = [TuneReportCallback(on="validation_end")]
    analysis = tune.run(
        train_func(dir, strategy, callbacks=callbacks),
        config={"max_epochs": tune.choice([1, 2, 3])},
        resources_per_trial=get_tune_resources(
            num_workers=strategy.num_workers, use_gpu=strategy.use_gpu),
        num_samples=2)
    assert all(analysis.results_df["training_iteration"] ==
               analysis.results_df["config.max_epochs"])


def test_tune_iteration_ddp():
    """Tests if each RayStrategy runs the correct number of iterations."""
    tmpdir = './'
    strategy = RayStrategy(num_workers=2, use_gpu=True)
    tune_test(tmpdir, strategy)
This is the code that reproduces the error.
https://github.com/Lightning-AI/lightning/issues/2407
It seems to be a GPU ID issue: `torch.cuda.set_device(self.root_device)` is called with a device index that is not valid inside the worker process.
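One way to narrow this down is to check, inside each worker, whether the device index about to be set fits within the GPUs that process can actually see. Ray sets `CUDA_VISIBLE_DEVICES` for workers that request GPUs, and CUDA renumbers the visible devices from 0, so a worker that sees a single GPU only accepts index 0. The sketch below is a hypothetical diagnostic, not part of `ray_lightning`: the helper names `visible_gpu_ids` and `is_valid_ordinal` are made up for illustration, and it assumes the mismatch comes from `CUDA_VISIBLE_DEVICES`, which the traceback alone does not confirm.

```python
import os
from typing import List, Mapping, Optional


def visible_gpu_ids(env: Optional[Mapping[str, str]] = None) -> Optional[List[str]]:
    """Return the entries of CUDA_VISIBLE_DEVICES, or None if it is unset.

    Hypothetical helper for debugging; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    return [item.strip() for item in raw.split(",") if item.strip()]


def is_valid_ordinal(index: int,
                     env: Optional[Mapping[str, str]] = None) -> Optional[bool]:
    """Check whether a local device index fits the visible GPU list.

    Returns None when CUDA_VISIBLE_DEVICES is unset, because the number of
    devices cannot be known from the environment alone in that case.
    """
    ids = visible_gpu_ids(env)
    if ids is None:
        return None
    # CUDA renumbers visible devices from 0, so valid indices are 0..len-1.
    return 0 <= index < len(ids)


if __name__ == "__main__":
    # Example: a worker that was given only physical GPU 1.
    worker_env = {"CUDA_VISIBLE_DEVICES": "1"}
    print(visible_gpu_ids(worker_env))      # the single visible entry
    print(is_valid_ordinal(0, worker_env))  # index 0 maps to that GPU
    print(is_valid_ordinal(1, worker_env))  # index 1 does not exist here
```

Logging the result of `is_valid_ordinal` next to the index passed to `torch.cuda.set_device` in each worker would show whether the strategy is using a global GPU ID where a process-local one is expected.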