ray_lightning

[Tune] Run rank 0 worker in main process when used with Tune

Open amogkam opened this issue 2 years ago • 0 comments

Running Ray Lightning with Tune has caused confusion about how resources are handled (https://github.com/ray-project/ray_lightning/issues/138, https://github.com/ray-project/ray_lightning/issues/23).

Currently, the Tune trainable process does not do any training and does not reserve any GPUs. However, since PyTorch Lightning does not support heterogeneous setups (a CPU-only driver with GPU workers), certain "GPU only" functionality is not available when Ray Lightning is used with Tune (https://github.com/ray-project/ray_lightning/issues/127, https://github.com/ray-project/ray_lightning/issues/99).

One solution is to reserve an entire GPU for the trainable and have the trainable itself perform training as the rank 0 worker (similar to local-mode data-parallel training).
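To make the difference in resource accounting concrete, here is a rough sketch of the two layouts as plain lists of resource bundles, where the first bundle belongs to the trainable. The helper names `current_bundles` and `proposed_bundles` are hypothetical and not part of ray_lightning or Tune, and the 1 CPU reserved for the trainable in the current layout is an assumption.

```python
from typing import Dict, List

Bundle = Dict[str, int]


def _worker_bundle(cpus_per_worker: int, use_gpu: bool) -> Bundle:
    # Resources requested by a single training worker.
    bundle = {"CPU": cpus_per_worker}
    if use_gpu:
        bundle["GPU"] = 1
    return bundle


def current_bundles(num_workers: int, cpus_per_worker: int = 1,
                    use_gpu: bool = True) -> List[Bundle]:
    """Current layout: the trainable does no training and holds no GPU."""
    head = {"CPU": 1}  # assumption: the trainable reserves 1 CPU only
    workers = [_worker_bundle(cpus_per_worker, use_gpu)
               for _ in range(num_workers)]
    return [head] + workers


def proposed_bundles(num_workers: int, cpus_per_worker: int = 1,
                     use_gpu: bool = True) -> List[Bundle]:
    """Proposed layout: the trainable is rank 0 and holds a full worker bundle."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    # Bundle 0 is the trainable acting as rank 0; the remaining
    # num_workers - 1 bundles are for the remote workers.
    return [_worker_bundle(cpus_per_worker, use_gpu)
            for _ in range(num_workers)]
```

With `num_workers=2`, the current layout has three bundles with a CPU-only trainable, while the proposed layout has two bundles and the trainable owns a GPU, so the driver and the workers are no longer heterogeneous.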

amogkam, Apr 18 '22 20:04