ray_lightning
PyTorch Lightning Distributed Accelerators using Ray
```python
/home/ray/anaconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/callback_connector.py:151: LightningDeprecationWarning: Setting `Trainer(checkpoint_callback=True)` is deprecated in v1.5 and will be removed in v1.7. Please consider using `Trainer(enable_checkpointing=True)`.
```
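The warning itself names the fix: pass `enable_checkpointing=True` instead of the deprecated `checkpoint_callback=True` when building the `Trainer`. A minimal sketch, assuming the `RayPlugin` entry point described in this repo's README; the worker count is illustrative:

```python
import pytorch_lightning as pl
from ray_lightning import RayPlugin  # plugin entry point from the README

# Deprecated in PTL 1.5, removed in 1.7:
# trainer = pl.Trainer(checkpoint_callback=True, plugins=[RayPlugin(num_workers=4)])

# Preferred replacement:
trainer = pl.Trainer(
    enable_checkpointing=True,
    plugins=[RayPlugin(num_workers=4, use_gpu=False)],
)
```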
```python
(BaseHorovodWorker pid=379, ip=172.31.46.122) Missing logger folder: /home/ray/default/ray_lightning/ray_lightning/tests/lightning_logs
```
Using the flags to install Horovod, I ran into the following issue:
```shell
(tensorflow2_p38) ubuntu@ip-10-0-2-36:~/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/ray_lightning$ HOROVOD_WITH_TENSORFLOW=1 HOROVOD_WITH_TORCH=1 HOROVOD_WITH_GLOO=1 pip install --no-cache-dir horovod[tensorflow] horovod[ray] horovod[torch]
Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Collecting...
```
Currently the master branch supports PTL 1.5. What is the plan and timeline regarding PTL 1.6? Also, we want to use distributed HPO with each trial itself being distributed, and found...
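For the distributed-HPO-with-distributed-trials use case, the usual pattern is to launch each Tune trial as a Ray Lightning run and reserve the per-trial worker bundles up front. A rough sketch, assuming the `RayPlugin`, `TuneReportCallback`, and `get_tune_resources` helpers described in this repo's README; `MyLightningModule` and the search space are placeholders:

```python
import pytorch_lightning as pl
from ray import tune
from ray_lightning import RayPlugin
from ray_lightning.tune import TuneReportCallback, get_tune_resources

def train_fn(config):
    model = MyLightningModule(lr=config["lr"])  # placeholder LightningModule
    trainer = pl.Trainer(
        max_epochs=4,
        # Report the metric Tune should optimize after each validation run.
        callbacks=[TuneReportCallback({"loss": "val_loss"}, on="validation_end")],
        # Each trial spawns its own group of Ray training workers.
        plugins=[RayPlugin(num_workers=2, use_gpu=True)],
    )
    trainer.fit(model)

analysis = tune.run(
    train_fn,
    config={"lr": tune.loguniform(1e-4, 1e-1)},
    num_samples=8,
    # Reserve the bundles the plugin's workers will need, so trials
    # don't oversubscribe the cluster.
    resources_per_trial=get_tune_resources(num_workers=2, use_gpu=True),
)
```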
When I use the Ray Lightning plugin for distributed training, I see two wandb experiments created: one that never logs anything (but has configs that were updated before calling `pl.Trainer.fit`),...
When using PBT/PB2, I received the following error:
```shell
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
```
This...
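One common source of this particular error (not necessarily the cause here) is restoring a checkpoint that was saved on GPU into a model that still lives on CPU, which PBT/PB2 does every time it exploits a trial. Mapping tensors to a single device at load time avoids the mismatch; the model and checkpoint path below are placeholders:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # stand-in for the real LightningModule

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# map_location forces every restored tensor onto one device, so GPU-saved
# weights don't end up mixed with CPU tensors after an exploit step.
state = torch.load("checkpoint.pt", map_location=device)
model.load_state_dict(state)
model.to(device)
```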
### Search before asking

- [X] I had searched in the [issues](https://github.com/ray-project/ray/issues) and found no similar feature requirement.

### Description

Hello and happy that you show integration with PL! :tada:...
When training on GPU, I see the following warning:
```shell
2022-05-03 00:02:20,033 WARNING tune.py:637 -- Tune detects GPUs, but no trials are using GPUs. To enable trials to use GPUs,...
```
Running Ray Lightning with Tune has led to recurring confusion about how resources are handled (https://github.com/ray-project/ray_lightning/issues/138, https://github.com/ray-project/ray_lightning/issues/23). Currently, the Tune trainable process does not do any training and does not...
I tried to use Ray Lightning + Ray Tune to do distributed HPO and found that GPUs are not available within the trial even when I set `use_gpu` to True...
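This matches the resource model sketched in the previous issue: with Ray Lightning the Tune trainable itself is typically not assigned a GPU bundle, only the Ray workers launched by the plugin are, so checks inside the trial function can report no CUDA device even though training runs on GPU. A quick diagnostic sketch; the remote tasks below are illustrative, not part of the library:

```python
import ray
import torch

@ray.remote(num_gpus=1)
def gpu_worker():
    # A task that was assigned a GPU bundle sees it via Ray and CUDA.
    return ray.get_gpu_ids(), torch.cuda.is_available()

@ray.remote(num_gpus=0)
def cpu_worker():
    # A task with no GPU bundle typically has CUDA_VISIBLE_DEVICES restricted
    # by Ray, which is the situation the Tune trainable is in here.
    return ray.get_gpu_ids(), torch.cuda.is_available()

ray.init()
print("GPU worker:", ray.get(gpu_worker.remote()))  # e.g. ([0], True)
print("CPU worker:", ray.get(cpu_worker.remote()))  # e.g. ([], False)
```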