ray_lightning

PyTorch Lightning Distributed Accelerators using Ray

Results 66 ray_lightning issues

I am using ray_lightning to distribute training across an 8-node Ray cluster with GPUs. I am seeing training performance slow down significantly (by a factor of 2-3) when...

```python
def get_trainer(dir,
                plugins: List[PLUGIN_INPUT],
                max_epochs: int = 1000,
                limit_train_batches: int = 10,
                limit_val_batches: int = 10,
                callbacks: Optional[List[Callback]] = None,
                checkpoint_callback: bool = True,
                **trainer_kwargs) -> Trainer:
    """Returns a...
```

The environment requirements:

```python
(base) ray@ip-172-31-36-78:~/horovod-gpu/ray_lightning/ray_lightning/examples$ pip list | grep lightning
lightning-bolts     0.4.0
pytorch-lightning   1.5.4
ray-lightning       0.2.0
```

The GPU environment is:

```python
Thu Jul 14 13:22:18 2022
+-----------------------------------------------------------------------------+
|...
```

Suspicion: this is probably an optimizer issue. Optimizers such as Adam store first- and second-moment estimates of the gradient; could that state get messed up in the process? Also, if...
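To make the suspicion about optimizer state concrete, here is a minimal pure-Python sketch of an Adam update for a single scalar parameter. It is not ray_lightning or PyTorch code: the names `adam_step`, `fresh_state`, and `run` are hypothetical, and the hyperparameters are the commonly used Adam defaults. It only illustrates that Adam keeps running moment estimates per parameter, and that losing this state partway through training changes the subsequent updates.

```python
import math


def fresh_state():
    # Per-parameter Adam state: step count, first moment (m), second moment (v).
    return {"step": 0, "m": 0.0, "v": 0.0}


def adam_step(param, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a scalar parameter; mutates `state` in place."""
    state["step"] += 1
    t = state["step"]
    # Exponential moving averages of the gradient and the squared gradient.
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    # Bias correction for the zero-initialised averages.
    m_hat = state["m"] / (1 - beta1 ** t)
    v_hat = state["v"] / (1 - beta2 ** t)
    return param - lr * m_hat / (math.sqrt(v_hat) + eps)


def run(grads, reset_at=None):
    """Run Adam over a gradient sequence, optionally dropping the state once."""
    param, state = 1.0, fresh_state()
    for i, g in enumerate(grads):
        if i == reset_at:
            # Simulates the optimizer state being lost (hypothetical scenario).
            state = fresh_state()
        param = adam_step(param, g, state)
    return param, state


grads = [0.5, -0.2, 0.3, 0.1]
kept_param, kept_state = run(grads)
reset_param, reset_state = run(grads, reset_at=2)
print(kept_param, reset_param)
```

With the same gradients, the two runs end at different parameter values, because the run that drops its state restarts the moment estimates from zero. Whether anything like this actually happens in the reported setup is not established by the issue text; the sketch only shows why optimizer state would matter if it did.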

```python
Epoch 0: 81%|████████ | 759/937 [00:05
```

```python
ray::ImplicitFunc.train() (pid=27359, ip=172.31.59.24, repr=_inner_train)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/trainable.py", line 360, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/function_runner.py", line 404, in step
    self._report_thread_runner_error(block=True)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/function_runner.py", line 574, in _report_thread_runner_error
    raise e
  File...
```

The progress bar in `ray_ddp` shows the following results:

```python
Epoch 0: 3%|▎ | 32/937 [00:00
```

```python
(ci) @JiahaoYao ➜ /workspaces/ray_lightning/ray_lightning/tests (main ✗) $ python -m pytest -v --durations=0 -x test_ddp_sharded.py
=========================================================== test session starts ===========================================================
platform linux -- Python 3.7.13, pytest-7.1.2, pluggy-1.0.0 -- /home/codespace/.conda/envs/ci/bin/python
cachedir:...
```

```python
../../../../home/codespace/.conda/envs/ci/lib/python3.7/site-packages/torch/utils/tensorboard/__init__.py:5
  /home/codespace/.conda/envs/ci/lib/python3.7/site-packages/torch/utils/tensorboard/__init__.py:5: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
    tensorboard.__version__
../../../../home/codespace/.conda/envs/ci/lib/python3.7/site-packages/torch/utils/tensorboard/__init__.py:6
  /home/codespace/.conda/envs/ci/lib/python3.7/site-packages/torch/utils/tensorboard/__init__.py:6: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
    ) < LooseVersion("1.15"):
ray_lightning/tests/test_ddp.py::test_actor_creation[1]
  /home/codespace/.conda/envs/ci/lib/python3.7/site-packages/torch/distributed/_sharded_tensor/__init__.py:10: DeprecationWarning:...
```