
Optuna seemingly stuck with multiple GPUs

Open · DeastinY opened this issue 4 years ago · 10 comments

  • PyTorch-Forecasting version: 0.8.4
  • PyTorch version: 1.8.0
  • Python version: 3.8.8
  • Operating System: CentOS

Expected behavior

I'm working through the "Demand forecasting with the Temporal Fusion Transformer" tutorial and am trying to run the optimize_hyperparameters part on two GPUs.

Actual behavior

I get some output, but it never finishes. With only a single GPU utilized, it finishes within minutes without any issues.

[I 2021-04-13 15:40:26,906] A new study created in memory with name: no-name-e455a085-bb8c-4052-a225-ef363fb68e4c initializing ddp: GLOBAL_RANK: 1, MEMBER: 1/2

Code to reproduce the problem

https://pytorch-forecasting.readthedocs.io/en/latest/tutorials/stallion.html

This works:

study = optimize_hyperparameters(
    train_dataloader,
    val_dataloader,
    model_path="optuna_test",
    n_trials=200,
    max_epochs=50,
    gradient_clip_val_range=(0.01, 1.0),
    hidden_size_range=(8, 128),
    hidden_continuous_size_range=(8, 128),
    attention_head_size_range=(1, 4),
    learning_rate_range=(0.001, 0.1),
    dropout_range=(0.1, 0.3),
    trainer_kwargs=dict(limit_train_batches=30),
    reduce_on_plateau_patience=4,
    use_learning_rate_finder=False,  # False: let Optuna tune the learning rate; True: use the in-built learning rate finder
)

After changing the trainer_kwargs line as follows, it no longer works:

    trainer_kwargs=dict(limit_train_batches=30, gpus=2),

DeastinY avatar Apr 14 '21 09:04 DeastinY

Could you add accelerator="ddp" to the trainer_kwargs?
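For example, a trimmed, untested sketch of what the call might look like (with Lightning 1.2/1.3 the flag is accelerator="ddp"; newer releases call it strategy="ddp"):

study = optimize_hyperparameters(
    train_dataloader,
    val_dataloader,
    model_path="optuna_test",
    n_trials=200,
    max_epochs=50,
    trainer_kwargs=dict(
        limit_train_batches=30,
        gpus=2,
        accelerator="ddp",  # one process per GPU via DistributedDataParallel
    ),
    reduce_on_plateau_patience=4,
    use_learning_rate_finder=False,
)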

jdb78 avatar Apr 17 '21 20:04 jdb78

It runs, but does not use both GPUs.

[I 2021-04-20 16:14:18,058] A new study created in memory with name: no-name-e6dcc64e-75aa-4f8b-8e26-b632835e3df1
GPU available: True, used: True
INFO:lightning:GPU available: True, used: True
TPU available: None, using: 0 TPU cores
INFO:lightning:TPU available: None, using: 0 TPU cores
GPU available: True, used: True
INFO:lightning:GPU available: True, used: True
TPU available: None, using: 0 TPU cores
INFO:lightning:TPU available: None, using: 0 TPU cores
INFO:pytorch_lightning.accelerators.gpu:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
Set SLURM handle signals.
INFO:lightning:Set SLURM handle signals.
Finding best initial lr: 100%|██████████| 100/100 [01:04<00:00,  1.55it/s]
[I 2021-04-20 16:15:46,888] Using learning rate of 0.0224
INFO:pytorch_lightning.accelerators.gpu:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/1
INFO:lightning:initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/1
INFO:root:Added key: store_based_barrier_key:1 to store for rank: 0
Set SLURM handle signals.
INFO:lightning:Set SLURM handle signals.

[... model info removed to declutter ...]

Epoch 0:   0%|          | 1/1520 [00:01<30:08,  1.19s/it, loss=24.1, v_num=0, val_loss=29.60]

INFO:root:Reducer buckets have been rebuilt in this iteration.

Epoch 0:  11%|█▏        | 173/1520 [01:55<15:01,  1.49it/s, loss=11.9, v_num=0, val_loss=29.60, train_loss_step=11.70]


This is the output of nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:62:00.0 Off |                    0 |
| N/A   49C    P0    79W / 300W |   2870MiB / 16160MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:89:00.0 Off |                    0 |
| N/A   44C    P0    41W / 300W |      3MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

DeastinY avatar Apr 20 '21 14:04 DeastinY

Strange, does it work when training directly (no hyperparameter tuning) or is PyTorch Lightning also only using one GPU?
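For example, something along these lines (a rough, untested sketch; training and the dataloaders are assumed to come from the Stallion tutorial, and the hyperparameter values are only placeholders):

import pytorch_lightning as pl
from pytorch_forecasting import TemporalFusionTransformer
from pytorch_forecasting.metrics import QuantileLoss

# build the model directly from the tutorial's TimeSeriesDataSet
tft = TemporalFusionTransformer.from_dataset(
    training,
    learning_rate=0.03,
    hidden_size=16,
    attention_head_size=1,
    dropout=0.1,
    hidden_continuous_size=8,
    loss=QuantileLoss(),
)

trainer = pl.Trainer(
    max_epochs=5,
    gpus=2,
    accelerator="ddp",  # argument names as of Lightning 1.2/1.3; newer releases use devices/strategy
    limit_train_batches=30,
)
trainer.fit(tft, train_dataloader, val_dataloader)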

jdb78 avatar Apr 29 '21 11:04 jdb78

I might have the same problem. optimize_hyperparameters() is extremely slow, and the two "threads" (one per GPU) run one after the other instead of in parallel.

Something that seems really wrong is that two subprocesses are spawned with the same command line as the original process. I wonder which of the modules does that.

jwezel avatar May 02 '21 14:05 jwezel

Strange, does it work when training directly (no hyperparameter tuning) or is PyTorch Lightning also only using one GPU?

Sorry for the delayed response. When training directly, it seems to load data onto one GPU and then do nothing.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:62:00.0 Off |                    0 |
| N/A   50C    P0    60W / 300W |   1300MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:89:00.0 Off |                    0 |
| N/A   46C    P0    42W / 300W |      3MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

DeastinY avatar May 12 '21 11:05 DeastinY

Do you have the same issue with the example here? https://github.com/optuna/optuna-examples/blob/main/pytorch/pytorch_lightning_simple.py I wonder if this is a third party bug. If not, maybe you spot the difference in implementations.

jdb78 avatar May 15 '21 19:05 jdb78

Running the examples leads to this issue: https://github.com/optuna/optuna-examples/issues/14

DeastinY avatar May 27 '21 09:05 DeastinY

Hi, I'm Kento Nozawa from the Optuna community. The latest Optuna PyTorch Lightning callback can handle distributed training! A minimal example is https://github.com/optuna/optuna-examples/blob/main/pytorch/pytorch_lightning_ddp.py.
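The core of that example looks roughly like this (a trimmed, untested sketch; build_model and the dataloaders are hypothetical placeholders, and the exact Trainer arguments depend on your Lightning version):

import optuna
import pytorch_lightning as pl
from optuna.integration import PyTorchLightningPruningCallback


def objective(trial: optuna.Trial) -> float:
    # build_model is a hypothetical helper returning a LightningModule
    dropout = trial.suggest_float("dropout", 0.1, 0.3)
    model = build_model(dropout=dropout)

    trainer = pl.Trainer(
        max_epochs=10,
        accelerator="gpu",
        devices=2,
        strategy="ddp_spawn",  # the example uses DDP spawn rather than plain DDP
        callbacks=[PyTorchLightningPruningCallback(trial, monitor="val_loss")],
    )
    trainer.fit(model, train_dataloader, val_dataloader)  # dataloaders assumed to exist
    return trainer.callback_metrics["val_loss"].item()


study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20)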

Best,

nzw0301 avatar Jan 04 '22 09:01 nzw0301

Hi, I'm Kento Nozawa from the Optuna community. The latest Optuna PyTorch Lightning callback can handle distributed training! A minimal example is https://github.com/optuna/optuna-examples/blob/main/pytorch/pytorch_lightning_ddp.py.

In the linked example, DDP spawn is used instead of the typical DDP strategy. Is that absolutely required?

cody-mar10 avatar Aug 25 '23 17:08 cody-mar10

I might have the same problem. optimize_hyperparameters() is extremely slow, and the two "threads" (one per GPU) run one after the other instead of in parallel.

Something that seems really wrong is that two subprocesses are spawned with the same command line as the original process. I wonder which of the modules does that.

Did you manage to solve this issue? I am trying to use this function with DDP over 2 GPUs, but it is very slow and only uses 1 GPU. When I use "ddp" in the trainer_kwargs, it says that each model has different parameters. I tried setting seeds, but this did not help. Any help would be greatly appreciated!

aman1b avatar Jul 04 '24 14:07 aman1b