[common] Issue with multi-gpu and `ddp_spawn` strategy when running predict
What happened + What you expected to happen
The `predict` method fails with the following error when the model has been trained on multiple GPUs with the `ddp_spawn` strategy:

`TypeError: vstack(): argument 'tensors' (position 1) must be tuple of Tensors, not NoneType`

This seems to be caused by the PyTorch Lightning `Trainer` returning `None` from `predict` in the multi-GPU case. It looks like there is already an existing fix for this: https://github.com/Nixtla/neuralforecast/pull/391/files

However, the issue persists on my side. I was able to work around it by modifying `common/_base_windows.py` to drop the `strategy` argument from my `trainer_kwargs`.
I'm using:

```python
trainer_kwargs = {
    'accelerator': 'gpu',
    'devices': 8,
    'strategy': 'ddp_spawn'  # Distributed Data Parallel (spawn) strategy
}
```
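The workaround can be sketched without patching the library, assuming you have a way to supply a second set of trainer kwargs at prediction time (in my case I had to edit `common/_base_windows.py`, since the model reuses the kwargs it was constructed with):

```python
# Trained with ddp_spawn across 8 GPUs; under that strategy Trainer.predict
# returns results only in the main process, so spawned workers hand back
# None and torch.vstack fails. Sketch of the workaround: strip the
# distributed strategy (and fall back to one device) before predicting.
trainer_kwargs = {
    'accelerator': 'gpu',
    'devices': 8,
    'strategy': 'ddp_spawn',  # needed for multi-GPU training
}

# Prediction-time kwargs: no 'strategy', single device, so Trainer.predict
# runs in the main process and actually returns tensors.
predict_kwargs = {k: v for k, v in trainer_kwargs.items() if k != 'strategy'}
predict_kwargs['devices'] = 1
```

Whether these can be injected cleanly at predict time is an assumption; the change I actually made was inside `common/_base_windows.py`.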
Versions / Dependencies
Running this on Sagemaker (AL2, 5.10.215-203.850.amzn2.x86_64)
Python 3.10
torch==2.1.0
pytorch-lightning==2.2.5
neuralforecast==1.7.2
Reproduction script
```python
from utilsforecast.data import generate_series
from neuralforecast import NeuralForecast
from neuralforecast.models import NBEATS
from neuralforecast.losses.pytorch import DistributionLoss
import torch

torch.set_float32_matmul_precision('high')


def main():
    series = generate_series(10, min_length=200, max_length=500)
    h = 7
    valid = series.groupby('unique_id', observed=True).tail(h)
    train = series.drop(valid.index)
    trainer_kwargs = {
        'accelerator': 'gpu',
        'devices': 8,
        'strategy': 'ddp_spawn'}
    models = NBEATS(h=h,
                    input_size=7,
                    loss=DistributionLoss(distribution='Poisson', level=[90]),
                    max_steps=100,
                    scaler_type='standard',
                    **trainer_kwargs)
    model = NeuralForecast(models=[models], freq='D')
    model.fit(train)
    p = model.predict(train)


if __name__ == "__main__":
    main()
```
Issue Severity
Medium: It is a significant difficulty but I can work around it.