[common] Issue with multi-gpu and `ddp_spawn` strategy when running predict
What happened + What you expected to happen
The `predict` method fails with the following error when the model has been trained on multiple GPUs with the `ddp_spawn` strategy:

`TypeError: vstack(): argument 'tensors' (position 1) must be tuple of Tensors, not NoneType`

This seems to be caused by the PyTorch Lightning `Trainer` returning `None` from `predict` in the multi-GPU case. It looks like there is already an existing fix for this: https://github.com/Nixtla/neuralforecast/pull/391/files

However, the issue persists on my side. I was able to work around it by modifying `common/_base_windows.py` to drop the `strategy` argument from my `trainer_kwargs`.
I'm using:

```python
trainer_kwargs = {
    'accelerator': 'gpu',
    'devices': 8,
    'strategy': 'ddp_spawn'  # Distributed Data Parallel (spawn) strategy
}
```
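The workaround can be sketched without patching the library, assuming you have a way to supply a second set of trainer kwargs at prediction time (in my case I had to edit `common/_base_windows.py`, since the model reuses the kwargs it was constructed with):

```python
# Trained with ddp_spawn across 8 GPUs; under that strategy Trainer.predict
# returns results only in the main process, so spawned workers hand back
# None and torch.vstack fails. Sketch of the workaround: strip the
# distributed strategy (and fall back to one device) before predicting.
trainer_kwargs = {
    'accelerator': 'gpu',
    'devices': 8,
    'strategy': 'ddp_spawn',  # needed for multi-GPU training
}

# Prediction-time kwargs: no 'strategy', single device, so Trainer.predict
# runs in the main process and actually returns tensors.
predict_kwargs = {k: v for k, v in trainer_kwargs.items() if k != 'strategy'}
predict_kwargs['devices'] = 1
```

Whether these can be injected cleanly at predict time is an assumption; the change I actually made was inside `common/_base_windows.py`.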
Versions / Dependencies
Running this on Sagemaker (AL2, 5.10.215-203.850.amzn2.x86_64)
Python 3.10
torch==2.1.0
pytorch-lightning==2.2.5
neuralforecast==1.7.2
Reproduction script
```python
from utilsforecast.data import generate_series
from neuralforecast import NeuralForecast
from neuralforecast.models import NBEATS
from neuralforecast.losses.pytorch import DistributionLoss
import torch

torch.set_float32_matmul_precision('high')


def main():
    series = generate_series(10, min_length=200, max_length=500)
    h = 7
    valid = series.groupby('unique_id', observed=True).tail(h)
    train = series.drop(valid.index)
    trainer_kwargs = {
        'accelerator': 'gpu',
        'devices': 8,
        'strategy': 'ddp_spawn'}
    models = NBEATS(h=h,
                    input_size=7,
                    loss=DistributionLoss(distribution='Poisson', level=[90]),
                    max_steps=100,
                    scaler_type='standard',
                    **trainer_kwargs)
    model = NeuralForecast(models=[models], freq='D')
    model.fit(train)
    p = model.predict(train)


if __name__ == "__main__":
    main()
```
Issue Severity
Medium: It is a significant difficulty but I can work around it.