
Auto models get ray.exceptions.ActorDiedError when run on multi-GPU node

Open bstewart311 opened this issue 11 months ago • 11 comments

What happened + What you expected to happen

I started a multi-GPU node in the AWS SageMaker JupyterLab environment, cloned neuralforecast, navigated to the experiments/long_horizon directory, and created the long_horizon conda environment from environment.yml (as described in long_horizon/readme.md). Then I ran:

  • python run_nhits.py --dataset 'ETTh1' --horizon 96 --num_samples 1

and I got the following error:

(_train_tune pid=22946) [rank: 1] Child process with PID 23026 terminated with code 1. Forcefully terminating all other processes to avoid zombies 🧟 (raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff14b6988a6a4935518fd4985a01000000 Worker ID: fa49156216bca352caee735419dc8cf5a59f668a96d20f17a94ddec9 Node ID: 6eabca3b606628a676f68efc346d601b765017331b5032e290b6b99e Worker IP address: 169.255.255.2 Worker port: 33095 Worker PID: 22946 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors. 2025-03-18 19:36:13,639 ERROR tune_controller.py:1331 -- Trial task failed for trial _train_tune_39116_00000 Traceback (most recent call last): File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future result = ray.get(future) ^^^^^^^^^^^^^^^ File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper return fn(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^ File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper return func(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^ File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/ray/_private/worker.py", line 2771, in get values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/ray/_private/worker.py", 
line 921, in get_objects raise value ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task. class_name: ImplicitFunc actor_id: 14b6988a6a4935518fd4985a01000000 pid: 22946 namespace: efcdee2a-00c0-40d7-920b-84c180446a48 ip: 169.255.255.2 The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.

I expected the model to train on all 4 GPUs and run to completion. Here is the full error trace across all 4 GPUs:

(long_horizon) sagemaker-user@default:~/Nixtla/neuralforecast/experiments/long_horizon$ python run_nhits.py --dataset 'ETTh1' --horizon 96 --num_samples 1 /home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/ray/tune/impl/tuner_internal.py:125: RayDeprecationWarning: The RunConfig class should be imported from ray.tune when passing it to the Tuner. Please update your imports. See this issue for more context and migration options: https://github.com/ray-project/ray/issues/49454. Disable these warnings by setting the environment variable: RAY_TRAIN_ENABLE_V2_MIGRATION_WARNINGS=0 _log_deprecation_warning( 2025-03-18 19:36:03,203 INFO worker.py:1841 -- Started a local Ray instance. 2025-03-18 19:36:04,295 INFO tune.py:253 -- Initializing Ray automatically. For cluster usage or custom Ray initialization, call ray.init(...) before Tuner(...). ╭────────────────────────────────────────────────────────────────────╮ │ Configuration for experiment _train_tune_2025-03-18_19-36-02 │ ├────────────────────────────────────────────────────────────────────┤ │ Search algorithm BasicVariantGenerator │ │ Scheduler FIFOScheduler │ │ Number of trials 1 │ ╰────────────────────────────────────────────────────────────────────╯

View detailed results here: /home/sagemaker-user/ray_results/_train_tune_2025-03-18_19-36-02 To visualize your results with TensorBoard, run: tensorboard --logdir /tmp/ray/session_2025-03-18_19-36-02_098330_19785/artifacts/2025-03-18_19-36-04/_train_tune_2025-03-18_19-36-02/driver_artifacts (_train_tune pid=22946) /home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/ray/tune/integration/pytorch_lightning.py:198: ray.tune.integration.pytorch_lightning.TuneReportCallback is deprecated. Use ray.tune.integration.pytorch_lightning.TuneReportCheckpointCallback instead. (_train_tune pid=22946) Seed set to 2 (_train_tune pid=22946) GPU available: True (cuda), used: True (_train_tune pid=22946) TPU available: False, using: 0 TPU cores (_train_tune pid=22946) HPU available: False, using: 0 HPUs (_train_tune pid=22946) Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4 (_train_tune pid=22946) [rank: 1] Child process with PID 23026 terminated with code 1. Forcefully terminating all other processes to avoid zombies 🧟 (raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff14b6988a6a4935518fd4985a01000000 Worker ID: fa49156216bca352caee735419dc8cf5a59f668a96d20f17a94ddec9 Node ID: 6eabca3b606628a676f68efc346d601b765017331b5032e290b6b99e Worker IP address: 169.255.255.2 Worker port: 33095 Worker PID: 22946 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors. 
2025-03-18 19:36:13,639 ERROR tune_controller.py:1331 -- Trial task failed for trial _train_tune_39116_00000 Traceback (most recent call last): File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future result = ray.get(future) ^^^^^^^^^^^^^^^ File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper return fn(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^ File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper return func(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^ File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/ray/_private/worker.py", line 2771, in get values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/ray/_private/worker.py", line 921, in get_objects raise value ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task. class_name: ImplicitFunc actor_id: 14b6988a6a4935518fd4985a01000000 pid: 22946 namespace: efcdee2a-00c0-40d7-920b-84c180446a48 ip: 169.255.255.2 The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.

Trial _train_tune_39116_00000 errored after 0 iterations at 2025-03-18 19:36:13. Total running time: 9s Error file: /tmp/ray/session_2025-03-18_19-36-02_098330_19785/artifacts/2025-03-18_19-36-04/_train_tune_2025-03-18_19-36-02/driver_artifacts/_train_tune_39116_00000_0_activation=ReLU,batch_size=7,dropout_prob_theta=0.5000,input_size=672,interpolation_mode=linear,learning_2025-03-18_19-36-04/error.txt 2025-03-18 19:36:13,646 INFO tune.py:1009 -- Wrote the latest version of all result files and experiment state to '/home/sagemaker-user/ray_results/_train_tune_2025-03-18_19-36-02' in 0.0041s.

2025-03-18 19:36:13,647 ERROR tune.py:1037 -- Trials did not complete: [_train_tune_39116_00000] Seed set to 2 Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4 /home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/ray/tune/impl/tuner_internal.py:125: RayDeprecationWarning: The RunConfig class should be imported from ray.tune when passing it to the Tuner. Please update your imports. See this issue for more context and migration options: https://github.com/ray-project/ray/issues/49454. Disable these warnings by setting the environment variable: RAY_TRAIN_ENABLE_V2_MIGRATION_WARNINGS=0 _log_deprecation_warning( /home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/ray/tune/impl/tuner_internal.py:125: RayDeprecationWarning: The RunConfig class should be imported from ray.tune when passing it to the Tuner. Please update your imports. See this issue for more context and migration options: https://github.com/ray-project/ray/issues/49454. Disable these warnings by setting the environment variable: RAY_TRAIN_ENABLE_V2_MIGRATION_WARNINGS=0 _log_deprecation_warning( /home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/ray/tune/impl/tuner_internal.py:125: RayDeprecationWarning: The RunConfig class should be imported from ray.tune when passing it to the Tuner. Please update your imports. See this issue for more context and migration options: https://github.com/ray-project/ray/issues/49454. Disable these warnings by setting the environment variable: RAY_TRAIN_ENABLE_V2_MIGRATION_WARNINGS=0 _log_deprecation_warning( 2025-03-18 19:36:19,745 INFO worker.py:1841 -- Started a local Ray instance. 2025-03-18 19:36:19,755 INFO worker.py:1841 -- Started a local Ray instance. 2025-03-18 19:36:19,786 INFO worker.py:1841 -- Started a local Ray instance. 2025-03-18 19:36:21,623 INFO tune.py:253 -- Initializing Ray automatically. For cluster usage or custom Ray initialization, call ray.init(...) before Tuner(...). 
╭────────────────────────────────────────────────────────────────────╮ │ Configuration for experiment _train_tune_2025-03-18_19-36-18 │ ├────────────────────────────────────────────────────────────────────┤ │ Search algorithm BasicVariantGenerator │ │ Scheduler FIFOScheduler │ │ Number of trials 1 │ ╰────────────────────────────────────────────────────────────────────╯

View detailed results here: /home/sagemaker-user/ray_results/_train_tune_2025-03-18_19-36-18 To visualize your results with TensorBoard, run: tensorboard --logdir /tmp/ray/session_2025-03-18_19-36-18_635368_23291/artifacts/2025-03-18_19-36-21/_train_tune_2025-03-18_19-36-18/driver_artifacts 2025-03-18 19:36:21,674 INFO tune.py:253 -- Initializing Ray automatically. For cluster usage or custom Ray initialization, call ray.init(...) before Tuner(...). ╭────────────────────────────────────────────────────────────────────╮ │ Configuration for experiment _train_tune_2025-03-18_19-36-18 │ ├────────────────────────────────────────────────────────────────────┤ │ Search algorithm BasicVariantGenerator │ │ Scheduler FIFOScheduler │ │ Number of trials 1 │ ╰────────────────────────────────────────────────────────────────────╯

View detailed results here: /home/sagemaker-user/ray_results/_train_tune_2025-03-18_19-36-18 To visualize your results with TensorBoard, run: tensorboard --logdir /tmp/ray/session_2025-03-18_19-36-18_645623_23290/artifacts/2025-03-18_19-36-21/_train_tune_2025-03-18_19-36-18/driver_artifacts 2025-03-18 19:36:21,956 INFO tune.py:253 -- Initializing Ray automatically. For cluster usage or custom Ray initialization, call ray.init(...) before Tuner(...). ╭────────────────────────────────────────────────────────────────────╮ │ Configuration for experiment _train_tune_2025-03-18_19-36-18 │ ├────────────────────────────────────────────────────────────────────┤ │ Search algorithm BasicVariantGenerator │ │ Scheduler FIFOScheduler │ │ Number of trials 1 │ ╰────────────────────────────────────────────────────────────────────╯

View detailed results here: /home/sagemaker-user/ray_results/_train_tune_2025-03-18_19-36-18 To visualize your results with TensorBoard, run: tensorboard --logdir /tmp/ray/session_2025-03-18_19-36-18_682673_23289/artifacts/2025-03-18_19-36-21/_train_tune_2025-03-18_19-36-18/driver_artifacts (_train_tune pid=32621) /home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/ray/tune/integration/pytorch_lightning.py:198: ray.tune.integration.pytorch_lightning.TuneReportCallback is deprecated. Use ray.tune.integration.pytorch_lightning.TuneReportCheckpointCallback instead. (_train_tune pid=32621) [rank: 2] Seed set to 2 (_train_tune pid=32554) /home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/ray/tune/integration/pytorch_lightning.py:198: ray.tune.integration.pytorch_lightning.TuneReportCallback is deprecated. Use ray.tune.integration.pytorch_lightning.TuneReportCheckpointCallback instead. (_train_tune pid=32554) [rank: 3] Seed set to 2 (_train_tune pid=32822) /home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/ray/tune/integration/pytorch_lightning.py:198: ray.tune.integration.pytorch_lightning.TuneReportCallback is deprecated. Use ray.tune.integration.pytorch_lightning.TuneReportCheckpointCallback instead. 
(_train_tune pid=32822) [rank: 1] Seed set to 2 (_train_tune pid=32621) Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/4 (_train_tune pid=32554) Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/4 (_train_tune pid=32822) Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/4 (_train_tune pid=32554) LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3] Sanity Checking DataLoader 0: 0%| | 0/1 [00:00<?, ?it/s](_train_tune pid=32621) LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3] (_train_tune pid=32822) LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3] Epoch 999: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 21.94it/s, v_num=2, train_loss_step=0.126, train_loss_epoch=0.126, valid_loss=0.537] 2025-03-18 19:36:48,017 INFO tune.py:1009 -- Wrote the latest version of all result files and experiment state to '/home/sagemaker-user/ray_results/_train_tune_2025-03-18_19-36-18' in 0.0040s.
2025-03-18 19:36:48,018 INFO tune.py:1009 -- Wrote the latest version of all result files and experiment state to '/home/sagemaker-user/ray_results/_train_tune_2025-03-18_19-36-18' in 0.0036s. 2025-03-18 19:36:48,018 INFO tune.py:1009 -- Wrote the latest version of all result files and experiment state to '/home/sagemaker-user/ray_results/_train_tune_2025-03-18_19-36-18' in 0.0042s.

[rank: 3] Seed set to 2 [rank: 1] Seed set to 2 [rank: 2] Seed set to 2 Predicting DataLoader 0: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 13.60it/s]

Parsed results NHITS ETTh1 h=96 test_size 2880 y_true.shape (n_series, n_windows, n_time_out): (7, 2785, 96) y_hat.shape (n_series, n_windows, n_time_out): (7, 2785, 96) MSE: 0.33694613410636276 MAE: 0.4027628248310379 Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/4 Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/4 Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/4 [rank2]: Traceback (most recent call last): [rank2]: File "/home/sagemaker-user/Nixtla/neuralforecast/experiments/long_horizon/run_nhits.py", line 84, in [rank2]: Y_hat_df = nf.cross_validation(df=Y_df, val_size=val_size, [rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank2]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/neuralforecast/core.py", line 1187, in cross_validation [rank2]: return self._no_refit_cross_validation( [rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank2]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/neuralforecast/core.py", line 1036, in _no_refit_cross_validation [rank2]: model.fit(dataset=self.dataset, val_size=val_size, test_size=test_size) [rank2]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/neuralforecast/common/_base_auto.py", line 433, in fit [rank2]: self.model = self._fit_model( [rank2]: ^^^^^^^^^^^^^^^^ [rank2]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/neuralforecast/common/_base_auto.py", line 366, in _fit_model [rank2]: model = model.fit( [rank2]: ^^^^^^^^^^ [rank2]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/neuralforecast/common/_base_model.py", line 1468, in fit [rank2]: return self._fit( [rank2]: ^^^^^^^^^^ [rank2]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/neuralforecast/common/_base_model.py", line 546, in _fit [rank2]: trainer.fit(model, datamodule=datamodule) [rank2]: File 
"/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 539, in fit [rank2]: call._call_and_handle_interrupt( [rank2]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/trainer/call.py", line 46, in _call_and_handle_interrupt [rank2]: return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs) [rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank2]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 105, in launch [rank2]: return function(*args, **kwargs) [rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^ [rank2]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 575, in _fit_impl [rank2]: self._run(model, ckpt_path=ckpt_path) [rank2]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 939, in _run [rank2]: self.__setup_profiler() [rank2]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 1072, in __setup_profiler [rank2]: self.profiler.setup(stage=self.state.fn, local_rank=local_rank, log_dir=self.log_dir) [rank2]: ^^^^^^^^^^^^ [rank2]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 1234, in log_dir [rank2]: dirpath = self.strategy.broadcast(dirpath) [rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank2]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/strategies/ddp.py", line 307, in broadcast [rank2]: torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD) [rank2]: File 
"/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper [rank2]: return func(*args, **kwargs) [rank2]: ^^^^^^^^^^^^^^^^^^^^^ [rank2]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 3479, in broadcast_object_list [rank2]: broadcast(object_sizes_tensor, src=global_src, group=group) [rank2]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper [rank2]: return func(*args, **kwargs) [rank2]: ^^^^^^^^^^^^^^^^^^^^^ [rank2]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2726, in broadcast [rank2]: work = group.broadcast([tensor], opts) [rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank2]: torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:268, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5 [rank2]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 
[rank2]: Last error: [rank2]: socketStartConnect: Connect to 169.255.255.2<56139> failed : Software caused connection abort [rank1]: Traceback (most recent call last): [rank1]: File "/home/sagemaker-user/Nixtla/neuralforecast/experiments/long_horizon/run_nhits.py", line 84, in [rank1]: Y_hat_df = nf.cross_validation(df=Y_df, val_size=val_size, [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/neuralforecast/core.py", line 1187, in cross_validation [rank1]: return self._no_refit_cross_validation( [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/neuralforecast/core.py", line 1036, in _no_refit_cross_validation [rank1]: model.fit(dataset=self.dataset, val_size=val_size, test_size=test_size) [rank1]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/neuralforecast/common/_base_auto.py", line 433, in fit [rank1]: self.model = self._fit_model( [rank1]: ^^^^^^^^^^^^^^^^ [rank1]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/neuralforecast/common/_base_auto.py", line 366, in _fit_model [rank1]: model = model.fit( [rank1]: ^^^^^^^^^^ [rank1]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/neuralforecast/common/_base_model.py", line 1468, in fit [rank1]: return self._fit( [rank1]: ^^^^^^^^^^ [rank1]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/neuralforecast/common/_base_model.py", line 546, in _fit [rank1]: trainer.fit(model, datamodule=datamodule) [rank1]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 539, in fit [rank1]: call._call_and_handle_interrupt( [rank1]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/trainer/call.py", line 
46, in _call_and_handle_interrupt [rank1]: return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs) [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 105, in launch [rank1]: return function(*args, **kwargs) [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 575, in _fit_impl [rank1]: self._run(model, ckpt_path=ckpt_path) [rank1]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 939, in _run [rank1]: self.__setup_profiler() [rank1]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 1072, in __setup_profiler [rank1]: self.profiler.setup(stage=self.state.fn, local_rank=local_rank, log_dir=self.log_dir) [rank1]: ^^^^^^^^^^^^ [rank1]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 1234, in log_dir [rank1]: dirpath = self.strategy.broadcast(dirpath) [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/strategies/ddp.py", line 307, in broadcast [rank1]: torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD) [rank1]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper [rank1]: return func(*args, **kwargs) [rank1]: ^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 3479, in broadcast_object_list [rank1]: 
broadcast(object_sizes_tensor, src=global_src, group=group) [rank1]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper [rank1]: return func(*args, **kwargs) [rank1]: ^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2726, in broadcast [rank1]: work = group.broadcast([tensor], opts) [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:268, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5 [rank1]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. [rank1]: Last error: [rank1]: socketStartConnect: Connect to 169.255.255.2<56139> failed : Software caused connection abort [rank3]: Traceback (most recent call last): [rank3]: File "/home/sagemaker-user/Nixtla/neuralforecast/experiments/long_horizon/run_nhits.py", line 84, in [rank3]: Y_hat_df = nf.cross_validation(df=Y_df, val_size=val_size, [rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank3]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/neuralforecast/core.py", line 1187, in cross_validation [rank3]: return self._no_refit_cross_validation( [rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank3]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/neuralforecast/core.py", line 1036, in _no_refit_cross_validation [rank3]: model.fit(dataset=self.dataset, val_size=val_size, test_size=test_size) [rank3]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/neuralforecast/common/_base_auto.py", line 433, in fit [rank3]: self.model = self._fit_model( [rank3]: ^^^^^^^^^^^^^^^^ [rank3]: File 
"/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/neuralforecast/common/_base_auto.py", line 366, in _fit_model [rank3]: model = model.fit( [rank3]: ^^^^^^^^^^ [rank3]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/neuralforecast/common/_base_model.py", line 1468, in fit [rank3]: return self._fit( [rank3]: ^^^^^^^^^^ [rank3]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/neuralforecast/common/_base_model.py", line 546, in _fit [rank3]: trainer.fit(model, datamodule=datamodule) [rank3]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 539, in fit [rank3]: call._call_and_handle_interrupt( [rank3]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/trainer/call.py", line 46, in _call_and_handle_interrupt [rank3]: return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs) [rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank3]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 105, in launch [rank3]: return function(*args, **kwargs) [rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^ [rank3]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 575, in _fit_impl [rank3]: self._run(model, ckpt_path=ckpt_path) [rank3]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 939, in _run [rank3]: self.__setup_profiler() [rank3]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 1072, in __setup_profiler [rank3]: self.profiler.setup(stage=self.state.fn, local_rank=local_rank, log_dir=self.log_dir) 
[rank3]: ^^^^^^^^^^^^ [rank3]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 1234, in log_dir [rank3]: dirpath = self.strategy.broadcast(dirpath) [rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank3]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/pytorch_lightning/strategies/ddp.py", line 307, in broadcast [rank3]: torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD) [rank3]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper [rank3]: return func(*args, **kwargs) [rank3]: ^^^^^^^^^^^^^^^^^^^^^ [rank3]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 3479, in broadcast_object_list [rank3]: broadcast(object_sizes_tensor, src=global_src, group=group) [rank3]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper [rank3]: return func(*args, **kwargs) [rank3]: ^^^^^^^^^^^^^^^^^^^^^ [rank3]: File "/home/sagemaker-user/.conda/envs/long_horizon/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2726, in broadcast [rank3]: work = group.broadcast([tensor], opts) [rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank3]: torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:268, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5 [rank3]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. [rank3]: Last error: [rank3]: socketStartConnect: Connect to 169.255.255.2<56139> failed : Software caused connection abort [rank: 1] Child process with PID 23289 terminated with code 1. Forcefully terminating all other processes to avoid zombies 🧟
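The NCCL error text itself suggests a first debugging step: re-run with NCCL_DEBUG=INFO. A minimal sketch of doing that from Python follows; the subprocess invocation and the NCCL_SOCKET_IFNAME hint are assumptions added for illustration, not part of the original report (the failed connect to 169.255.255.2 hints that NCCL may be selecting an unusable network interface on SageMaker):

```python
import os
import subprocess

# Build an environment with NCCL debug logging enabled, as the error text
# recommends ("run with NCCL_DEBUG=INFO for details").
env = dict(os.environ)
env["NCCL_DEBUG"] = "INFO"             # verbose NCCL init/transport logs
env["NCCL_DEBUG_SUBSYS"] = "INIT,NET"  # focus on interface/socket selection

# If the logs show NCCL binding to an unusable interface (the failed connect
# to 169.255.255.2 above hints at this), NCCL_SOCKET_IFNAME can restrict it:
# env["NCCL_SOCKET_IFNAME"] = "eth0"   # interface name is an assumption

cmd = ["python", "run_nhits.py",
       "--dataset", "ETTh1", "--horizon", "96", "--num_samples", "1"]
# subprocess.run(cmd, env=env, check=True)  # uncomment to launch the run
```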

Versions / Dependencies

Here is the pip freeze output:

    aiohappyeyeballs==2.6.1
    aiohttp==3.11.14
    aiosignal==1.3.2
    alembic==1.15.1
    attrs==25.3.0
    certifi==2025.1.31
    charset-normalizer==3.4.1
    click==8.1.8
    colorlog==6.9.0
    coreforecast==0.0.15
    datasetsforecast @ git+https://github.com/Nixtla/datasetsforecast.git@c0023084c52c244740598affe7afafa3d59f2729
    filelock==3.18.0
    frozenlist==1.5.0
    fsspec==2025.3.0
    greenlet==3.1.1
    idna==3.10
    Jinja2==3.1.6
    joblib==1.4.2
    jsonschema==4.23.0
    jsonschema-specifications==2024.10.1
    lightning-utilities==0.14.1
    Mako==1.3.9
    MarkupSafe==3.0.2
    mpmath==1.3.0
    msgpack==1.1.0
    multidict==6.2.0
    networkx==3.4.2
    neuralforecast @ git+https://github.com/Nixtla/neuralforecast.git@e2f473a51ba15fbf4c33ff76cc8d1687ab68c517
    numpy @ file:///home/conda/feedstock_root/build_artifacts/numpy_1668919096335/work
    nvidia-cublas-cu12==12.4.5.8
    nvidia-cuda-cupti-cu12==12.4.127
    nvidia-cuda-nvrtc-cu12==12.4.127
    nvidia-cuda-runtime-cu12==12.4.127
    nvidia-cudnn-cu12==9.1.0.70
    nvidia-cufft-cu12==11.2.1.3
    nvidia-curand-cu12==10.3.5.147
    nvidia-cusolver-cu12==11.6.1.9
    nvidia-cusparse-cu12==12.3.1.170
    nvidia-cusparselt-cu12==0.6.2
    nvidia-nccl-cu12==2.21.5
    nvidia-nvjitlink-cu12==12.4.127
    nvidia-nvtx-cu12==12.4.127
    optuna==4.2.1
    packaging==24.2
    pandas==2.2.3
    propcache==0.3.0
    protobuf==6.30.1
    pyarrow==19.0.1
    python-dateutil==2.9.0.post0
    pytorch-lightning==2.5.0.post0
    pytz==2025.1
    PyYAML==6.0.2
    ray==2.43.0
    referencing==0.36.2
    requests==2.32.3
    rpds-py==0.23.1
    scikit-learn==1.6.1
    scipy==1.15.2
    six==1.17.0
    SQLAlchemy==2.0.39
    sympy==1.13.1
    tensorboardX==2.6.2.2
    threadpoolctl==3.6.0
    torch==2.6.0
    torchmetrics==1.6.3
    tqdm==4.67.1
    triton==3.2.0
    typing_extensions==4.12.2
    tzdata==2025.1
    urllib3==2.3.0
    utilsforecast==0.2.12
    xlrd==2.0.1
    yarl==1.18.3

Reproduction script

On a node with multiple GPUs (e.g., an AWS ml.g4dn.12xlarge with 4 GPUs):

  • git clone https://github.com/Nixtla/neuralforecast.git
  • cd neuralforecast/experiments/long_horizon
  • conda env create -f environment.yml
  • conda activate long_horizon
  • python run_nhits.py --dataset 'ETTh1' --horizon 96 --num_samples 1

Issue Severity

Medium: It is a significant difficulty but I can work around it.

bstewart311 avatar Mar 18 '25 20:03 bstewart311

If I add 'devices': 1, to the nhits_config section of run_nhits.py, the code runs to completion correctly:

    "val_check_steps": tune.choice([100]),      # Compute validation every 100 epochs
    "random_seed": tune.randint(1, 10),
    # "devices": 1,                             # enable this configuration parameter to run to completion
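
Spelled out as runnable code, the workaround amounts to adding a fixed `devices` entry to the search-space dict. In this sketch `choice()` is a stand-in for `ray.tune.choice` so the snippet runs without Ray installed; everything else mirrors the config fragment above.

```python
# Stand-in for ray.tune.choice: Ray would sample a value from the list at
# tuning time, whereas this stub just returns the first (and only) option.
def choice(options):
    return options[0]

nhits_config = {
    "val_check_steps": choice([100]),  # compute validation every 100 steps
    "devices": 1,                      # pin each trial to a single GPU (the workaround)
}

print(nhits_config["devices"])  # prints 1
```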

bstewart311 avatar Mar 18 '25 20:03 bstewart311

Hi! To the best of my knowledge, this is a known issue, and the only way to fix it is to use a single GPU like you did. I'll investigate further when I get more time!

marcopeix avatar Mar 20 '25 13:03 marcopeix

Thank you @marcopeix! I ran the same procedure on a vanilla g4dn.12xlarge instance (4 GPUs) and confirmed that the same failure occurs as on the ml.g4dn.12xlarge, so I believe the issue is not a version mismatch with the (many!) default libraries AWS installs on its ML instances.

bstewart311 avatar Mar 23 '25 09:03 bstewart311

@bstewart311 What version of NF are you using? Can you try with the latest version?

elephaint avatar Mar 23 '25 10:03 elephaint

@elephaint Apologies, the below is a lot of detail to wade through! The short answer is, in this example I am installing NF 3.0.0 directly from the latest main branch. See my next comment for the other ways I've tried to stand up a working NF 3.0.0 install.

For the purpose of demonstrating the issue in an easily reproducible way, I used the environment.yml in the neuralforecast/experiments/long_horizon directory, which uses pip to install the latest main branch (if I understand the "pip install git+https..." construct correctly). Here's the environment.yml:

```yaml
name: long_horizon
channels:
  - conda-forge
dependencies:
  - numpy<1.24
  - pip
  - pip:
    - "git+https://github.com/Nixtla/datasetsforecast.git"
    - "git+https://github.com/Nixtla/neuralforecast.git"
```

When I run `conda env create -f environment.yml`, here's the result:

```
Successfully installed Mako-1.3.9 MarkupSafe-3.0.2 PyYAML-6.0.2 aiohappyeyeballs-2.6.1 aiohttp-3.11.14 aiosignal-1.3.2 alembic-1.15.1 attrs-25.3.0 certifi-2025.1.31 charset-normalizer-3.4.1 click-8.1.8 colorlog-6.9.0 coreforecast-0.0.15 datasetsforecast-1.0.0 filelock-3.18.0 frozenlist-1.5.0 fsspec-2025.3.0 greenlet-3.1.1 idna-3.10 jinja2-3.1.6 joblib-1.4.2 jsonschema-4.23.0 jsonschema-specifications-2024.10.1 lightning-utilities-0.14.2 mpmath-1.3.0 msgpack-1.1.0 multidict-6.2.0 networkx-3.4.2 neuralforecast-3.0.0 nvidia-cublas-cu12-12.4.5.8 nvidia-cuda-cupti-cu12-12.4.127 nvidia-cuda-nvrtc-cu12-12.4.127 nvidia-cuda-runtime-cu12-12.4.127 nvidia-cudnn-cu12-9.1.0.70 nvidia-cufft-cu12-11.2.1.3 nvidia-curand-cu12-10.3.5.147 nvidia-cusolver-cu12-11.6.1.9 nvidia-cusparse-cu12-12.3.1.170 nvidia-cusparselt-cu12-0.6.2 nvidia-nccl-cu12-2.21.5 nvidia-nvjitlink-cu12-12.4.127 nvidia-nvtx-cu12-12.4.127 optuna-4.2.1 packaging-24.2 pandas-2.2.3 propcache-0.3.0 protobuf-6.30.1 pyarrow-19.0.1 python-dateutil-2.9.0.post0 pytorch-lightning-2.5.1 pytz-2025.1 ray-2.44.0 referencing-0.36.2 requests-2.32.3 rpds-py-0.23.1 scikit-learn-1.6.1 scipy-1.15.2 six-1.17.0 sqlalchemy-2.0.39 sympy-1.13.1 tensorboardX-2.6.2.2 threadpoolctl-3.6.0 torch-2.6.0 torchmetrics-1.7.0 tqdm-4.67.1 triton-3.2.0 typing-extensions-4.12.2 tzdata-2025.2 urllib3-2.3.0 utilsforecast-0.2.12 xlrd-2.0.1 yarl-1.18.3
```

And when I run conda list in the newly created long_horizon environment I get the following:

```
# packages in environment at /home/sagemaker-user/.conda/envs/long_horizon:
#
# Name                      Version         Build                               Channel
_libgcc_mutex               0.1             conda_forge                         conda-forge
_openmp_mutex               4.5             2_gnu                               conda-forge
aiohappyeyeballs            2.6.1           pypi_0                              pypi
aiohttp                     3.11.14         pypi_0                              pypi
aiosignal                   1.3.2           pypi_0                              pypi
alembic                     1.15.1          pypi_0                              pypi
attrs                       25.3.0          pypi_0                              pypi
bzip2                       1.0.8           h4bc722e_7                          conda-forge
ca-certificates             2025.1.31       hbcca054_0                          conda-forge
certifi                     2025.1.31       pypi_0                              pypi
charset-normalizer          3.4.1           pypi_0                              pypi
click                       8.1.8           pypi_0                              pypi
colorlog                    6.9.0           pypi_0                              pypi
coreforecast                0.0.15          pypi_0                              pypi
datasetsforecast            1.0.0           pypi_0                              pypi
filelock                    3.18.0          pypi_0                              pypi
frozenlist                  1.5.0           pypi_0                              pypi
fsspec                      2025.3.0        pypi_0                              pypi
greenlet                    3.1.1           pypi_0                              pypi
idna                        3.10            pypi_0                              pypi
jinja2                      3.1.6           pypi_0                              pypi
joblib                      1.4.2           pypi_0                              pypi
jsonschema                  4.23.0          pypi_0                              pypi
jsonschema-specifications   2024.10.1       pypi_0                              pypi
ld_impl_linux-64            2.43            h712a8e2_4                          conda-forge
libblas                     3.9.0           31_h59b9bed_openblas                conda-forge
libcblas                    3.9.0           31_he106b2a_openblas                conda-forge
libexpat                    2.6.4           h5888daf_0                          conda-forge
libffi                      3.4.6           h2dba641_0                          conda-forge
libgcc                      14.2.0          h767d61c_2                          conda-forge
libgcc-ng                   14.2.0          h69a702a_2                          conda-forge
libgfortran                 14.2.0          h69a702a_2                          conda-forge
libgfortran5                14.2.0          hf1ad2bd_2                          conda-forge
libgomp                     14.2.0          h767d61c_2                          conda-forge
liblapack                   3.9.0           31_h7ac8fdf_openblas                conda-forge
liblzma                     5.6.4           hb9d3cd8_0                          conda-forge
libnsl                      2.0.1           hd590300_0                          conda-forge
libopenblas                 0.3.29          pthreads_h94d23a6_0                 conda-forge
libsqlite                   3.49.1          hee588c1_2                          conda-forge
libstdcxx                   14.2.0          h8f9b012_2                          conda-forge
libstdcxx-ng                14.2.0          h4852527_2                          conda-forge
libuuid                     2.38.1          h0b41bf4_0                          conda-forge
libxcrypt                   4.4.36          hd590300_1                          conda-forge
libzlib                     1.3.1           hb9d3cd8_2                          conda-forge
lightning-utilities         0.14.2          pypi_0                              pypi
mako                        1.3.9           pypi_0                              pypi
markupsafe                  3.0.2           pypi_0                              pypi
mpmath                      1.3.0           pypi_0                              pypi
msgpack                     1.1.0           pypi_0                              pypi
multidict                   6.2.0           pypi_0                              pypi
ncurses                     6.5             h2d0b736_3                          conda-forge
networkx                    3.4.2           pypi_0                              pypi
neuralforecast              3.0.0           pypi_0                              pypi
numpy                       1.23.5          py311h7d28db0_0                     conda-forge
nvidia-cublas-cu12          12.4.5.8        pypi_0                              pypi
nvidia-cuda-cupti-cu12      12.4.127        pypi_0                              pypi
nvidia-cuda-nvrtc-cu12      12.4.127        pypi_0                              pypi
nvidia-cuda-runtime-cu12    12.4.127        pypi_0                              pypi
nvidia-cudnn-cu12           9.1.0.70        pypi_0                              pypi
nvidia-cufft-cu12           11.2.1.3        pypi_0                              pypi
nvidia-curand-cu12          10.3.5.147      pypi_0                              pypi
nvidia-cusolver-cu12        11.6.1.9        pypi_0                              pypi
nvidia-cusparse-cu12        12.3.1.170      pypi_0                              pypi
nvidia-cusparselt-cu12      0.6.2           pypi_0                              pypi
nvidia-nccl-cu12            2.21.5          pypi_0                              pypi
nvidia-nvjitlink-cu12       12.4.127        pypi_0                              pypi
nvidia-nvtx-cu12            12.4.127        pypi_0                              pypi
openssl                     3.4.1           h7b32b05_0                          conda-forge
optuna                      4.2.1           pypi_0                              pypi
packaging                   24.2            pypi_0                              pypi
pandas                      2.2.3           pypi_0                              pypi
pip                         25.0.1          pyh8b19718_0                        conda-forge
propcache                   0.3.0           pypi_0                              pypi
protobuf                    6.30.1          pypi_0                              pypi
pyarrow                     19.0.1          pypi_0                              pypi
python                      3.11.11         h9e4cc4f_2_cpython                  conda-forge
python-dateutil             2.9.0.post0     pypi_0                              pypi
python_abi                  3.11            5_cp311                             conda-forge
pytorch-lightning           2.5.1           pypi_0                              pypi
pytz                        2025.1          pypi_0                              pypi
pyyaml                      6.0.2           pypi_0                              pypi
ray                         2.44.0          pypi_0                              pypi
readline                    8.2             h8c095d6_2                          conda-forge
referencing                 0.36.2          pypi_0                              pypi
requests                    2.32.3          pypi_0                              pypi
rpds-py                     0.23.1          pypi_0                              pypi
scikit-learn                1.6.1           pypi_0                              pypi
scipy                       1.15.2          pypi_0                              pypi
setuptools                  75.8.2          pyhff2d567_0                        conda-forge
six                         1.17.0          pypi_0                              pypi
sqlalchemy                  2.0.39          pypi_0                              pypi
sympy                       1.13.1          pypi_0                              pypi
tensorboardx                2.6.2.2         pypi_0                              pypi
threadpoolctl               3.6.0           pypi_0                              pypi
tk                          8.6.13          noxft_h4845f30_101                  conda-forge
torch                       2.6.0           pypi_0                              pypi
torchmetrics                1.7.0           pypi_0                              pypi
tqdm                        4.67.1          pypi_0                              pypi
triton                      3.2.0           pypi_0                              pypi
typing-extensions           4.12.2          pypi_0                              pypi
tzdata                      2025.2          pypi_0                              pypi
urllib3                     2.3.0           pypi_0                              pypi
utilsforecast               0.2.12          pypi_0                              pypi
wheel                       0.45.1          pyhd8ed1ab_1                        conda-forge
xlrd                        2.0.1           pypi_0                              pypi
yarl                        1.18.3          pypi_0                              pypi
```

bstewart311 avatar Mar 24 '25 08:03 bstewart311

@elephaint Theorizing that there might be a version misalignment in key packages causing these problems, I have experimented with several alternative installation approaches and have reproduced the error every time. Those approaches include:

  • Create a "virgin" conda environment and install neuralforecast via conda-forge (conda install -c conda-forge neuralforecast), which installs these key package versions:

```
neuralforecast      3.0.0   pyhd8ed1ab_0                    conda-forge
python              3.9.21  h9c0c6dc_1_cpython              conda-forge
ray-core            2.43.0  py39h55c4102_2                  conda-forge
ray-default         2.43.0  py39hd8b8447_2                  conda-forge
ray-tune            2.43.0  py39hf3d152e_2                  conda-forge
libtorch            2.6.0   cuda126_generic_h4a15719_200    conda-forge
pytorch             2.6.0   cuda126_generic_py39_h22b98e6_200  conda-forge
pytorch-lightning   2.5.1   pyh506cb10_0                    conda-forge
torchmetrics        1.7.0   pyhd8ed1ab_0                    conda-forge
cuda-version        12.8    h5d125a7_3                      conda-forge
```

  • Create a "virgin" conda environment with python=3.10 and install neuralforecast via conda-forge (conda install -c conda-forge neuralforecast), which installs these key package versions:

```
neuralforecast      3.0.0     pyhd8ed1ab_0                    conda-forge
python              3.10.16   he725a3c_1_cpython              conda-forge
ray-core            2.43.0    py310h34c9fef_0                 conda-forge
ray-default         2.43.0    py310h5955c3f_0                 conda-forge
ray-tune            2.43.0    py310hff52083_0                 conda-forge
libtorch            2.6.0     cuda126_generic_h4a15719_200    conda-forge
pytorch             2.6.0     cuda126_generic_py310_h9bb2754_200  conda-forge
pytorch-lightning   2.5.1     pyh506cb10_0                    conda-forge
torchmetrics        1.7.0     pyhd8ed1ab_0                    conda-forge
cuda-version        12.8      h5d125a7_3                      conda-forge
```

  • Per the instructions in neuralforecast/CONTRIBUTING.md, use neuralforecast/environment-cuda.yml (conda create -n neuralforecast python=3.10; conda env update -f environment-cuda.yml; conda install -c conda-forge neuralforecast), which installs:

```
neuralforecast      3.0.0     pyhd8ed1ab_0                    conda-forge
python              3.10.16   he725a3c_1_cpython              conda-forge
ray                 2.44.0    pypi_0                          pypi
ray-core            2.43.0    py310h34c9fef_0                 conda-forge
ray-default         2.43.0    py310h5955c3f_0                 conda-forge
ray-tune            2.43.0    py310hff52083_0                 conda-forge
pytorch             2.5.1     py3.10_cuda12.4_cudnn9.1.0_0    pytorch
pytorch-cuda        12.4      hc786d27_7                      pytorch
pytorch-lightning   2.5.1     pyh506cb10_0                    conda-forge
pytorch-mutex       1.0       cuda                            pytorch
torchmetrics        1.7.0     pyhd8ed1ab_0                    conda-forge
torchtriton         3.1.0     py310                           pytorch
cuda-version        12.8      3                               nvidia
```

  • Some variants of the above using mixed conda/pip commands

All of these approaches failed with the same "ray actor died unexpectedly" error above.

bstewart311 avatar Mar 24 '25 09:03 bstewart311

@elephaint I originally ran into this issue with my own datasets (both synthetic and real timeseries with and without covariates, and using AutoDeepAR/AutoRNN rather than AutoNHITS as in the above), so I believe this is neither a data issue nor an issue with the specific Auto... model.

bstewart311 avatar Mar 24 '25 10:03 bstewart311

@bstewart311 Thanks! I haven't had any issues with multi-gpu training on aws with NF 3.0.0, but I didn't try Auto models yet. Can you try with Optuna instead of Ray? That would help isolate if the problem is in our Auto model code or if it's the tuning backend.

You can specify using backend="optuna" in an Auto model.
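
To make the switch concrete: with the Optuna backend, the `config` argument becomes a callable that receives an Optuna trial object, rather than a Ray Tune search-space dict. A minimal sketch (the parameter names and ranges here are illustrative, not library defaults):

```python
# Hypothetical Optuna-style config callable for an Auto model.
# The trial object supplies the sampled hyperparameters.
def nhits_config(trial):
    return {
        "max_steps": 1000,
        "input_size": 24,
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 1e-1, log=True),
        "random_seed": trial.suggest_int("random_seed", 1, 10),
    }

# It would then be passed as, e.g.:
#   AutoNHITS(h=horizon, config=nhits_config, backend="optuna", ...)
```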

elephaint avatar Mar 24 '25 13:03 elephaint

@elephaint The run succeeded when I used backend='optuna'.

Here's the run_nhits.py that I updated with Optuna in place of Ray as the backend. I upped the max_steps to 1000 and saw that all 4 GPUs were fully running, so this is a success!


```python
import os
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

import argparse
import pandas as pd

from ray import tune
import optuna
optuna.logging.set_verbosity(optuna.logging.WARNING) # Use this to disable training prints from optuna

from neuralforecast.auto import AutoNHITS
from neuralforecast.core import NeuralForecast

from neuralforecast.losses.pytorch import MAE, HuberLoss
from neuralforecast.losses.numpy import mae, mse
#from datasetsforecast.long_horizon import LongHorizon, LongHorizonInfo
from datasetsforecast.long_horizon2 import LongHorizon2, LongHorizon2Info

import logging
logging.getLogger("pytorch_lightning").setLevel(logging.WARNING)


if __name__ == '__main__':

    # Parse execution parameters
    verbose = True
    parser = argparse.ArgumentParser()
    parser.add_argument("-horizon", "--horizon", type=int)
    parser.add_argument("-dataset", "--dataset", type=str)
    parser.add_argument("-num_samples", "--num_samples", default=5, type=int)

    args = parser.parse_args()
    horizon = args.horizon
    dataset = args.dataset
    num_samples = args.num_samples

    assert horizon in [96, 192, 336, 720]

    # Load dataset
    #Y_df, _, _ = LongHorizon.load(directory='./data/', group=dataset)
    #Y_df['ds'] = pd.to_datetime(Y_df['ds'])

    Y_df = LongHorizon2.load(directory='./data/', group=dataset)
    freq = LongHorizon2Info[dataset].freq
    n_time = len(Y_df.ds.unique())
    #val_size = int(.2 * n_time)
    #test_size = int(.2 * n_time)
    val_size = LongHorizon2Info[dataset].val_size
    test_size = LongHorizon2Info[dataset].test_size

    # Adapt input_size to available data
    input_size = tune.choice([7 * horizon])
    if dataset=='ETTm1' and horizon==720:
        input_size = tune.choice([2 * horizon])

    # nhits_config = {
    #     #"learning_rate": tune.choice([1e-3]),                                     # Initial Learning rate
    #     "learning_rate": tune.loguniform(1e-5, 5e-3),
    #     "max_steps": tune.choice([200, 1000]),                                    # Number of SGD steps
    #     "input_size": input_size,                                                 # input_size = multiplier * horizon
    #     "batch_size": tune.choice([7]),                                           # Number of series in windows
    #     "windows_batch_size": tune.choice([256]),                                 # Number of windows in batch
    #     "n_pool_kernel_size": tune.choice([[2, 2, 2], [16, 8, 1]]),               # MaxPool's Kernelsize
    #     "n_freq_downsample": tune.choice([[(96*7)//2, 96//2, 1],
    #                                       [(24*7)//2, 24//2, 1],
    #                                       [1, 1, 1]]),                            # Interpolation expressivity ratios
    #     "dropout_prob_theta": tune.choice([0.5]),                                 # Dropout regularization
    #     "activation": tune.choice(['ReLU']),                                      # Type of non-linear activation
    #     "n_blocks":  tune.choice([[1, 1, 1]]),                                    # Blocks per each 3 stacks
    #     "mlp_units":  tune.choice([[[512, 512], [512, 512], [512, 512]]]),        # 2 512-Layers per block for each stack
    #     "interpolation_mode": tune.choice(['linear']),                            # Type of multi-step interpolation
    #     "val_check_steps": tune.choice([100]),                                    # Compute validation every 100 steps
    #     "random_seed": tune.randint(1, 10),
    #     # "devices": 1,
    #     }

    def nhits_config(trial):
        return {
            "max_steps": 1000,                                                                                               # Number of SGD steps
            "input_size": 24,                                                                                               # Size of input window
            "learning_rate": trial.suggest_loguniform("learning_rate", 1e-5, 1e-1),                                         # Initial Learning rate
            # "n_pool_kernel_size": trial.suggest_categorical("n_pool_kernel_size", [[2, 2, 2], [16, 8, 1]]),                 # MaxPool's Kernelsize
            # "n_freq_downsample": trial.suggest_categorical("n_freq_downsample", [[168, 24, 1], [24, 12, 1], [1, 1, 1]]),    # Interpolation expressivity ratios
            "val_check_steps": 50,                                                                                          # Compute validation every 50 steps
            "random_seed": trial.suggest_int("random_seed", 1, 10),                                                         # Random seed
        }

    models = [AutoNHITS(h=horizon,
                        loss=HuberLoss(delta=0.5),
                        valid_loss=MAE(),
                        config=nhits_config,
                        num_samples=num_samples,
                        refit_with_val=True,
                        backend='optuna',
                       )]

    nf = NeuralForecast(models=models, freq=freq)

    Y_hat_df = nf.cross_validation(df=Y_df, val_size=val_size,
                                   test_size=test_size, n_windows=None)


    y_true = Y_hat_df.y.values
    y_hat = Y_hat_df['AutoNHITS'].values

    n_series = len(Y_df.unique_id.unique())

    y_true = y_true.reshape(n_series, -1, horizon)
    y_hat = y_hat.reshape(n_series, -1, horizon)

    print('\n'*4)
    print('Parsed results')
    print(f'NHITS {dataset} h={horizon}')
    print('test_size', test_size)
    print('y_true.shape (n_series, n_windows, n_time_out):\t', y_true.shape)
    print('y_hat.shape  (n_series, n_windows, n_time_out):\t', y_hat.shape)

    print('MSE: ', mse(y_hat, y_true))
    print('MAE: ', mae(y_hat, y_true))

    # Save Outputs
    if not os.path.exists(f'./data/{dataset}'):
        os.makedirs(f'./data/{dataset}')
    yhat_file = f'./data/{dataset}/{horizon}_forecasts.csv'
    Y_hat_df.to_csv(yhat_file, index=False)
```
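
As an aside, the reshape at the end of the script assumes `cross_validation` returns one row per (series, window, step), ordered by series first. A small numpy sketch of that layout (the sizes here are illustrative, not from the real dataset):

```python
import numpy as np

# Hypothetical sizes: 2 series, 3 cross-validation windows, horizon of 4.
n_series, n_windows, horizon = 2, 3, 4

# Flat column ordered by series, then window, then step within the window.
flat = np.arange(n_series * n_windows * horizon, dtype=float)

# Same reshape as in the script: (n_series, n_windows, horizon).
y = flat.reshape(n_series, -1, horizon)

print(y.shape)     # (2, 3, 4)
print(y[1, 0, 0])  # 12.0 -- first value of the second series
```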

bstewart311 avatar Mar 24 '25 18:03 bstewart311

@elephaint I thought I read in the Nixtla documentation a few weeks ago that Optuna only utilized 1 GPU, such that if one wants to utilize multiple GPUs one must use Ray Tune. I thought I read that here, but now I can't find it.

Has Optuna begun to support multiple GPUs recently? If so, it's great news! I'll do some more experiments on my end with Optuna, to see if it meets my needs. At the same time, I will keep this issue open and continue to help debug the Ray issue.

bstewart311 avatar Mar 24 '25 18:03 bstewart311

Thanks for reporting, great that this works now. The issue with Ray still remains, but it is then likely a harder issue for us to solve.

I'd have to read up on whether Optuna supports multi-gpu, but I guess apparently it does now 😜

elephaint avatar Mar 24 '25 19:03 elephaint