pytorch-forecasting
Multi-GPU training results in "ProcessExitedException process 0 terminated with signal SIGSEGV" exception for Baseline and TFT models.
- PyTorch-Forecasting version: 1.0.0
- PyTorch version: 2.0.1+cu117
- Lightning version: 2.0.4
- Python version: 3.10.11
- Operating System: Linux-5.10.0-23-cloud-amd64-x86_64-with-glibc2.31 (Google Cloud)
Expected behavior
I am trying to run the exact code from the stallion example for TFTs on a multi-GPU machine, in preparation for training a similar model on my own data in the same environment. I can run it without issue on a single-GPU machine, and I would expect it to run without issue on a multi-GPU machine as well, especially when specifying only one of the GPUs with devices=1. I have also tested a similar script with my own data and run into the same issues.
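For reference, a hedged sketch of what I mean by single-device use: predict() forwards trainer_kwargs to the Lightning Trainer (visible in the traceback below), so I would expect something along these lines to avoid spawning multiple processes. This is an untested sketch, not code from the tutorial.
# Sketch/assumption, not from the tutorial: pin prediction to a single GPU by
# forwarding Trainer arguments through trainer_kwargs.
baseline_predictions = Baseline().predict(
    val_dataloader,
    return_y=True,
    trainer_kwargs=dict(accelerator="gpu", devices=1),
)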
Actual behavior
When I run the same code on a multi-GPU machine, I get the following error both when running the Baseline model and when fitting the TFT model.
Baseline
# calculate baseline mean absolute error, i.e. predict next value as the last available value from the history
baseline_predictions = Baseline().predict(val_dataloader, return_y=True)
MAE()(baseline_predictions.output, baseline_predictions.y)
Output
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]
---------------------------------------------------------------------------
ProcessExitedException Traceback (most recent call last)
Cell In[6], line 2
1 # # calculate baseline mean absolute error, i.e. predict next value as the last available value from the history
----> 2 baseline_predictions = Baseline().predict(val_dataloader, return_y=True)
3 MAE()(baseline_predictions.output, baseline_predictions.y)
File ~/.local/lib/python3.10/site-packages/pytorch_forecasting/models/base_model.py:1423, in BaseModel.predict(self, data, mode, return_index, return_decoder_lengths, batch_size, num_workers, fast_dev_run, return_x, return_y, mode_kwargs, trainer_kwargs, write_interval, output_dir, **kwargs)
1421 logging.getLogger("pytorch_lightning").setLevel(logging.WARNING)
1422 trainer = Trainer(fast_dev_run=fast_dev_run, **trainer_kwargs)
-> 1423 trainer.predict(self, dataloader)
1424 logging.getLogger("lightning").setLevel(log_level_lighting)
1425 logging.getLogger("pytorch_lightning").setLevel(log_level_pytorch_lightning)
File /opt/conda/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py:845, in Trainer.predict(self, model, dataloaders, datamodule, return_predictions, ckpt_path)
843 model = _maybe_unwrap_optimized(model)
844 self.strategy._lightning_module = model
--> 845 return call._call_and_handle_interrupt(
846 self, self._predict_impl, model, dataloaders, datamodule, return_predictions, ckpt_path
847 )
File /opt/conda/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py:41, in _call_and_handle_interrupt(trainer, trainer_fn, *args, **kwargs)
39 try:
40 if trainer.strategy.launcher is not None:
---> 41 return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
42 return trainer_fn(*args, **kwargs)
44 except _TunerExitException:
File /opt/conda/lib/python3.10/site-packages/lightning/pytorch/strategies/launchers/multiprocessing.py:124, in _MultiProcessingLauncher.launch(self, function, trainer, *args, **kwargs)
116 process_context = mp.start_processes(
117 self._wrapping_function,
118 args=process_args,
(...)
121 join=False, # we will join ourselves to get the process references
122 )
123 self.procs = process_context.processes
--> 124 while not process_context.join():
125 pass
127 worker_output = return_queue.get()
File /opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py:140, in ProcessContext.join(self, timeout)
138 if exitcode < 0:
139 name = signal.Signals(-exitcode).name
--> 140 raise ProcessExitedException(
141 "process %d terminated with signal %s" %
142 (error_index, name),
143 error_index=error_index,
144 error_pid=failed_process.pid,
145 exit_code=exitcode,
146 signal_name=name
147 )
148 else:
149 raise ProcessExitedException(
150 "process %d terminated with exit code %d" %
151 (error_index, exitcode),
(...)
154 exit_code=exitcode
155 )
ProcessExitedException: process 0 terminated with signal SIGSEGV
TFT
# configure network and trainer
early_stop_callback = EarlyStopping(monitor="val_loss", min_delta=1e-4, patience=10, verbose=False, mode="min")
lr_logger = LearningRateMonitor() # log the learning rate
logger = TensorBoardLogger("lightning_logs") # logging results to a tensorboard
trainer = pl.Trainer(
    max_epochs=10,
    accelerator="cuda",  # added line vs example code
    strategy="ddp_notebook",  # added line vs example code
    devices=2,  # added line vs example code
    enable_model_summary=True,
    gradient_clip_val=0.1,
    limit_train_batches=50,  # comment in for training, running validation every 30 batches
    # fast_dev_run=True,  # comment in to check that network or dataset has no serious bugs
    callbacks=[lr_logger, early_stop_callback],
    logger=logger,
)
tft = TemporalFusionTransformer.from_dataset(
    training,
    learning_rate=0.03,
    hidden_size=16,
    attention_head_size=2,
    dropout=0.1,
    hidden_continuous_size=8,
    loss=QuantileLoss(),
    log_interval=10,  # uncomment for learning rate finder and otherwise, e.g. to 10 for logging every 10 batches
    optimizer="Ranger",
    reduce_on_plateau_patience=4,
)
print(f"Number of parameters in network: {tft.size()/1e3:.1f}k")
Output
[rank: 0] Global seed set to 42
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
[rank: 1] Global seed set to 42
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]
| Name | Type | Params
----------------------------------------------------------------------------------------
0 | loss | QuantileLoss | 0
1 | logging_metrics | ModuleList | 0
2 | input_embeddings | MultiEmbedding | 1.3 K
3 | prescalers | ModuleDict | 256
4 | static_variable_selection | VariableSelectionNetwork | 3.4 K
5 | encoder_variable_selection | VariableSelectionNetwork | 8.0 K
6 | decoder_variable_selection | VariableSelectionNetwork | 2.7 K
7 | static_context_variable_selection | GatedResidualNetwork | 1.1 K
8 | static_context_initial_hidden_lstm | GatedResidualNetwork | 1.1 K
9 | static_context_initial_cell_lstm | GatedResidualNetwork | 1.1 K
10 | static_context_enrichment | GatedResidualNetwork | 1.1 K
11 | lstm_encoder | LSTM | 2.2 K
12 | lstm_decoder | LSTM | 2.2 K
13 | post_lstm_gate_encoder | GatedLinearUnit | 544
14 | post_lstm_add_norm_encoder | AddNorm | 32
15 | static_enrichment | GatedResidualNetwork | 1.4 K
16 | multihead_attn | InterpretableMultiHeadAttention | 808
17 | post_attn_gate_norm | GateAddNorm | 576
18 | pos_wise_ff | GatedResidualNetwork | 1.1 K
19 | pre_output_gate_norm | GateAddNorm | 576
20 | output_layer | Linear | 119
----------------------------------------------------------------------------------------
29.4 K Trainable params
0 Non-trainable params
29.4 K Total params
0.118 Total estimated model params size (MB)
ProcessExitedException Traceback (most recent call last)
Cell In[11], line 2
1 # fit network
----> 2 trainer.fit(
3 tft,
4 train_dataloaders=train_dataloader,
5 val_dataloaders=val_dataloader,
6 )
File /opt/conda/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py:531, in Trainer.fit(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
529 model = _maybe_unwrap_optimized(model)
530 self.strategy._lightning_module = model
--> 531 call._call_and_handle_interrupt(
532 self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
533 )
File /opt/conda/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py:41, in _call_and_handle_interrupt(trainer, trainer_fn, *args, **kwargs)
39 try:
40 if trainer.strategy.launcher is not None:
---> 41 return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
42 return trainer_fn(*args, **kwargs)
44 except _TunerExitException:
File /opt/conda/lib/python3.10/site-packages/lightning/pytorch/strategies/launchers/multiprocessing.py:124, in _MultiProcessingLauncher.launch(self, function, trainer, *args, **kwargs)
116 process_context = mp.start_processes(
117 self._wrapping_function,
118 args=process_args,
(...)
121 join=False, # we will join ourselves to get the process references
122 )
123 self.procs = process_context.processes
--> 124 while not process_context.join():
125 pass
127 worker_output = return_queue.get()
File /opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py:140, in ProcessContext.join(self, timeout)
138 if exitcode < 0:
139 name = signal.Signals(-exitcode).name
--> 140 raise ProcessExitedException(
141 "process %d terminated with signal %s" %
142 (error_index, name),
143 error_index=error_index,
144 error_pid=failed_process.pid,
145 exit_code=exitcode,
146 signal_name=name
147 )
148 else:
149 raise ProcessExitedException(
150 "process %d terminated with exit code %d" %
151 (error_index, exitcode),
(...)
154 exit_code=exitcode
155 )
ProcessExitedException: process 0 terminated with signal SIGSEGV
Code to reproduce the problem
I copied the code exactly from here
The only changes were the additional multi-GPU parameters in the TFT Trainer call:
trainer = pl.Trainer(
    max_epochs=10,
    accelerator="cuda",  # added line vs example code
    strategy="ddp_notebook",  # added line vs example code
    devices=2,  # added line vs example code
    enable_model_summary=True,
    gradient_clip_val=0.1,
    limit_train_batches=50,  # comment in for training, running validation every 30 batches
    # fast_dev_run=True,  # comment in to check that network or dataset has no serious bugs
    callbacks=[lr_logger, early_stop_callback],
    logger=logger,
)
Potential Solution
I have spent a couple of weeks trying to resolve this, and it appears to be related, at least in part, to a memory-sharing issue between GPU processes. I have found one possible solution on the Lightning forum here, but I am still relatively new to this package and am struggling to find a generalized way to implement that fix while still building the model with the from_dataset() method and keeping the model flexible enough to train in CPU, single-GPU, and multi-GPU environments.
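For concreteness, here is a rough, untested sketch of the kind of workaround discussed on that forum thread: wrapping the tutorial's TimeSeriesDataSet objects in a LightningDataModule so each spawned DDP process builds its own dataloaders. The class name StallionDataModule is a placeholder; training and validation are assumed to be the TimeSeriesDataSet objects from the stallion example.
import lightning.pytorch as pl

# Untested sketch: each DDP worker calls the dataloader hooks itself, so the
# dataloaders do not have to be pickled across processes.
class StallionDataModule(pl.LightningDataModule):
    def __init__(self, training, validation, batch_size=128, num_workers=0):
        super().__init__()
        self.training = training        # TimeSeriesDataSet from the tutorial
        self.validation = validation    # TimeSeriesDataSet.from_dataset(training, ...)
        self.batch_size = batch_size
        self.num_workers = num_workers

    def train_dataloader(self):
        return self.training.to_dataloader(
            train=True, batch_size=self.batch_size, num_workers=self.num_workers
        )

    def val_dataloader(self):
        return self.validation.to_dataloader(
            train=False, batch_size=self.batch_size * 10, num_workers=self.num_workers
        )

# usage: trainer.fit(tft, datamodule=StallionDataModule(training, validation))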
I am also facing this issue
For what it's worth, adding
def train_dataloader(self):
    return train_dataloader
to https://github.com/jdb78/pytorch-forecasting/blob/d8a4462fb12de025f8bef852df1f5b48a7ae5b7c/pytorch_forecasting/models/temporal_fusion_transformer/__init__.py#L29
doesn't work. Perhaps unsurprisingly.
Yeah, it's frustrating. I'm still trying to work through the issue, though I've had to back-burner it lately for other priorities. Let me know if you figure anything out!
Same behavior. No luck trying to adapt the solution from the lightning forum.
Argh, I typed out a whole description of this and then lost the tab :(
I have a gist which runs the tutorial with two 3090s. Quick summary:
- Install pytorch-forecasting as a develop install
- Create your own TFT class
- Add `train_dataloader` and `test_dataloader` (note: not `val`). `lightning` uses `test` during training and keeps `val` as the final hold-out set; I think this package uses `test` and `val` the other way around?
- Create the TFT from `__init__`, not the handy `from_dataset`
Here's the gist; it is messy, but it should give you a start to solve your own problems.
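For anyone skimming, a rough, hypothetical sketch of the approach summarized above (not the actual gist); MyTFT and build_datasets() are placeholder names.
from pytorch_forecasting import TemporalFusionTransformer

# Placeholder for your own TimeSeriesDataSet construction; each spawned process
# calls it, so no dataloader has to be pickled across processes.
def build_datasets():
    raise NotImplementedError("build the train/validation TimeSeriesDataSet objects here")

class MyTFT(TemporalFusionTransformer):
    def train_dataloader(self):
        training, _ = build_datasets()
        return training.to_dataloader(train=True, batch_size=128)

    def test_dataloader(self):
        _, validation = build_datasets()
        return validation.to_dataloader(train=False, batch_size=1280)

# Construct via __init__ (spelling out the dataset-derived arguments that
# from_dataset() would normally infer), e.g. tft = MyTFT(...), then trainer.fit(tft).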
@this-josh - thanks! I'll give it a shot and let you know how it goes.
Trying the code, I get the error ProcessExitedException: process 0 terminated with signal SIGSEGV.
Sorry, I'm not sure why. I just ran this, and it works fine
curl -O https://gist.githubusercontent.com/this-josh/744345bea2053cc75c9d6388f317ca87/raw/49e29f974b83d2fea826db8dbc1dbc924a47b5e4/train.py
mamba create --prefix ./env python=3.10 -y
mamba activate /tmp/env
pip install pytorch-forecasting lightning numpy matplotlib torch pyarrow tensorboard
python train.py
(base) ➜ /tmp nvidia-smi [28/Sep/23 | 15:47]
Thu Sep 28 15:48:35 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06 Driver Version: 525.125.06 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 On | N/A |
| 0% 57C P8 36W / 350W | 448MiB / 24576MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce ... Off | 00000000:4C:00.0 Off | N/A |
| 0% 50C P8 27W / 350W | 18MiB / 24576MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 2375 G /usr/lib/xorg/Xorg 35MiB |
| 0 N/A N/A 22600 G /usr/lib/firefox/firefox 23MiB |
| 0 N/A N/A 29939 G /usr/lib/xorg/Xorg 37MiB |
| 0 N/A N/A 73571 G ...382312550063625843,131072 36MiB |
| 0 N/A N/A 74742 G gnome-control-center 4MiB |
| 0 N/A N/A 109859 G /usr/lib/xorg/Xorg 99MiB |
| 0 N/A N/A 109995 G /usr/bin/gnome-shell 130MiB |
| 1 N/A N/A 2375 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 29939 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 109859 G /usr/lib/xorg/Xorg 4MiB |
+-----------------------------------------------------------------------------+
result
29.4 K Trainable params
0 Non-trainable params
29.4 K Total params
0.118 Total estimated model params size (MB)
Epoch 0: 100%|████████████████████████████████████| 1/1 [00:00<00:00, 3.17it/s, train_loss_step=251.0, train_loss_epoch=251.0]`Trainer.fit` stopped: `max_steps=1` reached.
Epoch 0: 100%|████████████████████████████████████| 1/1 [00:00<00:00, 3.17it/s, train_loss_step=251.0, train_loss_epoch=251.0]
hope this helps.
Hm, maybe there are some issues because I am running on Databricks. Using the same versions of the packages, I get:
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Running in `fast_dev_run` mode: will run the requested loop using 1 batch(es). Logging and checkpointing is suppressed.
Number of parameters in network: 29.4k
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 2 processes
----------------------------------------------------------------------------------------------------
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
| Name | Type | Params
----------------------------------------------------------------------------------------
0 | loss | QuantileLoss | 0
1 | logging_metrics | ModuleList | 0
2 | input_embeddings | MultiEmbedding | 1.3 K
3 | prescalers | ModuleDict | 256
4 | static_variable_selection | VariableSelectionNetwork | 3.4 K
5 | encoder_variable_selection | VariableSelectionNetwork | 8.0 K
6 | decoder_variable_selection | VariableSelectionNetwork | 2.7 K
7 | static_context_variable_selection | GatedResidualNetwork | 1.1 K
8 | static_context_initial_hidden_lstm | GatedResidualNetwork | 1.1 K
9 | static_context_initial_cell_lstm | GatedResidualNetwork | 1.1 K
10 | static_context_enrichment | GatedResidualNetwork | 1.1 K
11 | lstm_encoder | LSTM | 2.2 K
12 | lstm_decoder | LSTM | 2.2 K
13 | post_lstm_gate_encoder | GatedLinearUnit | 544
14 | post_lstm_add_norm_encoder | AddNorm | 32
15 | static_enrichment | GatedResidualNetwork | 1.4 K
16 | multihead_attn | InterpretableMultiHeadAttention | 808
17 | post_attn_gate_norm | GateAddNorm | 576
18 | pos_wise_ff | GatedResidualNetwork | 1.1 K
19 | pre_output_gate_norm | GateAddNorm | 576
20 | output_layer | Linear | 119
----------------------------------------------------------------------------------------
29.4 K Trainable params
0 Non-trainable params
29.4 K Total params
0.118 Total estimated model params size (MB)
Error trace:
ProcessExitedException Traceback (most recent call last)
File <command-2218893583918656>, line 1041
1038 print(f"Number of parameters in network: {tft.size()/1e3:.1f}k")
1040 # fit network
-> 1041 trainer.fit(
1042 tft,
1043 # train_dataloaders=train_dataloader,
1044 # val_dataloaders=val_dataloader,
1045 )
File /databricks/python/lib/python3.10/site-packages/mlflow/utils/autologging_utils/safety.py:432, in safe_patch.<locals>.safe_patch_function(*args, **kwargs)
417 if (
418 active_session_failed
419 or autologging_is_disabled(autologging_integration)
(...)
426 # warning behavior during original function execution, since autologging is being
427 # skipped
428 with set_non_mlflow_warnings_behavior_for_current_thread(
429 disable_warnings=False,
430 reroute_warnings=False,
431 ):
--> 432 return original(*args, **kwargs)
434 # Whether or not the original / underlying function has been called during the
435 # execution of patched code
436 original_has_been_called = False
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py:532, in Trainer.fit(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
530 self.strategy._lightning_module = model
531 _verify_strategy_supports_compile(model, self.strategy)
--> 532 call._call_and_handle_interrupt(
533 self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
534 )
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py:42, in _call_and_handle_interrupt(trainer, trainer_fn, *args, **kwargs)
40 try:
41 if trainer.strategy.launcher is not None:
---> 42 return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
43 return trainer_fn(*args, **kwargs)
45 except _TunerExitException:
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/lightning/pytorch/strategies/launchers/multiprocessing.py:127, in _MultiProcessingLauncher.launch(self, function, trainer, *args, **kwargs)
119 process_context = mp.start_processes(
120 self._wrapping_function,
121 args=process_args,
(...)
124 join=False, # we will join ourselves to get the process references
125 )
126 self.procs = process_context.processes
--> 127 while not process_context.join():
128 pass
130 worker_output = return_queue.get()
File /databricks/python/lib/python3.10/site-packages/torch/multiprocessing/spawn.py:140, in ProcessContext.join(self, timeout)
138 if exitcode < 0:
139 name = signal.Signals(-exitcode).name
--> 140 raise ProcessExitedException(
141 "process %d terminated with signal %s" %
142 (error_index, name),
143 error_index=error_index,
144 error_pid=failed_process.pid,
145 exit_code=exitcode,
146 signal_name=name
147 )
148 else:
149 raise ProcessExitedException(
150 "process %d terminated with exit code %d" %
151 (error_index, exitcode),
(...)
154 exit_code=exitcode
155 )
ProcessExitedException: process 1 terminated with signal SIGSEGV
I fixed mine by launching the code as a script and using `strategy="auto"` in the Trainer, without any real changes to the model/dataset. Unfortunately, notebooks are notorious for having issues with multiprocessing.
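A minimal sketch of that change, assuming the tutorial's tft, train_dataloader, and val_dataloader, saved as a plain script (e.g. train_tft.py) and launched with python train_tft.py rather than from a notebook:
import lightning.pytorch as pl

# "auto" lets Lightning pick the strategy; outside a notebook with 2 GPUs this
# means regular DDP instead of the notebook-spawn launcher.
trainer = pl.Trainer(
    max_epochs=10,
    accelerator="cuda",
    devices=2,
    strategy="auto",
    gradient_clip_val=0.1,
)
trainer.fit(tft, train_dataloaders=train_dataloader, val_dataloaders=val_dataloader)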
I fixed mine by launching the code as a script and using `strategy="auto"` in the Trainer without any real changes to the model/dataset. Unfortunately, notebooks are notorious for having issues with multiprocessing.
Are you saying that this is all you changed? You didn't have to effectively rebuild the TFT class? So the fix would be >> run a script, not a notebook >> set strategy="auto". Can you share what version you were using?
I have been sidetracked away from this project for the past few months and am just getting back to it now. I appreciate the discussion from everyone though.
Are you saying that this is all you changed? You didn't have to effectively rebuild the TFT class? So the fix would be >> run a script, not a notebook >> set strategy="auto". Can you share what version you were using?
Yes, that's all I changed. No class rebuilding. I am on 1.0 (installed with pip). I think the notebook version has issues with sharing the dataset across multiple processes, but there is no issue when running as a script - fairly typical even on Ubuntu machines.
Thank you I will try it out and let you know. Feels too simple to be true, but that is usually how it goes.
@joseph-mcdonald: Did you find the solution? I am also facing the same issue but couldn't find any workaround.
@joseph-mcdonald: Did you find the solution? I am also facing the same issue but couldn't find any workaround.
Are you saying that this is all you changed? You didn't have to effectively rebuild the TFT class? So the fix would be >> run a script, not a notebook >> set strategy="auto". Can you share what version you were using?
Yes, that's all I changed. No class rebuilding. I am on 1.0 (installed with pip). I think the notebook version has issues with sharing the dataset across multiple processes, but there is no issue when running as a script - fairly typical even on Ubuntu machines.
Yes, tRosenflanz's response solved it for me: set strategy="auto" and don't run it in a notebook.