BigDL-2.x
Training Interrupted in Multi-Process Training
The steps to reproduce this issue are as follows:
- Prepare the environment by following the instructions in the README.
- Source the bigdl-nano-init script:
source bigdl-nano-init
- Start training with IPEX and multiple processes:
/root/anaconda3/envs/ipex1.9/bin/python /data/analytics-zoo/python/nano/example/pytorch/semantic_segmentation/semantic_segmentation.py --data_path=/data/kitti_datasets/ --use_ipex --num_processes=4
After several epochs, training is interrupted abruptly with the following error:
Epoch 35: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [01:47<00:00, 26.96s/it, loss=1.03, v_num=81]Traceback (most recent call last):
  File "/data/analytics-zoo/python/nano/example/pytorch/semantic_segmentation/semantic_segmentation.py", line 330, in <module>
    main(hparams)
  File "/data/analytics-zoo/python/nano/example/pytorch/semantic_segmentation/semantic_segmentation.py", line 314, in main
    trainer.fit(model)
  File "/root/anaconda3/envs/ipex1.9/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 552, in fit
    self._run(model)
  File "/root/anaconda3/envs/ipex1.9/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 917, in _run
    self._dispatch()
  File "/root/anaconda3/envs/ipex1.9/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 985, in _dispatch
    self.accelerator.start_training(self)
  File "/root/anaconda3/envs/ipex1.9/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 92, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/root/anaconda3/envs/ipex1.9/lib/python3.8/site-packages/bigdl/nano/pytorch/plugins/ddp_spawn.py", line 129, in start_training
    start_processes_new(self.new_process, **self.mp_spawn_kwargs)
  File "/root/anaconda3/envs/ipex1.9/lib/python3.8/site-packages/bigdl/nano/pytorch/plugins/ddp_spawn.py", line 87, in start_processes_new
    while not context.join():
  File "/root/anaconda3/envs/ipex1.9/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 130, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGKILL
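For readers unfamiliar with this message: Python's multiprocessing reports a worker killed by signal N as exit code -N, and the spawn helper turns that into the signal name shown in the exception. The sketch below illustrates that mapping using only the standard library. `describe_exit` is a hypothetical helper written for illustration, not part of BigDL or PyTorch, and it assumes a Unix-like system where SIGKILL is defined.

```python
import signal


def describe_exit(exitcode):
    """Describe a multiprocessing.Process exit code in plain words.

    Hypothetical helper for illustration only.
    A negative exit code -N means the child was terminated by signal N.
    """
    if exitcode is None:
        return "still running"
    if exitcode < 0:
        # Map the signal number back to its symbolic name, e.g. 9 -> SIGKILL.
        return f"terminated with signal {signal.Signals(-exitcode).name}"
    return f"exited with code {exitcode}"


if __name__ == "__main__":
    # A worker killed with SIGKILL has exit code -signal.SIGKILL.
    print(describe_exit(-signal.SIGKILL))
```

Since SIGKILL cannot be caught or handled by the worker itself, the kill most likely came from outside the Python process, which is why the traceback shows only the parent noticing the exit.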
What if IPEX is not used?
Without IPEX, the problem does not appear. With IPEX, it occurs in multi-process training regardless of whether version 1.8.0 or 1.9.0 is used.
However, because this example does not set the max_epoch parameter, we cannot say exactly when the SIGKILL will occur. In our tests so far, it occurs only in multi-process training with IPEX, within the first 150 epochs and usually after the 35th epoch.
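Because the kill arrives at an unpredictable epoch, one hypothesis that may be worth checking is that worker memory grows until the kernel's out-of-memory killer sends SIGKILL. This is an assumption, not something the tests so far confirm. A minimal sketch of a per-epoch memory log follows, using only the standard library. `peak_rss_mib` and `format_memory_line` are hypothetical helpers, and the unit conversion assumes Linux, where `ru_maxrss` is reported in KiB.

```python
import os
import resource


def peak_rss_mib():
    """Peak resident set size of the current process in MiB.

    Assumes Linux, where ru_maxrss is expressed in KiB.
    """
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0


def format_memory_line(epoch):
    """Build one log line; hypothetical helper, intended to be called once per epoch."""
    return f"pid={os.getpid()} epoch={epoch} peak_rss={peak_rss_mib():.1f} MiB"


if __name__ == "__main__":
    # In a real run this would be called from each worker at the end of an epoch.
    print(format_memory_line(35))
```

If the logged peak climbs steadily across epochs in the IPEX runs but stays flat without IPEX, that would support the memory hypothesis. An out-of-memory kill is normally also recorded in the kernel log, so checking `dmesg` after a failure is a cheaper first step.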