mmsegmentation
Intermittent segfault errors
Thanks for your error report and we appreciate it a lot.
Checklist
- I have searched related issues but cannot get the expected help. (Yes, found no issues related to segfaults)
- The bug has not been fixed in the latest version. (Yes)
Describe the bug
I am running mmsegmentation training repeatedly in the mmseg Docker container, with variations, always on the same set of images: a dataset of my own labeled images. Every so often the training fails partway through with a segmentation fault.
Note that this is the same system as https://github.com/open-mmlab/mmsegmentation/issues/1806 but with a different error, so a lot of the information will be the same.
Reproduction
- What command or script did you run?
I am running python tools/train.py /mmsegmentation/configs/{model_name}/{MCFG} --work-dir {DATA}{workdir}/ with a variety of model configs and unique work dirs. I do not know how to reproduce the error; it only appears intermittently. The error appears to have shown up 3 times in 51 runs (most with 6k iterations, a few with 30k). I don't really have a clue how to begin debugging this, so any debug suggestions would be appreciated.
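For context, each run is launched from a small driver script roughly like the following; the config names, data path, and work-dir naming below are placeholders sketching the setup, not my exact script:

# Hypothetical sketch of the launch loop; config names, DATA, and the
# work-dir naming are placeholders for my local setup.
import subprocess
import time

DATA = "/data/semseg_runs/"  # output root (placeholder)
runs = [
    ("bisenetv2", "bisenetv2_custom.py"),  # placeholder config file names
    ("segformer", "segformer_custom.py"),
]

for model_name, MCFG in runs:
    workdir = f"WORKDIR_{int(time.time() * 1e6)}"  # unique work dir per run
    subprocess.run(
        [
            "python", "tools/train.py",
            f"/mmsegmentation/configs/{model_name}/{MCFG}",
            "--work-dir", f"{DATA}{workdir}/",
        ],
        check=True,
    )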
- Did you make any modifications on the code or config? Did you understand what you have modified?
No modifications have been made to the code, but as I've been testing variations I have made a number of changes to both the dataset config (trying a variety of augmentations) and the model config (BiSeNet-v2 and Segformer). I believe I understand the modifications I've made; in support of that, the training runs to completion ~95% of the time. The other ~5% of the time it fails as described below.
- What dataset did you use?
I have a custom-labeled dataset with 2048x2448 images, 128 of them in img_dir/train/. It has 6 classes, and the decode heads of the model configs have been modified to reflect that.
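The decode-head change is just the class count. Here is a minimal sketch of the kind of edit, with everything except num_classes elided (the surrounding fields follow whichever base config is being varied):

# Sketch of the model-config edit: num_classes set to 6 in every head.
# All other fields are elided here; this is not the full config.
num_classes = 6  # background, vine, trunk, post, leaf, sign

model = dict(
    decode_head=dict(
        # ... other decode-head fields left as in the base config ...
        num_classes=num_classes,
    ),
    # The auxiliary heads (aux_0..aux_3 in the training log) were edited the
    # same way, each one getting num_classes=num_classes.
)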
Environment
- Please run python mmseg/utils/collect_env.py to collect necessary environment information and paste it here.
- You may add additional information that may be helpful for locating the problem, such as
  - How you installed PyTorch [e.g., pip, conda, source]
  - Other environment variables that may be related (such as $PATH, $LD_LIBRARY_PATH, $PYTHONPATH, etc.)
sys.platform: linux
Python: 3.7.7 (default, May 7 2020, 21:25:33) [GCC 7.3.0]
CUDA available: True
GPU 0: NVIDIA GeForce GTX 1080 Ti
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 10.1, V10.1.243
GCC: gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
PyTorch: 1.6.0
PyTorch compiling details: PyTorch built with:
- GCC 7.3
- C++ Version: 201402
- Intel(R) Math Kernel Library Version 2020.0.1 Product Build 20200208 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v1.5.0 (Git Hash e2ac1fac44c5078ca927cb9b90e1b3066a0b2ed0)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 10.1
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
- CuDNN 7.6.3
- Magma 2.5.2
- Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_VULKAN_WRAPPER -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF,
TorchVision: 0.7.0
OpenCV: 4.6.0
MMCV: 1.3.13
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 10.1
MMSegmentation: 0.26.0+891448f
Error traceback
2022-07-13 05:37:10,224 - mmseg - INFO - Iter [3400/6000] lr: 4.766e-03, eta: 0:32:08, time: 0.703, data_time: 0.268, memory: 7978, decode.loss_ce: 0.2584, decode.acc_seg: 89.9144, aux_0.loss_ce: 0.4057, aux_0.acc_seg: 85.5554, aux_1.loss_ce: 0.3346, aux_1.acc_seg: 87.5833, aux_2.loss_ce: 0.3693, aux_2.acc_seg: 84.5047, aux_3.loss_ce: 0.4570, aux_3.acc_seg: 78.7329, loss: 1.8250
2022-07-13 05:37:45,615 - mmseg - INFO - Iter [3450/6000] lr: 4.685e-03, eta: 0:31:29, time: 0.708, data_time: 0.273, memory: 7978, decode.loss_ce: 0.2573, decode.acc_seg: 89.9713, aux_0.loss_ce: 0.3945, aux_0.acc_seg: 86.0685, aux_1.loss_ce: 0.3284, aux_1.acc_seg: 87.7375, aux_2.loss_ce: 0.3652, aux_2.acc_seg: 84.6569, aux_3.loss_ce: 0.4518, aux_3.acc_seg: 78.8945, loss: 1.7973
2022-07-13 05:38:22,484 - mmseg - INFO - Saving checkpoint at 3500 iterations
2022-07-13 05:38:22,763 - mmseg - INFO - Iter [3500/6000] lr: 4.604e-03, eta: 0:30:52, time: 0.744, data_time: 0.303, memory: 7978, decode.loss_ce: 0.2540, decode.acc_seg: 90.0477, aux_0.loss_ce: 0.3951, aux_0.acc_seg: 85.9992, aux_1.loss_ce: 0.3257, aux_1.acc_seg: 87.7952, aux_2.loss_ce: 0.3627, aux_2.acc_seg: 84.7408, aux_3.loss_ce: 0.4547, aux_3.acc_seg: 78.8021, loss: 1.7924
[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 17/17, 1.4 task/s, elapsed: 13s, ETA: 0s[ ] 0/17, elapsed: 0s, ETA:
[>> ] 1/17, 0.4 task/s, elapsed: 3s, ETA: 45s
[>>>> ] 2/17, 0.6 task/s, elapsed: 3s, ETA: 26s
[>>>>>> ] 3/17, 0.8 task/s, elapsed: 4s, ETA: 19s
[>>>>>>>> ] 4/17, 0.9 task/s, elapsed: 5s, ETA: 15s
[>>>>>>>>>> ] 5/17, 1.0 task/s, elapsed: 5s, ETA: 12s
[>>>>>>>>>>>> ] 6/17, 1.0 task/s, elapsed: 6s, ETA: 11s
[>>>>>>>>>>>>>> ] 7/17, 1.1 task/s, elapsed: 6s, ETA: 9s
[>>>>>>>>>>>>>>>> ] 8/17, 1.2 task/s, elapsed: 7s, ETA: 8s
[>>>>>>>>>>>>>>>>>> ] 9/17, 1.2 task/s, elapsed: 7s, ETA: 7s
[>>>>>>>>>>>>>>>>>>> ] 10/17, 1.2 task/s, elapsed: 8s, ETA: 6s
[>>>>>>>>>>>>>>>>>>>>> ] 11/17, 1.3 task/s, elapsed: 9s, ETA: 5s
[>>>>>>>>>>>>>>>>>>>>>>> ] 12/17, 1.3 task/s, elapsed: 9s, ETA: 4s
[>>>>>>>>>>>>>>>>>>>>>>>> ] 13/17, 1.3 task/s, elapsed: 10s, ETA: 3s
[>>>>>>>>>>>>>>>>>>>>>>>>>> ] 14/17, 1.3 task/s, elapsed: 11s, ETA: 2s
[>>>>>>>>>>>>>>>>>>>>>>>>>>>> ] 15/17, 1.3 task/s, elapsed: 11s, ETA: 1s
[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ] 16/17, 1.4 task/s, elapsed: 12s, ETA: 1s
2022-07-13 05:38:35,215 - mmseg - INFO - per class results:
2022-07-13 05:38:35,216 - mmseg - INFO -
+------------+-------+-------+
| Class | IoU | Acc |
+------------+-------+-------+
| background | 92.58 | 96.33 |
| vine | 61.44 | 77.8 |
| trunk | 50.89 | 55.85 |
| post | 63.34 | 88.58 |
| leaf | 28.53 | 29.81 |
| sign | 80.0 | 89.11 |
+------------+-------+-------+
2022-07-13 05:38:35,216 - mmseg - INFO - Summary:
2022-07-13 05:38:35,216 - mmseg - INFO -
+-------+-------+-------+
| aAcc | mIoU | mAcc |
+-------+-------+-------+
| 92.21 | 62.79 | 72.91 |
+-------+-------+-------+
2022-07-13 05:38:35,216 - mmseg - INFO - Iter(val) [17] aAcc: 0.9221, mIoU: 0.6279, mAcc: 0.7291, IoU.background: 0.9258, IoU.vine: 0.6144, IoU.trunk: 0.5089, IoU.post: 0.6334, IoU.leaf: 0.2853, IoU.sign: 0.8000, Acc.background: 0.9633, Acc.vine: 0.7780, Acc.trunk: 0.5585, Acc.post: 0.8858, Acc.leaf: 0.2981, Acc.sign: 0.8911
2022-07-13 05:39:10,652 - mmseg - INFO - Iter [3550/6000] lr: 4.523e-03, eta: 0:30:23, time: 0.957, data_time: 0.522, memory: 7978, decode.loss_ce: 0.2495, decode.acc_seg: 90.1486, aux_0.loss_ce: 0.3914, aux_0.acc_seg: 86.2369, aux_1.loss_ce: 0.3207, aux_1.acc_seg: 87.9582, aux_2.loss_ce: 0.3573, aux_2.acc_seg: 84.9161, aux_3.loss_ce: 0.4454, aux_3.acc_seg: 79.0749, loss: 1.7643
2022-07-13 05:39:46,126 - mmseg - INFO - Iter [3600/6000] lr: 4.442e-03, eta: 0:29:44, time: 0.709, data_time: 0.275, memory: 7978, decode.loss_ce: 0.2518, decode.acc_seg: 90.1142, aux_0.loss_ce: 0.3845, aux_0.acc_seg: 86.2842, aux_1.loss_ce: 0.3220, aux_1.acc_seg: 87.9527, aux_2.loss_ce: 0.3617, aux_2.acc_seg: 84.7242, aux_3.loss_ce: 0.4522, aux_3.acc_seg: 78.8049, loss: 1.7723
2022-07-13 05:40:23,086 - mmseg - INFO - Iter [3650/6000] lr: 4.360e-03, eta: 0:29:07, time: 0.739, data_time: 0.304, memory: 7978, decode.loss_ce: 0.2526, decode.acc_seg: 90.1285, aux_0.loss_ce: 0.3876, aux_0.acc_seg: 86.2488, aux_1.loss_ce: 0.3207, aux_1.acc_seg: 88.0205, aux_2.loss_ce: 0.3574, aux_2.acc_seg: 84.9411, aux_3.loss_ce: 0.4482, aux_3.acc_seg: 79.0520, loss: 1.7664
2022-07-13 05:40:58,590 - mmseg - INFO - Iter [3700/6000] lr: 4.279e-03, eta: 0:28:29, time: 0.710, data_time: 0.276, memory: 7978, decode.loss_ce: 0.2537, decode.acc_seg: 90.0376, aux_0.loss_ce: 0.3896, aux_0.acc_seg: 86.1735, aux_1.loss_ce: 0.3228, aux_1.acc_seg: 87.8744, aux_2.loss_ce: 0.3622, aux_2.acc_seg: 84.6782, aux_3.loss_ce: 0.4531, aux_3.acc_seg: 78.7201, loss: 1.7814
2022-07-13 05:41:33,966 - mmseg - INFO - Iter [3750/6000] lr: 4.197e-03, eta: 0:27:51, time: 0.708, data_time: 0.273, memory: 7978, decode.loss_ce: 0.2489, decode.acc_seg: 90.3267, aux_0.loss_ce: 0.3862, aux_0.acc_seg: 86.3536, aux_1.loss_ce: 0.3177, aux_1.acc_seg: 88.2530, aux_2.loss_ce: 0.3564, aux_2.acc_seg: 85.1786, aux_3.loss_ce: 0.4450, aux_3.acc_seg: 79.3686, loss: 1.7543
2022-07-13 05:42:11,189 - mmseg - INFO - Iter [3800/6000] lr: 4.115e-03, eta: 0:27:13, time: 0.744, data_time: 0.310, memory: 7978, decode.loss_ce: 0.2491, decode.acc_seg: 90.2439, aux_0.loss_ce: 0.4027, aux_0.acc_seg: 85.8286, aux_1.loss_ce: 0.3309, aux_1.acc_seg: 87.7349, aux_2.loss_ce: 0.3590, aux_2.acc_seg: 84.9446, aux_3.loss_ce: 0.4413, aux_3.acc_seg: 79.3486, loss: 1.7830
2022-07-13 05:42:46,540 - mmseg - INFO - Iter [3850/6000] lr: 4.033e-03, eta: 0:26:35, time: 0.707, data_time: 0.272, memory: 7978, decode.loss_ce: 0.2555, decode.acc_seg: 90.0167, aux_0.loss_ce: 0.3989, aux_0.acc_seg: 85.8427, aux_1.loss_ce: 0.3296, aux_1.acc_seg: 87.6695, aux_2.loss_ce: 0.3643, aux_2.acc_seg: 84.7177, aux_3.loss_ce: 0.4525, aux_3.acc_seg: 78.7560, loss: 1.8007
2022-07-13 05:43:21,780 - mmseg - INFO - Iter [3900/6000] lr: 3.950e-03, eta: 0:25:57, time: 0.705, data_time: 0.270, memory: 7978, decode.loss_ce: 0.2559, decode.acc_seg: 89.9446, aux_0.loss_ce: 0.3918, aux_0.acc_seg: 86.1086, aux_1.loss_ce: 0.3286, aux_1.acc_seg: 87.7121, aux_2.loss_ce: 0.3675, aux_2.acc_seg: 84.4662, aux_3.loss_ce: 0.4551, aux_3.acc_seg: 78.5672, loss: 1.7989
ERROR: Unexpected segmentation fault encountered in worker.
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py", line 32, in __next__
data = next(self.iter_loader)
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 363, in __next__
data = self._next_data()
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 964, in _next_data
raise StopIteration
StopIteration
Bug fix
I do not have a bug fix. However, I noticed something interesting: the validation check (17 images every 500 iterations in the run posted above) appears to have a lagging image. You can see that it processes up to 16/17, then does a whole bunch of further training, then processes 17/17, and then crashes.
Actually though, I went back to a passing training and it also exhibits this behavior. This training run ran to completion:
[>>>>>>>>>>>>>>>>>>>>>>>>>>>> ] 15/17, 1.3 task/s, elapsed: 11s, ETA: 2s
[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ] 16/17, 1.3 task/s, elapsed: 12s, ETA: 1s
2022-07-13 20:31:12,197 - mmseg - INFO - per class results:
2022-07-13 20:31:12,198 - mmseg - INFO -
+------------+-------+-------+
| Class | IoU | Acc |
+------------+-------+-------+
| background | 92.17 | 98.09 |
| vine | 56.17 | 63.13 |
| trunk | 63.2 | 75.52 |
| post | 67.57 | 77.68 |
| leaf | 34.1 | 39.26 |
| sign | 81.48 | 89.06 |
+------------+-------+-------+
2022-07-13 20:31:12,198 - mmseg - INFO - Summary:
2022-07-13 20:31:12,199 - mmseg - INFO -
+-------+-------+-------+
| aAcc | mIoU | mAcc |
+-------+-------+-------+
| 92.19 | 65.78 | 73.79 |
+-------+-------+-------+
2022-07-13 20:31:12,199 - mmseg - INFO - Exp name: model_config.py
2022-07-13 20:31:12,199 - mmseg - INFO - Iter(val) [17] aAcc: 0.9219, mIoU: 0.6578, mAcc: 0.7379, IoU.background: 0.9217, IoU.vine: 0.5617, IoU.trunk: 0.6320, IoU.post: 0.6757, IoU.leaf: 0.3410, IoU.sign: 0.8148, Acc.background: 0.9809, Acc.vine: 0.6313, Acc.trunk: 0.7552, Acc.post: 0.7768, Acc.leaf: 0.3926, Acc.sign: 0.8906
2022-07-13 20:31:51,521 - mmseg - INFO - Iter [1050/6000] lr: 8.428e-03, eta: 1:08:04, time: 1.036, data_time: 0.600, memory: 7978, decode.loss_ce: 0.2987, decode.acc_seg: 87.1081, aux_0.loss_ce: 0.3554, aux_0.acc_seg: 86.2538, aux_1.loss_ce: 0.3429, aux_1.acc_seg: 85.6464, aux_2.loss_ce: 0.4007, aux_2.acc_seg: 81.6827, aux_3.loss_ce: 0.4809, aux_3.acc_seg: 77.5501, loss: 1.8786
2022-07-13 20:32:33,179 - mmseg - INFO - Iter [1100/6000] lr: 8.352e-03, eta: 1:07:25, time: 0.833, data_time: 0.396, memory: 7978, decode.loss_ce: 0.3005, decode.acc_seg: 87.0382, aux_0.loss_ce: 0.3486, aux_0.acc_seg: 86.5057, aux_1.loss_ce: 0.3397, aux_1.acc_seg: 85.6771, aux_2.loss_ce: 0.4002, aux_2.acc_seg: 81.6403, aux_3.loss_ce: 0.4844, aux_3.acc_seg: 77.3029, loss: 1.8734
2022-07-13 20:33:12,453 - mmseg - INFO - Iter [1150/6000] lr: 8.276e-03, eta: 1:06:35, time: 0.785, data_time: 0.349, memory: 7978, decode.loss_ce: 0.3001, decode.acc_seg: 86.9343, aux_0.loss_ce: 0.3549, aux_0.acc_seg: 86.2750, aux_1.loss_ce: 0.3505, aux_1.acc_seg: 85.2047, aux_2.loss_ce: 0.4087, aux_2.acc_seg: 81.2553, aux_3.loss_ce: 0.4862, aux_3.acc_seg: 77.3371, loss: 1.9005
2022-07-13 20:33:51,638 - mmseg - INFO - Iter [1200/6000] lr: 8.200e-03, eta: 1:05:46, time: 0.784, data_time: 0.347, memory: 7978, decode.loss_ce: 0.2933, decode.acc_seg: 87.2231, aux_0.loss_ce: 0.3426, aux_0.acc_seg: 86.7179, aux_1.loss_ce: 0.3369, aux_1.acc_seg: 85.6778, aux_2.loss_ce: 0.3962, aux_2.acc_seg: 81.7005, aux_3.loss_ce: 0.4765, aux_3.acc_seg: 77.6125, loss: 1.8455
2022-07-13 20:34:33,427 - mmseg - INFO - Iter [1250/6000] lr: 8.124e-03, eta: 1:05:07, time: 0.836, data_time: 0.399, memory: 7978, decode.loss_ce: 0.2989, decode.acc_seg: 86.9466, aux_0.loss_ce: 0.3415, aux_0.acc_seg: 86.6343, aux_1.loss_ce: 0.3402, aux_1.acc_seg: 85.5249, aux_2.loss_ce: 0.3998, aux_2.acc_seg: 81.4783, aux_3.loss_ce: 0.4803, aux_3.acc_seg: 77.3864, loss: 1.8607
2022-07-13 20:35:13,031 - mmseg - INFO - Iter [1300/6000] lr: 8.048e-03, eta: 1:04:21, time: 0.792, data_time: 0.356, memory: 7978, decode.loss_ce: 0.2893, decode.acc_seg: 87.3710, aux_0.loss_ce: 0.3321, aux_0.acc_seg: 87.0348, aux_1.loss_ce: 0.3296, aux_1.acc_seg: 85.9881, aux_2.loss_ce: 0.3913, aux_2.acc_seg: 81.9107, aux_3.loss_ce: 0.4709, aux_3.acc_seg: 77.9124, loss: 1.8131
2022-07-13 20:35:52,299 - mmseg - INFO - Iter [1350/6000] lr: 7.972e-03, eta: 1:03:34, time: 0.785, data_time: 0.349, memory: 7978, decode.loss_ce: 0.2898, decode.acc_seg: 87.3075, aux_0.loss_ce: 0.3336, aux_0.acc_seg: 87.0116, aux_1.loss_ce: 0.3314, aux_1.acc_seg: 85.9027, aux_2.loss_ce: 0.3940, aux_2.acc_seg: 81.7306, aux_3.loss_ce: 0.4780, aux_3.acc_seg: 77.5172, loss: 1.8267
2022-07-13 20:36:33,848 - mmseg - INFO - Iter [1400/6000] lr: 7.896e-03, eta: 1:02:54, time: 0.831, data_time: 0.395, memory: 7978, decode.loss_ce: 0.2929, decode.acc_seg: 87.1985, aux_0.loss_ce: 0.3459, aux_0.acc_seg: 86.4321, aux_1.loss_ce: 0.3365, aux_1.acc_seg: 85.6730, aux_2.loss_ce: 0.3985, aux_2.acc_seg: 81.4676, aux_3.loss_ce: 0.4780, aux_3.acc_seg: 77.3928, loss: 1.8518
2022-07-13 20:37:13,182 - mmseg - INFO - Iter [1450/6000] lr: 7.820e-03, eta: 1:02:08, time: 0.787, data_time: 0.351, memory: 7978, decode.loss_ce: 0.2852, decode.acc_seg: 87.4626, aux_0.loss_ce: 0.3338, aux_0.acc_seg: 86.8644, aux_1.loss_ce: 0.3277, aux_1.acc_seg: 86.0022, aux_2.loss_ce: 0.3880, aux_2.acc_seg: 81.9574, aux_3.loss_ce: 0.4684, aux_3.acc_seg: 77.9577, loss: 1.8031
2022-07-13 20:37:52,526 - mmseg - INFO - Saving checkpoint at 1500 iterations
2022-07-13 20:37:52,811 - mmseg - INFO - Iter [1500/6000] lr: 7.743e-03, eta: 1:01:23, time: 0.794, data_time: 0.352, memory: 7978, decode.loss_ce: 0.2887, decode.acc_seg: 87.4694, aux_0.loss_ce: 0.3361, aux_0.acc_seg: 86.9807, aux_1.loss_ce: 0.3320, aux_1.acc_seg: 85.9744, aux_2.loss_ce: 0.3936, aux_2.acc_seg: 81.8781, aux_3.loss_ce: 0.4738, aux_3.acc_seg: 77.7531, loss: 1.8243
[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 17/17, 1.4 task/s, elapsed: 12s, ETA: 0s[ ] 0/17, elapsed: 0s, ETA:
[>> ] 1/17, 0.3 task/s, elapsed: 3s, ETA: 47s
Thanks for your feedback. Perhaps it is caused by limited local computational resources on the large 2048x2448 images. Does this error happen on a smaller-size dataset?
That shouldn't be an issue; the network is not actually training on the full images. In the dataset augmentation I have:
crop_size = (480, 512)
...
train_pipeline = [
    ...
    dict(type="RandomCrop", crop_size=crop_size, cat_max_ratio=0.75),
    ...
]
I've also checked that the training images actually use this smaller size. Sorry for not making that clear in the initial post.
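For reference, the check was along these lines; a rough sketch assuming the usual mmseg 0.x dataset-building API, with the config path as a placeholder:

# Rough sketch of the crop-size check; the config path is a placeholder and
# this assumes a standard train pipeline ending in DefaultFormatBundle/Collect.
from mmcv import Config
from mmseg.datasets import build_dataset

cfg = Config.fromfile("/mmsegmentation/configs/bisenetv2/my_config.py")  # placeholder
dataset = build_dataset(cfg.data.train)

sample = dataset[0]
# 'img' and 'gt_semantic_seg' are DataContainers after formatting; the image
# tensor should be roughly (3, 480, 512), not the full 2048x2448 frame.
print(sample["img"].data.shape)
print(sample["gt_semantic_seg"].data.shape)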
Here's another example from this morning where the "ERROR: Unexpected segmentation fault encountered in worker." message appears the same but the context is different.
2022-07-22 06:26:03,194 - mmseg - INFO - Iter [1050/6000] lr: 8.428e-03, eta: 1:01:23, time: 2.551, data_time: 2.105, memory: 7978, decode.loss_ce: 0.3697, decode.acc_seg: 85.0045, aux_0.loss_ce: 0.4535, aux_0.acc_seg: 83.6738, aux_1.loss_ce: 0.4258, aux_1.acc_seg: 83.2825, aux_2.loss_ce: 0.4692, aux_2.acc_seg: 79.8277, aux_3.loss_ce: 0.5278, aux_3.acc_seg: 76.5396, loss: 2.2459
ERROR: Unexpected segmentation fault encountered in worker.
Traceback (most recent call last):
File "tools/train.py", line 242, in <module>
main()
File "tools/train.py", line 238, in main
meta=meta)
File "/mmsegmentation/mmseg/apis/train.py", line 194, in train_segmentor
runner.run(data_loaders, cfg.workflow)
File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py", line 133, in run
iter_runner(iter_loaders[i], **kwargs)
File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py", line 60, in train
outputs = self.model.train_step(data_batch, self.optimizer, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/mmcv/parallel/data_parallel.py", line 67, in train_step
return self.module.train_step(*inputs[0], **kwargs[0])
File "/mmsegmentation/mmseg/models/segmentors/base.py", line 138, in train_step
losses = self(**data_batch)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 98, in new_func
return old_func(*args, **kwargs)
File "/mmsegmentation/mmseg/models/segmentors/base.py", line 108, in forward
return self.forward_train(img, img_metas, **kwargs)
File "/mmsegmentation/mmseg/models/segmentors/encoder_decoder.py", line 144, in forward_train
gt_semantic_seg)
File "/mmsegmentation/mmseg/models/segmentors/encoder_decoder.py", line 88, in _decode_head_forward_train
self.train_cfg)
File "/mmsegmentation/mmseg/models/decode_heads/decode_head.py", line 204, in forward_train
losses = self.losses(seg_logits, gt_semantic_seg)
File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 186, in new_func
return old_func(*args, **kwargs)
File "/mmsegmentation/mmseg/models/decode_heads/decode_head.py", line 265, in losses
seg_logit, seg_label, ignore_index=self.ignore_index)
File "/mmsegmentation/mmseg/models/losses/accuracy.py", line 49, in accuracy
correct = correct[:, target != ignore_index]
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 3234) is killed by signal: Segmentation fault.
[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 153/153, 1.5 task/s, elapsed: 99s, ETA: 0sFailed to detect content-type automatically for artifact /home/eric/Desktop/SEMSEGTEST/WORKDIR_1658470371250420/20220722_061257.log.
Added application/json as content-type of artifact /home/eric/Desktop/SEMSEGTEST/WORKDIR_1658470371250420/20220722_061257.log.json.
The project where I was running into these errors is no longer active, so I don't have any new information. If anyone has any ideas, feel free to post, but I'll close this now.