Training on Objects365 fails with "Killing subprocess"
When I train DINO on the Objects365 dataset, it runs fine at first, but after more than 12 hours of training I get the error below. Have you met this before, and could you tell me how to fix it?
P.S. I am training on Objects365 with 8 V100 GPUs, python=3.8, torch=1.8.1, pycocotools=2.0.
Epoch: [0] [ 64640/108893] eta: 7:40:38 lr: 0.000100 class_error: 19.73 loss: 11.1260 (13.3494) loss_bbox: 0.1232 (0.1641) loss_bbox_0: 0.1365 (0.1752) loss_bbox_1: 0.1301 (0.1708) loss_bbox_2: 0.1244 (0.1673) loss_bbox_3: 0.1233 (0.1655) loss_bbox_4: 0.1226 (0.1645) loss_bbox_dn: 0.2098 (0.2629) loss_bbox_dn_0: 0.2963 (0.3500) loss_bbox_dn_1: 0.2350 (0.2882) loss_bbox_dn_2: 0.2177 (0.2704) loss_bbox_dn_3: 0.2104 (0.2645) loss_bbox_dn_4: 0.2098 (0.2629) loss_bbox_interm: 0.1642 (0.2025) loss_ce: 0.3113 (0.3783) loss_ce_0: 0.3550 (0.4337) loss_ce_1: 0.3328 (0.4030) loss_ce_2: 0.3225 (0.3888) loss_ce_3: 0.3157 (0.3825) loss_ce_4: 0.3107 (0.3794) loss_ce_dn: 0.0431 (0.0728) loss_ce_dn_0: 0.0883 (0.1183) loss_ce_dn_1: 0.0574 (0.0849) loss_ce_dn_2: 0.0497 (0.0763) loss_ce_dn_3: 0.0455 (0.0726) loss_ce_dn_4: 0.0440 (0.0723) loss_ce_interm: 0.3534 (0.4208) loss_giou: 0.4256 (0.5024) loss_giou_0: 0.4508 (0.5235) loss_giou_1: 0.4424 (0.5159) loss_giou_2: 0.4299 (0.5090) loss_giou_3: 0.4275 (0.5051) loss_giou_4: 0.4300 (0.5031) loss_giou_dn: 0.4571 (0.5503) loss_giou_dn_0: 0.6208 (0.7057) loss_giou_dn_1: 0.4956 (0.5955) loss_giou_dn_2: 0.4696 (0.5647) loss_giou_dn_3: 0.4604 (0.5542) loss_giou_dn_4: 0.4574 (0.5500) loss_giou_interm: 0.4875 (0.5776) cardinality_error_unscaled: 887.8750 (888.2050) cardinality_error_0_unscaled: 887.8750 (888.2049) cardinality_error_1_unscaled: 887.8750 (888.2022) cardinality_error_2_unscaled: 887.8750 (888.2043) cardinality_error_3_unscaled: 887.8750 (888.2048) cardinality_error_4_unscaled: 887.8750 (888.2048) cardinality_error_dn_unscaled: 173.6250 (174.9006) cardinality_error_dn_0_unscaled: 173.6250 (174.8984) cardinality_error_dn_1_unscaled: 173.6250 (174.9005) cardinality_error_dn_2_unscaled: 173.6250 (174.9005) cardinality_error_dn_3_unscaled: 173.6250 (174.9006) cardinality_error_dn_4_unscaled: 173.6250 (174.9005) cardinality_error_interm_unscaled: 887.8750 (888.2050) class_error_unscaled: 16.8622 (25.0795) loss_bbox_unscaled: 0.0246 (0.0328) loss_bbox_0_unscaled: 0.0273 (0.0350) loss_bbox_1_unscaled: 0.0260 (0.0342) loss_bbox_2_unscaled: 0.0249 (0.0335) loss_bbox_3_unscaled: 0.0247 (0.0331) loss_bbox_4_unscaled: 0.0245 (0.0329) loss_bbox_dn_unscaled: 0.0420 (0.0526) loss_bbox_dn_0_unscaled: 0.0593 (0.0700) loss_bbox_dn_1_unscaled: 0.0470 (0.0576) loss_bbox_dn_2_unscaled: 0.0435 (0.0541) loss_bbox_dn_3_unscaled: 0.0421 (0.0529) loss_bbox_dn_4_unscaled: 0.0420 (0.0526) loss_bbox_interm_unscaled: 0.0328 (0.0405) loss_ce_unscaled: 0.3113 (0.3783) loss_ce_0_unscaled: 0.3550 (0.4337) loss_ce_1_unscaled: 0.3328 (0.4030) loss_ce_2_unscaled: 0.3225 (0.3888) loss_ce_3_unscaled: 0.3157 (0.3825) loss_ce_4_unscaled: 0.3107 (0.3794) loss_ce_dn_unscaled: 0.0431 (0.0728) loss_ce_dn_0_unscaled: 0.0883 (0.1183) loss_ce_dn_1_unscaled: 0.0574 (0.0849) loss_ce_dn_2_unscaled: 0.0497 (0.0763) loss_ce_dn_3_unscaled: 0.0455 (0.0726) loss_ce_dn_4_unscaled: 0.0440 (0.0723) loss_ce_interm_unscaled: 0.3534 (0.4208) loss_giou_unscaled: 0.2128 (0.2512) loss_giou_0_unscaled: 0.2254 (0.2617) loss_giou_1_unscaled: 0.2212 (0.2579) loss_giou_2_unscaled: 0.2150 (0.2545) loss_giou_3_unscaled: 0.2137 (0.2526) loss_giou_4_unscaled: 0.2150 (0.2516) loss_giou_dn_unscaled: 0.2286 (0.2751) loss_giou_dn_0_unscaled: 0.3104 (0.3529) loss_giou_dn_1_unscaled: 0.2478 (0.2978) loss_giou_dn_2_unscaled: 0.2348 (0.2824) loss_giou_dn_3_unscaled: 0.2302 (0.2771) loss_giou_dn_4_unscaled: 0.2287 (0.2750) loss_giou_interm_unscaled: 0.2438 (0.2888) loss_hw_unscaled: 0.0164 (0.0216) loss_hw_0_unscaled: 0.0180 (0.0231) 
loss_hw_1_unscaled: 0.0179 (0.0225) loss_hw_2_unscaled: 0.0162 (0.0220) loss_hw_3_unscaled: 0.0163 (0.0218) loss_hw_4_unscaled: 0.0164 (0.0217) loss_hw_dn_unscaled: 0.0268 (0.0345) loss_hw_dn_0_unscaled: 0.0392 (0.0470) loss_hw_dn_1_unscaled: 0.0296 (0.0380) loss_hw_dn_2_unscaled: 0.0277 (0.0355) loss_hw_dn_3_unscaled: 0.0271 (0.0347) loss_hw_dn_4_unscaled: 0.0268 (0.0345) loss_hw_interm_unscaled: 0.0220 (0.0270) loss_xy_unscaled: 0.0089 (0.0112) loss_xy_0_unscaled: 0.0094 (0.0119) loss_xy_1_unscaled: 0.0094 (0.0117) loss_xy_2_unscaled: 0.0088 (0.0114) loss_xy_3_unscaled: 0.0087 (0.0113) loss_xy_4_unscaled: 0.0088 (0.0112) loss_xy_dn_unscaled: 0.0147 (0.0181) loss_xy_dn_0_unscaled: 0.0197 (0.0230) loss_xy_dn_1_unscaled: 0.0166 (0.0197) loss_xy_dn_2_unscaled: 0.0152 (0.0186) loss_xy_dn_3_unscaled: 0.0148 (0.0182) loss_xy_dn_4_unscaled: 0.0147 (0.0181) loss_xy_interm_unscaled: 0.0109 (0.0135) time: 0.6279 data: 0.0092 max mem: 16337
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in <module>
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
Killing subprocess 1018893
Killing subprocess 1018895
Killing subprocess 1018896
Killing subprocess 1018897
Killing subprocess 1018898
Killing subprocess 1018901
Killing subprocess 1018903
Killing subprocess 1018905
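These "Killing subprocess" lines only show the launcher tearing down the remaining workers after one of them died; the root cause is usually in the per-rank output or in the kernel log. As a hypothetical diagnostic (assuming a Linux host where `dmesg` is readable, and using the PIDs from the log above), one could check whether the OOM killer terminated one of the workers:

import re
import subprocess

# PIDs printed by torch.distributed.launch when it tore down the job
# (taken from the log above).
WORKER_PIDS = {1018893, 1018895, 1018896, 1018897,
               1018898, 1018901, 1018903, 1018905}

def find_oom_kills(pids):
    """Scan the kernel ring buffer for OOM-killer messages that mention
    any of the given PIDs. Requires permission to run `dmesg`."""
    out = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
    hits = []
    for line in out.splitlines():
        if "out of memory" in line.lower() or "oom" in line.lower():
            # Kernel OOM messages usually contain "Killed process <pid>".
            m = re.search(r"process (\d+)", line, re.IGNORECASE)
            if m and int(m.group(1)) in pids:
                hits.append(line)
    return hits

if __name__ == "__main__":
    for line in find_oom_kills(WORKER_PIDS):
        print(line)
    # No output means the kernel log has no OOM record for these PIDs,
    # so check the per-rank stdout/stderr or the cluster scheduler instead.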
Hey, we have not met this problem before.
It seems your program was killed. Maybe someone killed your job, or you did not allocate enough resources.
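If resources are the suspect, one way to confirm is to log host and GPU memory periodically from inside the training loop and see whether usage creeps up over the hours before the kill. A minimal sketch (psutil is an extra dependency, and where to call this inside DINO's training code is up to you):

import os
import psutil  # extra dependency: pip install psutil
import torch

def log_memory(prefix=""):
    """Print host RSS and CUDA memory for the current process.
    Call this every N iterations to see whether memory keeps growing
    until the job is killed."""
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1024**3
    msg = f"{prefix} host RSS: {rss_gb:.1f} GiB"
    if torch.cuda.is_available():
        alloc = torch.cuda.memory_allocated() / 1024**3
        reserved = torch.cuda.memory_reserved() / 1024**3
        msg += f" | cuda allocated: {alloc:.1f} GiB, reserved: {reserved:.1f} GiB"
    print(msg, flush=True)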