PaddleDetection
Training aborts with an error after running for a while
Search before asking
- [X] I have searched the question and found no related answer.
Please ask your question
Hello, I am training the ppyoloe m model, but training always aborts with an error after running for a while. What could be the cause?
workerlog.0 log:
/home/conda/envs/paddle/lib/python3.8/site-packages/paddle/vision/transforms/functional_pil.py:36: DeprecationWarning: NEAREST is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.NEAREST or Dither.NONE instead.
'nearest': Image.NEAREST,
/home/conda/envs/paddle/lib/python3.8/site-packages/paddle/vision/transforms/functional_pil.py:37: DeprecationWarning: BILINEAR is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BILINEAR instead.
'bilinear': Image.BILINEAR,
/home/conda/envs/paddle/lib/python3.8/site-packages/paddle/vision/transforms/functional_pil.py:38: DeprecationWarning: BICUBIC is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BICUBIC instead.
'bicubic': Image.BICUBIC,
/home/conda/envs/paddle/lib/python3.8/site-packages/paddle/vision/transforms/functional_pil.py:39: DeprecationWarning: BOX is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BOX instead.
'box': Image.BOX,
/home/conda/envs/paddle/lib/python3.8/site-packages/paddle/vision/transforms/functional_pil.py:40: DeprecationWarning: LANCZOS is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.LANCZOS instead.
'lanczos': Image.LANCZOS,
/home/conda/envs/paddle/lib/python3.8/site-packages/paddle/vision/transforms/functional_pil.py:41: DeprecationWarning: HAMMING is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.HAMMING instead.
'hamming': Image.HAMMING
/home/conda/envs/paddle/lib/python3.8/site-packages/paddle/tensor/creation.py:130: DeprecationWarning: np.object is a deprecated alias for the builtin object. To silence this warning, use object by itself. Doing this will not modify any behavior and is safe.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
if data.dtype == np.object:
server not ready, wait 3 sec to retry...
not ready endpoints:['127.0.0.1:34215', '127.0.0.1:24180', '127.0.0.1:12949', '127.0.0.1:48926']
I0805 16:50:14.215762 59132 nccl_context.cc:74] init nccl context nranks: 8 local rank: 0 gpu id: 0 ring id: 0
W0805 16:50:18.023058 59132 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 8.0, Driver API Version: 11.3, Runtime API Version: 11.0
W0805 16:50:18.056371 59132 device_context.cc:465] device: 0, cuDNN Version: 8.2.
loading annotations into memory...
Done (t=7.63s)
creating index...
index created!
[08/05 16:55:41] ppdet.utils.checkpoint INFO: Finish resuming model weights: output/ppyoloe_crn_m_300e_coco/3.pdparams
[08/05 16:56:04] ppdet.engine INFO: Epoch: [4] [ 0/756] learning_rate: 0.017600 loss: 2.322393 loss_cls: 1.170200 loss_iou: 0.232667 loss_dfl: 1.141054 loss_l1: 0.333063 eta: 51 days, 12:48:16 batch_cost: 19.8971 data_cost: 6.6981 ips: 0.9549 images/s
[08/05 16:59:14] ppdet.engine INFO: Epoch: [4] [100/756] learning_rate: 0.018182 loss: 1.760788 loss_cls: 0.847727 loss_iou: 0.197081 loss_dfl: 0.838693 loss_l1: 0.336464 eta: 5 days, 3:06:07 batch_cost: 1.8021 data_cost: 0.1008 ips: 10.5431 images/s
[08/05 17:02:21] ppdet.engine INFO: Epoch: [4] [200/756] learning_rate: 0.018764 loss: 1.726523 loss_cls: 0.825469 loss_iou: 0.199056 loss_dfl: 0.842408 loss_l1: 0.340040 eta: 4 days, 20:39:39 batch_cost: 1.7746 data_cost: 0.0278 ips: 10.7066 images/s
[08/05 17:05:30] ppdet.engine INFO: Epoch: [4] [300/756] learning_rate: 0.019346 loss: 1.709760 loss_cls: 0.812167 loss_iou: 0.195821 loss_dfl: 0.852661 loss_l1: 0.335590 eta: 4 days, 18:56:20 batch_cost: 1.7975 data_cost: 0.0628 ips: 10.5702 images/s
[08/05 17:08:31] ppdet.engine INFO: Epoch: [4] [400/756] learning_rate: 0.019928 loss: 1.515878 loss_cls: 0.726681 loss_iou: 0.170745 loss_dfl: 0.748486 loss_l1: 0.330758 eta: 4 days, 16:55:18 batch_cost: 1.7245 data_cost: 0.2093 ips: 11.0175 images/s
[08/05 17:11:42] ppdet.engine INFO: Epoch: [4] [500/756] learning_rate: 0.020510 loss: 1.645922 loss_cls: 0.773220 loss_iou: 0.188289 loss_dfl: 0.800534 loss_l1: 0.341043 eta: 4 days, 16:49:19 batch_cost: 1.8159 data_cost: 0.1056 ips: 10.4631 images/s
[08/05 17:14:36] ppdet.engine INFO: Epoch: [4] [600/756] learning_rate: 0.021092 loss: 1.519515 loss_cls: 0.728885 loss_iou: 0.177556 loss_dfl: 0.734887 loss_l1: 0.336475 eta: 4 days, 15:01:04 batch_cost: 1.6491 data_cost: 0.1961 ips: 11.5215 images/s
[08/05 17:17:31] ppdet.engine INFO: Epoch: [4] [700/756] learning_rate: 0.021674 loss: 1.520809 loss_cls: 0.710189 loss_iou: 0.172625 loss_dfl: 0.745666 loss_l1: 0.351451 eta: 4 days, 13:54:34 batch_cost: 1.6710 data_cost: 0.0336 ips: 11.3701 images/s
[08/05 17:19:16] ppdet.engine INFO: Epoch: [5] [ 0/756] learning_rate: 0.022000 loss: 1.642954 loss_cls: 0.767144 loss_iou: 0.187482 loss_dfl: 0.795157 loss_l1: 0.356641 eta: 4 days, 13:51:06 batch_cost: 1.7064 data_cost: 0.1237 ips: 11.1348 images/s
[08/05 17:21:54] ppdet.engine INFO: Epoch: [5] [100/756] learning_rate: 0.022000 loss: 1.675558 loss_cls: 0.771206 loss_iou: 0.189614 loss_dfl: 0.793927 loss_l1: 0.345280 eta: 4 days, 11:48:06 batch_cost: 1.4963 data_cost: 0.0231 ips: 12.6979 images/s
[08/05 17:24:36] ppdet.engine INFO: Epoch: [5] [200/756] learning_rate: 0.022000 loss: 1.663341 loss_cls: 0.776074 loss_iou: 0.178256 loss_dfl: 0.807338 loss_l1: 0.341900 eta: 4 days, 10:30:15 batch_cost: 1.5478 data_cost: 0.0205 ips: 12.2755 images/s
[08/05 17:27:12] ppdet.engine INFO: Epoch: [5] [300/756] learning_rate: 0.022000 loss: 1.563646 loss_cls: 0.718987 loss_iou: 0.177160 loss_dfl: 0.765810 loss_l1: 0.331434 eta: 4 days, 9:01:15 batch_cost: 1.4755 data_cost: 0.1710 ips: 12.8772 images/s
[08/05 17:29:49] ppdet.engine INFO: Epoch: [5] [400/756] learning_rate: 0.022000 loss: 1.507682 loss_cls: 0.705642 loss_iou: 0.167345 loss_dfl: 0.724220 loss_l1: 0.341726 eta: 4 days, 7:54:34 batch_cost: 1.4984 data_cost: 0.0541 ips: 12.6803 images/s
[08/05 17:32:25] ppdet.engine INFO: Epoch: [5] [500/756] learning_rate: 0.022000 loss: 1.534397 loss_cls: 0.728890 loss_iou: 0.169574 loss_dfl: 0.734923 loss_l1: 0.320769 eta: 4 days, 6:52:29 batch_cost: 1.4794 data_cost: 0.2584 ips: 12.8429 images/s
[08/05 17:35:28] ppdet.engine INFO: Epoch: [5] [600/756] learning_rate: 0.022000 loss: 1.749256 loss_cls: 0.806570 loss_iou: 0.192514 loss_dfl: 0.865929 loss_l1: 0.340179 eta: 4 days, 7:15:04 batch_cost: 1.7572 data_cost: 0.0288 ips: 10.8128 images/s
[08/05 17:38:03] ppdet.engine INFO: Epoch: [5] [700/756] learning_rate: 0.022000 loss: 1.585816 loss_cls: 0.737731 loss_iou: 0.187380 loss_dfl: 0.764219 loss_l1: 0.323577 eta: 4 days, 6:21:43 batch_cost: 1.4723 data_cost: 0.0173 ips: 12.9049 images/s
[08/05 17:39:22] ppdet.utils.checkpoint INFO: Save checkpoint: output/ppyoloe_crn_m_300e_coco
[08/05 17:39:38] ppdet.engine INFO: Epoch: [6] [ 0/756] learning_rate: 0.022000 loss: 1.617880 loss_cls: 0.755030 loss_iou: 0.187380 loss_dfl: 0.781778 loss_l1: 0.327239 eta: 4 days, 6:08:05 batch_cost: 1.5398 data_cost: 0.1657 ips: 12.3394 images/s
[08/05 17:42:23] ppdet.engine INFO: Epoch: [6] [100/756] learning_rate: 0.021999 loss: 1.594169 loss_cls: 0.751437 loss_iou: 0.177114 loss_dfl: 0.773430 loss_l1: 0.330835 eta: 4 days, 5:47:20 batch_cost: 1.5759 data_cost: 0.3786 ips: 12.0566 images/s
[08/05 17:45:11] ppdet.engine INFO: Epoch: [6] [200/756] learning_rate: 0.021999 loss: 1.540644 loss_cls: 0.748565 loss_iou: 0.173193 loss_dfl: 0.767030 loss_l1: 0.326987 eta: 4 days, 5:37:00 batch_cost: 1.6143 data_cost: 0.0146 ips: 11.7697 images/s
[08/05 17:47:50] ppdet.engine INFO: Epoch: [6] [300/756] learning_rate: 0.021999 loss: 1.562756 loss_cls: 0.730502 loss_iou: 0.178442 loss_dfl: 0.738331 loss_l1: 0.326924 eta: 4 days, 5:07:01 batch_cost: 1.5139 data_cost: 0.0542 ips: 12.5504 images/s
[08/05 17:50:43] ppdet.engine INFO: Epoch: [6] [400/756] learning_rate: 0.021999 loss: 1.675293 loss_cls: 0.800194 loss_iou: 0.192313 loss_dfl: 0.822982 loss_l1: 0.341401 eta: 4 days, 5:08:09 batch_cost: 1.6600 data_cost: 0.0936 ips: 11.4458 images/s
[08/05 17:53:23] ppdet.engine INFO: Epoch: [6] [500/756] learning_rate: 0.021999 loss: 1.759085 loss_cls: 0.827808 loss_iou: 0.207623 loss_dfl: 0.832703 loss_l1: 0.333490 eta: 4 days, 4:44:22 batch_cost: 1.5264 data_cost: 0.1019 ips: 12.4476 images/s
[08/05 17:56:18] ppdet.engine INFO: Epoch: [6] [600/756] learning_rate: 0.021999 loss: 1.682204 loss_cls: 0.783615 loss_iou: 0.188381 loss_dfl: 0.800629 loss_l1: 0.327185 eta: 4 days, 4:49:19 batch_cost: 1.6792 data_cost: 0.0272 ips: 11.3146 images/s
[08/05 17:58:54] ppdet.engine INFO: Epoch: [6] [700/756] learning_rate: 0.021998 loss: 1.581580 loss_cls: 0.766178 loss_iou: 0.180530 loss_dfl: 0.769263 loss_l1: 0.318792 eta: 4 days, 4:21:09 batch_cost: 1.4850 data_cost: 0.0422 ips: 12.7944 images/s
[08/05 18:00:40] ppdet.engine INFO: Epoch: [7] [ 0/756] learning_rate: 0.021998 loss: 1.551912 loss_cls: 0.737773 loss_iou: 0.181749 loss_dfl: 0.760793 loss_l1: 0.320199 eta: 4 days, 4:34:15 batch_cost: 1.6366 data_cost: 0.2225 ips: 11.6097 images/s
[08/05 18:03:50] ppdet.engine INFO: Epoch: [7] [100/756] learning_rate: 0.021998 loss: 1.595082 loss_cls: 0.750081 loss_iou: 0.178533 loss_dfl: 0.807499 loss_l1: 0.310898 eta: 4 days, 5:02:34 batch_cost: 1.8338 data_cost: 0.0317 ips: 10.3611 images/s
[08/05 18:06:29] ppdet.engine INFO: Epoch: [7] [200/756] learning_rate: 0.021998 loss: 1.591978 loss_cls: 0.723541 loss_iou: 0.172927 loss_dfl: 0.752688 loss_l1: 0.327044 eta: 4 days, 4:40:55 batch_cost: 1.5163 data_cost: 0.1096 ips: 12.5305 images/s
[08/05 18:09:21] ppdet.engine INFO: Epoch: [7] [300/756] learning_rate: 0.021998 loss: 1.623348 loss_cls: 0.760549 loss_iou: 0.189315 loss_dfl: 0.781861 loss_l1: 0.309717 eta: 4 days, 4:39:30 batch_cost: 1.6470 data_cost: 0.0184 ips: 11.5363 images/s
[08/05 18:12:04] ppdet.engine INFO: Epoch: [7] [400/756] learning_rate: 0.021997 loss: 1.588164 loss_cls: 0.748990 loss_iou: 0.173227 loss_dfl: 0.763723 loss_l1: 0.314628 eta: 4 days, 4:26:44 batch_cost: 1.5654 data_cost: 0.0348 ips: 12.1374 images/s
[08/05 18:14:46] ppdet.engine INFO: Epoch: [7] [500/756] learning_rate: 0.021997 loss: 1.551833 loss_cls: 0.747872 loss_iou: 0.173597 loss_dfl: 0.768164 loss_l1: 0.323995 eta: 4 days, 4:13:08 batch_cost: 1.5538 data_cost: 0.0369 ips: 12.2284 images/s
[08/05 18:17:35] ppdet.engine INFO: Epoch: [7] [600/756] learning_rate: 0.021997 loss: 1.573492 loss_cls: 0.744613 loss_iou: 0.171412 loss_dfl: 0.780535 loss_l1: 0.329639 eta: 4 days, 4:07:42 batch_cost: 1.6113 data_cost: 0.0168 ips: 11.7916 images/s
[08/05 18:20:21] ppdet.engine INFO: Epoch: [7] [700/756] learning_rate: 0.021996 loss: 1.715485 loss_cls: 0.809767 loss_iou: 0.185178 loss_dfl: 0.803900 loss_l1: 0.304498 eta: 4 days, 3:59:52 batch_cost: 1.5904 data_cost: 0.0126 ips: 11.9469 images/s
[08/05 18:21:54] ppdet.utils.checkpoint INFO: Save checkpoint: output/ppyoloe_crn_m_300e_coco
[08/05 18:22:08] ppdet.engine INFO: Epoch: [8] [ 0/756] learning_rate: 0.021996 loss: 1.723679 loss_cls: 0.817825 loss_iou: 0.193669 loss_dfl: 0.834135 loss_l1: 0.315692 eta: 4 days, 4:08:55 batch_cost: 1.6621 data_cost: 0.1058 ips: 11.4313 images/s
[08/05 18:24:50] ppdet.engine INFO: Epoch: [8] [100/756] learning_rate: 0.021996 loss: 1.556847 loss_cls: 0.759977 loss_iou: 0.176902 loss_dfl: 0.766240 loss_l1: 0.314873 eta: 4 days, 3:54:45 batch_cost: 1.5359 data_cost: 0.0554 ips: 12.3704 images/s
[08/05 18:27:48] ppdet.engine INFO: Epoch: [8] [200/756] learning_rate: 0.021995 loss: 1.606704 loss_cls: 0.748175 loss_iou: 0.177527 loss_dfl: 0.794344 loss_l1: 0.309099 eta: 4 days, 4:01:00 batch_cost: 1.7087 data_cost: 0.0102 ips: 11.1193 images/s
[08/05 18:30:25] ppdet.engine INFO: Epoch: [8] [300/756] learning_rate: 0.021995 loss: 1.638346 loss_cls: 0.744690 loss_iou: 0.188172 loss_dfl: 0.809123 loss_l1: 0.316546 eta: 4 days, 3:43:20 batch_cost: 1.4973 data_cost: 0.0144 ips: 12.6891 images/s
[08/05 18:33:10] ppdet.engine INFO: Epoch: [8] [400/756] learning_rate: 0.021995 loss: 1.661588 loss_cls: 0.798166 loss_iou: 0.192623 loss_dfl: 0.810671 loss_l1: 0.324218 eta: 4 days, 3:35:02 batch_cost: 1.5763 data_cost: 0.0203 ips: 12.0536 images/s
[08/05 18:35:52] ppdet.engine INFO: Epoch: [8] [500/756] learning_rate: 0.021994 loss: 1.605274 loss_cls: 0.749992 loss_iou: 0.187279 loss_dfl: 0.761527 loss_l1: 0.320901 eta: 4 days, 3:24:02 batch_cost: 1.5475 data_cost: 0.0113 ips: 12.2781 images/s
[08/05 18:38:29] ppdet.engine INFO: Epoch: [8] [600/756] learning_rate: 0.021994 loss: 1.703462 loss_cls: 0.799115 loss_iou: 0.199279 loss_dfl: 0.826031 loss_l1: 0.326202 eta: 4 days, 3:08:05 batch_cost: 1.4937 data_cost: 0.0186 ips: 12.7198 images/s
[08/05 18:40:59] ppdet.engine INFO: Epoch: [8] [700/756] learning_rate: 0.021993 loss: 1.466967 loss_cls: 0.673969 loss_iou: 0.172898 loss_dfl: 0.715640 loss_l1: 0.329912 eta: 4 days, 2:46:28 batch_cost: 1.4291 data_cost: 0.0134 ips: 13.2949 images/s
[08/05 18:42:31] ppdet.engine INFO: Epoch: [9] [ 0/756] learning_rate: 0.021993 loss: 1.621051 loss_cls: 0.753548 loss_iou: 0.184051 loss_dfl: 0.772653 loss_l1: 0.335265 eta: 4 days, 2:40:44 batch_cost: 1.5375 data_cost: 0.0808 ips: 12.3578 images/s
[08/05 18:45:15] ppdet.engine INFO: Epoch: [9] [100/756] learning_rate: 0.021993 loss: 1.605355 loss_cls: 0.760110 loss_iou: 0.179274 loss_dfl: 0.773499 loss_l1: 0.310461 eta: 4 days, 2:33:00 batch_cost: 1.5614 data_cost: 0.0129 ips: 12.1683 images/s
[08/05 18:47:48] ppdet.engine INFO: Epoch: [9] [200/756] learning_rate: 0.021992 loss: 1.538047 loss_cls: 0.727057 loss_iou: 0.172175 loss_dfl: 0.740262 loss_l1: 0.314967 eta: 4 days, 2:15:45 batch_cost: 1.4551 data_cost: 0.0120 ips: 13.0573 images/s
[08/05 18:50:22] ppdet.engine INFO: Epoch: [9] [300/756] learning_rate: 0.021992 loss: 1.515745 loss_cls: 0.723279 loss_iou: 0.169937 loss_dfl: 0.724049 loss_l1: 0.303068 eta: 4 days, 2:00:32 batch_cost: 1.4696 data_cost: 0.1026 ips: 12.9283 images/s
[08/05 18:53:04] ppdet.engine INFO: Epoch: [9] [400/756] learning_rate: 0.021991 loss: 1.426154 loss_cls: 0.701391 loss_iou: 0.165431 loss_dfl: 0.738665 loss_l1: 0.317456 eta: 4 days, 1:52:04 batch_cost: 1.5398 data_cost: 0.0451 ips: 12.3389 images/s
[08/05 18:55:37] ppdet.engine INFO: Epoch: [9] [500/756] learning_rate: 0.021991 loss: 1.579743 loss_cls: 0.747759 loss_iou: 0.173331 loss_dfl: 0.779101 loss_l1: 0.317948 eta: 4 days, 1:36:47 batch_cost: 1.4569 data_cost: 0.0142 ips: 13.0415 images/s
[08/05 18:58:15] ppdet.engine INFO: Epoch: [9] [600/756] learning_rate: 0.021990 loss: 1.718383 loss_cls: 0.758945 loss_iou: 0.185399 loss_dfl: 0.816455 loss_l1: 0.317968 eta: 4 days, 1:26:18 batch_cost: 1.5073 data_cost: 0.0167 ips: 12.6053 images/s
[08/05 19:00:51] ppdet.engine INFO: Epoch: [9] [700/756] learning_rate: 0.021990 loss: 1.552198 loss_cls: 0.730863 loss_iou: 0.165452 loss_dfl: 0.765627 loss_l1: 0.309582 eta: 4 days, 1:14:41 batch_cost: 1.4891 data_cost: 0.0153 ips: 12.7590 images/s
[08/05 19:02:14] ppdet.utils.checkpoint INFO: Save checkpoint: output/ppyoloe_crn_m_300e_coco
[08/05 19:02:33] ppdet.engine INFO: Epoch: [10] [ 0/756] learning_rate: 0.021989 loss: 1.523439 loss_cls: 0.722765 loss_iou: 0.167792 loss_dfl: 0.754635 loss_l1: 0.317419 eta: 4 days, 1:17:11 batch_cost: 1.6005 data_cost: 0.1180 ips: 11.8715 images/s
[08/05 19:05:12] ppdet.engine INFO: Epoch: [10] [100/756] learning_rate: 0.021989 loss: 1.375348 loss_cls: 0.657894 loss_iou: 0.154710 loss_dfl: 0.694749 loss_l1: 0.319167 eta: 4 days, 1:08:11 batch_cost: 1.5169 data_cost: 0.0414 ips: 12.5252 images/s
[08/05 19:07:44] ppdet.engine INFO: Epoch: [10] [200/756] learning_rate: 0.021988 loss: 1.506416 loss_cls: 0.706033 loss_iou: 0.169383 loss_dfl: 0.725946 loss_l1: 0.313098 eta: 4 days, 0:53:46 batch_cost: 1.4432 data_cost: 0.1488 ips: 13.1652 images/s
[08/05 19:10:20] ppdet.engine INFO: Epoch: [10] [300/756] learning_rate: 0.021987 loss: 1.543183 loss_cls: 0.724108 loss_iou: 0.181722 loss_dfl: 0.759843 loss_l1: 0.301185 eta: 4 days, 0:42:41 batch_cost: 1.4808 data_cost: 0.0601 ips: 12.8311 images/s
[08/05 19:12:51] ppdet.engine INFO: Epoch: [10] [400/756] learning_rate: 0.021987 loss: 1.560220 loss_cls: 0.743759 loss_iou: 0.170398 loss_dfl: 0.761391 loss_l1: 0.308441 eta: 4 days, 0:28:48 batch_cost: 1.4382 data_cost: 0.0271 ips: 13.2112 images/s
[08/05 19:15:25] ppdet.engine INFO: Epoch: [10] [500/756] learning_rate: 0.021986 loss: 1.694777 loss_cls: 0.817188 loss_iou: 0.190030 loss_dfl: 0.822628 loss_l1: 0.316269 eta: 4 days, 0:17:32 batch_cost: 1.4681 data_cost: 0.0132 ips: 12.9423 images/s
[08/05 19:17:55] ppdet.engine INFO: Epoch: [10] [600/756] learning_rate: 0.021986 loss: 1.744348 loss_cls: 0.827604 loss_iou: 0.192651 loss_dfl: 0.836215 loss_l1: 0.318675 eta: 4 days, 0:03:21 batch_cost: 1.4221 data_cost: 0.0183 ips: 13.3604 images/s
[08/05 19:20:34] ppdet.engine INFO: Epoch: [10] [700/756] learning_rate: 0.021985 loss: 1.512237 loss_cls: 0.709934 loss_iou: 0.160450 loss_dfl: 0.722271 loss_l1: 0.311786 eta: 3 days, 23:56:02 batch_cost: 1.5143 data_cost: 0.0570 ips: 12.5474 images/s
[08/05 19:22:17] ppdet.engine INFO: Epoch: [11] [ 0/756] learning_rate: 0.021984 loss: 1.673549 loss_cls: 0.792350 loss_iou: 0.192116 loss_dfl: 0.820863 loss_l1: 0.311159 eta: 3 days, 23:59:59 batch_cost: 1.5877 data_cost: 0.2152 ips: 11.9668 images/s
[08/05 19:25:29] ppdet.engine INFO: Epoch: [11] [100/756] learning_rate: 0.021984 loss: 1.440710 loss_cls: 0.689467 loss_iou: 0.157232 loss_dfl: 0.703341 loss_l1: 0.298309 eta: 4 days, 0:14:25 batch_cost: 1.8348 data_cost: 0.0846 ips: 10.3551 images/s
[08/05 19:28:31] ppdet.engine INFO: Epoch: [11] [200/756] learning_rate: 0.021983 loss: 1.523798 loss_cls: 0.711809 loss_iou: 0.164414 loss_dfl: 0.760219 loss_l1: 0.302910 eta: 4 days, 0:21:37 batch_cost: 1.7351 data_cost: 0.0463 ips: 10.9502 images/s
[08/05 19:31:40] ppdet.engine INFO: Epoch: [11] [300/756] learning_rate: 0.021982 loss: 1.513933 loss_cls: 0.717096 loss_iou: 0.167225 loss_dfl: 0.739312 loss_l1: 0.297152 eta: 4 days, 0:32:39 batch_cost: 1.7996 data_cost: 0.0283 ips: 10.5578 images/s
[08/05 19:34:48] ppdet.engine INFO: Epoch: [11] [400/756] learning_rate: 0.021982 loss: 1.501986 loss_cls: 0.704406 loss_iou: 0.162619 loss_dfl: 0.734104 loss_l1: 0.315250 eta: 4 days, 0:42:45 batch_cost: 1.7927 data_cost: 0.1247 ips: 10.5984 images/s
[08/05 19:37:58] ppdet.engine INFO: Epoch: [11] [500/756] learning_rate: 0.021981 loss: 1.409885 loss_cls: 0.677379 loss_iou: 0.158090 loss_dfl: 0.680260 loss_l1: 0.316385 eta: 4 days, 0:53:57 batch_cost: 1.8174 data_cost: 0.0184 ips: 10.4548 images/s
[08/05 19:40:57] ppdet.engine INFO: Epoch: [11] [600/756] learning_rate: 0.021980 loss: 1.512646 loss_cls: 0.728667 loss_iou: 0.176164 loss_dfl: 0.762893 loss_l1: 0.312988 eta: 4 days, 0:57:44 batch_cost: 1.7050 data_cost: 0.0586 ips: 11.1435 images/s
[08/05 19:44:08] ppdet.engine INFO: Epoch: [11] [700/756] learning_rate: 0.021979 loss: 1.502114 loss_cls: 0.722872 loss_iou: 0.176638 loss_dfl: 0.743730 loss_l1: 0.313116 eta: 4 days, 1:08:13 batch_cost: 1.8193 data_cost: 0.0230 ips: 10.4433 images/s
[08/05 19:45:50] ppdet.utils.checkpoint INFO: Save checkpoint: output/ppyoloe_crn_m_300e_coco
[08/05 19:46:05] ppdet.engine INFO: Epoch: [12] [ 0/756] learning_rate: 0.021979 loss: 1.489654 loss_cls: 0.714086 loss_iou: 0.164123 loss_dfl: 0.727358 loss_l1: 0.300987 eta: 4 days, 1:18:16 batch_cost: 1.9331 data_cost: 0.1582 ips: 9.8286 images/s
[08/05 19:49:03] ppdet.engine INFO: Epoch: [12] [100/756] learning_rate: 0.021978 loss: 1.535723 loss_cls: 0.735075 loss_iou: 0.175064 loss_dfl: 0.775793 loss_l1: 0.320222 eta: 4 days, 1:20:08 batch_cost: 1.6859 data_cost: 0.3077 ips: 11.2700 images/s
[08/05 19:52:02] ppdet.engine INFO: Epoch: [12] [200/756] learning_rate: 0.021977 loss: 1.574669 loss_cls: 0.741899 loss_iou: 0.181115 loss_dfl: 0.766592 loss_l1: 0.327325 eta: 4 days, 1:23:24 batch_cost: 1.7127 data_cost: 0.4614 ips: 11.0938 images/s
[08/05 19:54:58] ppdet.engine INFO: Epoch: [12] [300/756] learning_rate: 0.021976 loss: 1.521750 loss_cls: 0.733138 loss_iou: 0.168299 loss_dfl: 0.754864 loss_l1: 0.290447 eta: 4 days, 1:24:15 batch_cost: 1.6736 data_cost: 0.0200 ips: 11.3528 images/s
[08/05 19:58:08] ppdet.engine INFO: Epoch: [12] [400/756] learning_rate: 0.021976 loss: 1.549378 loss_cls: 0.732208 loss_iou: 0.172740 loss_dfl: 0.738835 loss_l1: 0.308204 eta: 4 days, 1:32:37 batch_cost: 1.8097 data_cost: 0.0635 ips: 10.4991 images/s
[08/05 20:00:51] ppdet.engine INFO: Epoch: [12] [500/756] learning_rate: 0.021975 loss: 1.582717 loss_cls: 0.731991 loss_iou: 0.184041 loss_dfl: 0.778287 loss_l1: 0.306987 eta: 4 days, 1:26:22 batch_cost: 1.5513 data_cost: 0.0579 ips: 12.2475 images/s
[08/05 20:03:39] ppdet.engine INFO: Epoch: [12] [600/756] learning_rate: 0.021974 loss: 1.487326 loss_cls: 0.709646 loss_iou: 0.157174 loss_dfl: 0.735097 loss_l1: 0.304186 eta: 4 days, 1:22:18 batch_cost: 1.5896 data_cost: 0.0192 ips: 11.9530 images/s
[08/05 20:06:40] ppdet.engine INFO: Epoch: [12] [700/756] learning_rate: 0.021973 loss: 1.467928 loss_cls: 0.706993 loss_iou: 0.169246 loss_dfl: 0.714468 loss_l1: 0.292888 eta: 4 days, 1:25:39 batch_cost: 1.7273 data_cost: 0.0195 ips: 11.0001 images/s
[08/05 20:08:16] ppdet.engine INFO: Epoch: [13] [ 0/756] learning_rate: 0.021972 loss: 1.455560 loss_cls: 0.698659 loss_iou: 0.166491 loss_dfl: 0.718708 loss_l1: 0.292016 eta: 4 days, 1:24:02 batch_cost: 1.7230 data_cost: 0.1362 ips: 11.0274 images/s
[08/05 20:10:50] ppdet.engine INFO: Epoch: [13] [100/756] learning_rate: 0.021972 loss: 1.355267 loss_cls: 0.643903 loss_iou: 0.158441 loss_dfl: 0.692785 loss_l1: 0.293440 eta: 4 days, 1:13:09 batch_cost: 1.4595 data_cost: 0.0138 ips: 13.0181 images/s
[08/05 20:13:32] ppdet.engine INFO: Epoch: [13] [200/756] learning_rate: 0.021971 loss: 1.449767 loss_cls: 0.685130 loss_iou: 0.159514 loss_dfl: 0.720046 loss_l1: 0.304720 eta: 4 days, 1:07:10 batch_cost: 1.5500 data_cost: 0.2751 ips: 12.2581 images/s
[08/05 20:16:22] ppdet.engine INFO: Epoch: [13] [300/756] learning_rate: 0.021970 loss: 1.450139 loss_cls: 0.688248 loss_iou: 0.159984 loss_dfl: 0.693148 loss_l1: 0.296720 eta: 4 days, 1:04:51 batch_cost: 1.6202 data_cost: 0.0677 ips: 11.7268 images/s
[08/05 20:18:53] ppdet.engine INFO: Epoch: [13] [400/756] learning_rate: 0.021969 loss: 1.616142 loss_cls: 0.758870 loss_iou: 0.184080 loss_dfl: 0.798853 loss_l1: 0.303233 eta: 4 days, 0:53:24 batch_cost: 1.4380 data_cost: 0.1258 ips: 13.2128 images/s
[08/05 20:21:34] ppdet.engine INFO: Epoch: [13] [500/756] learning_rate: 0.021968 loss: 1.499214 loss_cls: 0.725733 loss_iou: 0.169427 loss_dfl: 0.745422 loss_l1: 0.321531 eta: 4 days, 0:46:38 batch_cost: 1.5278 data_cost: 0.0542 ips: 12.4363 images/s
[08/05 20:24:10] ppdet.engine INFO: Epoch: [13] [600/756] learning_rate: 0.021967 loss: 1.496453 loss_cls: 0.734000 loss_iou: 0.168168 loss_dfl: 0.756591 loss_l1: 0.307149 eta: 4 days, 0:38:15 batch_cost: 1.4924 data_cost: 0.0121 ips: 12.7308 images/s
Found inf or nan, current scale is: 32768.0, decrease to: 32768.0*0.5
[08/05 20:26:42] ppdet.engine INFO: Epoch: [13] [700/756] learning_rate: 0.021966 loss: 1.509843 loss_cls: 0.702937 loss_iou: 0.170641 loss_dfl: 0.748320 loss_l1: 0.322213 eta: 4 days, 0:27:14 batch_cost: 1.4342 data_cost: 0.0119 ips: 13.2480 images/s
[08/05 20:28:08] ppdet.utils.checkpoint INFO: Save checkpoint: output/ppyoloe_crn_m_300e_coco
[08/05 20:28:21] ppdet.engine INFO: Epoch: [14] [ 0/756] learning_rate: 0.021965 loss: 1.490198 loss_cls: 0.700627 loss_iou: 0.160917 loss_dfl: 0.744214 loss_l1: 0.318852 eta: 4 days, 0:26:55 batch_cost: 1.6022 data_cost: 0.1070 ips: 11.8589 images/s
[08/05 20:31:33] ppdet.engine INFO: Epoch: [14] [100/756] learning_rate: 0.021964 loss: 1.429422 loss_cls: 0.654224 loss_iou: 0.169717 loss_dfl: 0.726778 loss_l1: 0.296086 eta: 4 days, 0:34:55 batch_cost: 1.8329 data_cost: 0.0394 ips: 10.3658 images/s
[08/05 20:34:34] ppdet.engine INFO: Epoch: [14] [200/756] learning_rate: 0.021963 loss: 1.482516 loss_cls: 0.698548 loss_iou: 0.159747 loss_dfl: 0.720718 loss_l1: 0.311291 eta: 4 days, 0:37:09 batch_cost: 1.7148 data_cost: 0.0220 ips: 11.0803 images/s
[08/05 20:37:38] ppdet.engine INFO: Epoch: [14] [300/756] learning_rate: 0.021962 loss: 1.580098 loss_cls: 0.753470 loss_iou: 0.164715 loss_dfl: 0.757042 loss_l1: 0.298298 eta: 4 days, 0:41:18 batch_cost: 1.7594 data_cost: 0.0968 ips: 10.7990 images/s
[08/05 20:40:47] ppdet.engine INFO: Epoch: [14] [400/756] learning_rate: 0.021961 loss: 1.479481 loss_cls: 0.694133 loss_iou: 0.166504 loss_dfl: 0.748529 loss_l1: 0.314536 eta: 4 days, 0:47:26 batch_cost: 1.8072 data_cost: 0.0257 ips: 10.5133 images/s
[08/05 20:43:45] ppdet.engine INFO: Epoch: [14] [500/756] learning_rate: 0.021960 loss: 1.431082 loss_cls: 0.654331 loss_iou: 0.159842 loss_dfl: 0.684977 loss_l1: 0.313637 eta: 4 days, 0:48:05 batch_cost: 1.6894 data_cost: 0.0255 ips: 11.2469 images/s
[08/05 20:46:35] ppdet.engine INFO: Epoch: [14] [600/756] learning_rate: 0.021959 loss: 1.627365 loss_cls: 0.760310 loss_iou: 0.185618 loss_dfl: 0.799637 loss_l1: 0.325742 eta: 4 days, 0:45:04 batch_cost: 1.6085 data_cost: 0.0231 ips: 11.8126 images/s
[08/05 20:49:37] ppdet.engine INFO: Epoch: [14] [700/756] learning_rate: 0.021958 loss: 1.573179 loss_cls: 0.738426 loss_iou: 0.171765 loss_dfl: 0.779407 loss_l1: 0.310565 eta: 4 days, 0:47:36 batch_cost: 1.7352 data_cost: 0.1711 ips: 10.9496 images/s
C++ Traceback (most recent call last):
0 paddle::imperative::BasicEngine::Execute()
1 paddle::imperative::PreparedOp::Run(paddle::imperative::NameVariableWrapperMap const&, paddle::imperative::NameVariableWrapperMap const&, paddle::framework::AttributeMap const&, paddle::framework::AttributeMap const&)
2 std::_Function_handler<void (paddle::framework::ExecutionContext const&), paddle::framework::OpKernelRegistrarFunctor<paddle::platform::CUDAPlace, false, 0ul, paddle::operators::MaskedSelectGradCUDAKernel<paddle::platform::CUDADeviceContext, float>, paddle::operators::MaskedSelectGradCUDAKernel<paddle::platform::CUDADeviceContext, double>, paddle::operators::MaskedSelectGradCUDAKernel<paddle::platform::CUDADeviceContext, int>, paddle::operators::MaskedSelectGradCUDAKernel<paddle::platform::CUDADeviceContext, long> >::operator()(char const*, char const*, int) const::{lambda(paddle::framework::ExecutionContext const&)#1}>::_M_invoke(std::_Any_data const&, paddle::framework::ExecutionContext const&)
3 paddle::operators::MaskedSelectGradCUDAKernel<paddle::platform::CUDADeviceContext, float>::Compute(paddle::framework::ExecutionContext const&) const
4 thrust::detail::vector_base<int, thrust::device_allocator
Error Message Summary:
FatalError: Termination signal is detected by the operating system.
[TimeInfo: *** Aborted at 1659732619 (unix time) try "date -d @1659732619" if you are using GNU date ***]
[SignalInfo: *** SIGTERM (@0xe68c) received by PID 59132 (TID 0x7faa9f6c9740) from PID 59020 ***]
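For reference, the TimeInfo and SignalInfo lines above can be decoded with a few lines of Python. This is only a sketch for reading the log; the variable names are illustrative, and the values are copied from the two lines above.

```python
from datetime import datetime, timezone

# Values copied from the TimeInfo and SignalInfo lines of workerlog.0.
abort_unix_time = 1659732619
sender_field = "0xe68c"       # the "@0xe68c" field of SignalInfo
reported_sender_pid = 59020   # the "from PID 59020" field of SignalInfo

# Same conversion as `date -u -d @1659732619`.
abort_utc = datetime.fromtimestamp(abort_unix_time, tz=timezone.utc)
print(abort_utc.isoformat())  # 2022-08-05T20:50:19+00:00

# The hex field after "@" is the sender PID written in hexadecimal.
sender_pid = int(sender_field, 16)
print(sender_pid, sender_pid == reported_sender_pid)  # 59020 True
```

So this worker (PID 59132) received SIGTERM from another process (PID 59020) rather than raising the error itself. My assumption, not confirmed by this log, is that PID 59020 is the paddle.distributed.launch parent stopping the remaining workers after a different rank failed, in which case the root error would be in one of the other workerlog files.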
workerlog.5 error log:
/home/conda/envs/paddle/lib/python3.8/site-packages/paddle/vision/transforms/functional_pil.py:36: DeprecationWarning: NEAREST is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.NEAREST or Dither.NONE instead.
'nearest': Image.NEAREST,
/home/conda/envs/paddle/lib/python3.8/site-packages/paddle/vision/transforms/functional_pil.py:37: DeprecationWarning: BILINEAR is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BILINEAR instead.
'bilinear': Image.BILINEAR,
/home/conda/envs/paddle/lib/python3.8/site-packages/paddle/vision/transforms/functional_pil.py:38: DeprecationWarning: BICUBIC is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BICUBIC instead.
'bicubic': Image.BICUBIC,
/home/conda/envs/paddle/lib/python3.8/site-packages/paddle/vision/transforms/functional_pil.py:39: DeprecationWarning: BOX is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BOX instead.
'box': Image.BOX,
/home/conda/envs/paddle/lib/python3.8/site-packages/paddle/vision/transforms/functional_pil.py:40: DeprecationWarning: LANCZOS is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.LANCZOS instead.
'lanczos': Image.LANCZOS,
/home/conda/envs/paddle/lib/python3.8/site-packages/paddle/vision/transforms/functional_pil.py:41: DeprecationWarning: HAMMING is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.HAMMING instead.
'hamming': Image.HAMMING
/home/conda/envs/paddle/lib/python3.8/site-packages/paddle/tensor/creation.py:130: DeprecationWarning: np.object is a deprecated alias for the builtin object. To silence this warning, use object by itself. Doing this will not modify any behavior and is safe.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
if data.dtype == np.object:
I0805 16:50:10.935909 59217 gen_comm_id_helper.cc:190] Server listening on: 127.0.0.1:58872 successful.
I0805 16:50:14.215627 59217 nccl_context.cc:74] init nccl context nranks: 8 local rank: 5 gpu id: 5 ring id: 0
W0805 16:50:18.039150 59217 device_context.cc:447] Please NOTE: device: 5, GPU Compute Capability: 8.0, Driver API Version: 11.3, Runtime API Version: 11.0
W0805 16:50:18.057273 59217 device_context.cc:465] device: 5, cuDNN Version: 8.2.
loading annotations into memory...
Done (t=6.85s)
creating index...
index created!
Found inf or nan, current scale is: 32768.0, decrease to: 32768.0*0.5
Traceback (most recent call last):
File "tools/train.py", line 177, in
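The "Found inf or nan, current scale is: 32768.0, decrease to: 32768.0*0.5" line in both logs looks like the message a dynamic loss scaler prints under --amp when it sees overflowing gradients. The sketch below is a toy model of that behavior under my own assumptions (class and parameter names are hypothetical); it is not Paddle's GradScaler implementation.

```python
import math


class ToyLossScaler:
    """Toy dynamic loss scaler; NOT Paddle's actual implementation."""

    def __init__(self, init_scale=32768.0, decr_ratio=0.5):
        self.scale = init_scale
        self.decr_ratio = decr_ratio

    def step(self, grads):
        """Return True if the optimizer update should be applied."""
        if any(math.isinf(g) or math.isnan(g) for g in grads):
            # Overflow detected: skip this update and shrink the scale.
            self.scale *= self.decr_ratio
            return False
        return True


scaler = ToyLossScaler()
print(scaler.step([0.1, float("inf")]), scaler.scale)  # False 16384.0
print(scaler.step([0.1, 0.2]), scaler.scale)           # True 16384.0
```

If the real scaler behaves like this, the message alone marks a skipped step and a halved scale, not necessarily the crash itself.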
Software versions: paddlepaddle-gpu: 2.2.2.post110, paddledetection: 2.4, cuda: 11.3
Training command: python -m paddle.distributed.launch --gpus 0,1,2,3,4,5,6,7 tools/train.py -c configs/ppyoloe/ppyoloe_crn_m_300e_coco.yml -r output/ppyoloe_crn_m_300e_coco/3 --amp
If I train without --amp, the following error log appears instead:
/home/conda/envs/pdpd/lib/python3.8/site-packages/paddle/vision/transforms/functional_pil.py:36: DeprecationWarning: NEAREST is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.NEAREST or Dither.NONE instead.
'nearest': Image.NEAREST,
/home/conda/envs/pdpd/lib/python3.8/site-packages/paddle/vision/transforms/functional_pil.py:37: DeprecationWarning: BILINEAR is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BILINEAR instead.
'bilinear': Image.BILINEAR,
/home/conda/envs/pdpd/lib/python3.8/site-packages/paddle/vision/transforms/functional_pil.py:38: DeprecationWarning: BICUBIC is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BICUBIC instead.
'bicubic': Image.BICUBIC,
/home/conda/envs/pdpd/lib/python3.8/site-packages/paddle/vision/transforms/functional_pil.py:39: DeprecationWarning: BOX is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BOX instead.
'box': Image.BOX,
/home/conda/envs/pdpd/lib/python3.8/site-packages/paddle/vision/transforms/functional_pil.py:40: DeprecationWarning: LANCZOS is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.LANCZOS instead.
'lanczos': Image.LANCZOS,
/home/conda/envs/pdpd/lib/python3.8/site-packages/paddle/vision/transforms/functional_pil.py:41: DeprecationWarning: HAMMING is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.HAMMING instead.
'hamming': Image.HAMMING
/home/conda/envs/pdpd/lib/python3.8/site-packages/paddle/tensor/creation.py:130: DeprecationWarning: np.object is a deprecated alias for the builtin object. To silence this warning, use object by itself. Doing this will not modify any behavior and is safe.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
if data.dtype == np.object:
server not ready, wait 3 sec to retry...
not ready endpoints:['127.0.0.1:43046', '127.0.0.1:39758', '127.0.0.1:21007', '127.0.0.1:58682']
server not ready, wait 3 sec to retry...
not ready endpoints:['127.0.0.1:43046', '127.0.0.1:39758', '127.0.0.1:21007', '127.0.0.1:58682']
server not ready, wait 3 sec to retry...
not ready endpoints:['127.0.0.1:58682']
I0805 17:41:05.297622 9057 nccl_context.cc:74] init nccl context nranks: 5 local rank: 0 gpu id: 0 ring id: 0
W0805 17:41:08.299757 9057 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 8.0, Driver API Version: 11.3, Runtime API Version: 11.0
W0805 17:41:08.331398 9057 device_context.cc:465] device: 0, cuDNN Version: 8.2.
loading annotations into memory...
Done (t=7.74s)
creating index...
index created!
[08/05 17:46:44] ppdet.utils.checkpoint INFO: ['yolo_head.anchor_points', 'yolo_head.stride_tensor'] in pretrained weight is not used in the model, and its will not be loaded
[08/05 17:46:44] ppdet.utils.checkpoint INFO: The shape [80] in pretrained weight yolo_head.pred_cls.0.bias is unmatched with the shape [1] in model yolo_head.pred_cls.0.bias. And the weight yolo_head.pred_cls.0.bias will not be loaded
[08/05 17:46:44] ppdet.utils.checkpoint INFO: The shape [80, 576, 3, 3] in pretrained weight yolo_head.pred_cls.0.weight is unmatched with the shape [1, 576, 3, 3] in model yolo_head.pred_cls.0.weight. And the weight yolo_head.pred_cls.0.weight will not be loaded
[08/05 17:46:44] ppdet.utils.checkpoint INFO: The shape [80] in pretrained weight yolo_head.pred_cls.1.bias is unmatched with the shape [1] in model yolo_head.pred_cls.1.bias. And the weight yolo_head.pred_cls.1.bias will not be loaded
[08/05 17:46:44] ppdet.utils.checkpoint INFO: The shape [80, 288, 3, 3] in pretrained weight yolo_head.pred_cls.1.weight is unmatched with the shape [1, 288, 3, 3] in model yolo_head.pred_cls.1.weight. And the weight yolo_head.pred_cls.1.weight will not be loaded
[08/05 17:46:44] ppdet.utils.checkpoint INFO: The shape [80] in pretrained weight yolo_head.pred_cls.2.bias is unmatched with the shape [1] in model yolo_head.pred_cls.2.bias. And the weight yolo_head.pred_cls.2.bias will not be loaded
[08/05 17:46:44] ppdet.utils.checkpoint INFO: The shape [80, 144, 3, 3] in pretrained weight yolo_head.pred_cls.2.weight is unmatched with the shape [1, 144, 3, 3] in model yolo_head.pred_cls.2.weight. And the weight yolo_head.pred_cls.2.weight will not be loaded
[08/05 17:46:44] ppdet.utils.checkpoint INFO: Finish loading model weights: /home/general_human_detect/ppyoloe_a100_02/ppyoloe_crn_m_300e_coco.pdparams
C++ Traceback (most recent call last):
0 void paddle::memory::Copy<paddle::platform::CPUPlace, paddle::platform::CUDAPlace>(paddle::platform::CPUPlace, void*, paddle::platform::CUDAPlace, void const*, unsigned long, CUstream_st*) 1 paddle::platform::GpuMemcpySync(void*, void const*, unsigned long, cudaMemcpyKind)
Error Message Summary:
FatalError: Termination signal is detected by the operating system.
[TimeInfo: *** Aborted at 1659721626 (unix time) try "date -d @1659721626" if you are using GNU date ***]
[SignalInfo: *** SIGTERM (@0x22f1) received by PID 9057 (TID 0x7f13fe8336c0) from PID 8945 ***]
/home/conda/envs/pdpd/lib/python3.8/site-packages/paddle/vision/transforms/functional_pil.py:36: DeprecationWarning: NEAREST is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.NEAREST or Dither.NONE instead.
'nearest': Image.NEAREST,
/home/conda/envs/pdpd/lib/python3.8/site-packages/paddle/vision/transforms/functional_pil.py:37: DeprecationWarning: BILINEAR is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BILINEAR instead.
'bilinear': Image.BILINEAR,
/home/conda/envs/pdpd/lib/python3.8/site-packages/paddle/vision/transforms/functional_pil.py:38: DeprecationWarning: BICUBIC is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BICUBIC instead.
'bicubic': Image.BICUBIC,
/home/conda/envs/pdpd/lib/python3.8/site-packages/paddle/vision/transforms/functional_pil.py:39: DeprecationWarning: BOX is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BOX instead.
'box': Image.BOX,
/home/conda/envs/pdpd/lib/python3.8/site-packages/paddle/vision/transforms/functional_pil.py:40: DeprecationWarning: LANCZOS is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.LANCZOS instead.
'lanczos': Image.LANCZOS,
/home/conda/envs/pdpd/lib/python3.8/site-packages/paddle/vision/transforms/functional_pil.py:41: DeprecationWarning: HAMMING is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.HAMMING instead.
'hamming': Image.HAMMING
/home/conda/envs/pdpd/lib/python3.8/site-packages/paddle/tensor/creation.py:130: DeprecationWarning: np.object is a deprecated alias for the builtin object. To silence this warning, use object by itself. Doing this will not modify any behavior and is safe.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
if data.dtype == np.object:
server not ready, wait 3 sec to retry...
not ready endpoints:['127.0.0.1:50274', '127.0.0.1:29379', '127.0.0.1:53201', '127.0.0.1:27633']
server not ready, wait 3 sec to retry...
not ready endpoints:['127.0.0.1:29379', '127.0.0.1:53201', '127.0.0.1:27633']
server not ready, wait 3 sec to retry...
not ready endpoints:['127.0.0.1:27633']
I0805 17:51:27.291689 13459 nccl_context.cc:74] init nccl context nranks: 5 local rank: 0 gpu id: 0 ring id: 0
W0805 17:51:30.223253 13459 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 8.0, Driver API Version: 11.3, Runtime API Version: 11.0
W0805 17:51:30.263077 13459 device_context.cc:465] device: 0, cuDNN Version: 8.2.
loading annotations into memory...
Done (t=7.87s)
creating index...
index created!
[08/05 17:57:14] ppdet.utils.checkpoint INFO: ['yolo_head.anchor_points', 'yolo_head.stride_tensor'] in pretrained weight is not used in the model, and its will not be loaded
[08/05 17:57:14] ppdet.utils.checkpoint INFO: The shape [80] in pretrained weight yolo_head.pred_cls.0.bias is unmatched with the shape [1] in model yolo_head.pred_cls.0.bias. And the weight yolo_head.pred_cls.0.bias will not be loaded
[08/05 17:57:14] ppdet.utils.checkpoint INFO: The shape [80, 576, 3, 3] in pretrained weight yolo_head.pred_cls.0.weight is unmatched with the shape [1, 576, 3, 3] in model yolo_head.pred_cls.0.weight. And the weight yolo_head.pred_cls.0.weight will not be loaded
[08/05 17:57:14] ppdet.utils.checkpoint INFO: The shape [80] in pretrained weight yolo_head.pred_cls.1.bias is unmatched with the shape [1] in model yolo_head.pred_cls.1.bias. And the weight yolo_head.pred_cls.1.bias will not be loaded
[08/05 17:57:14] ppdet.utils.checkpoint INFO: The shape [80, 288, 3, 3] in pretrained weight yolo_head.pred_cls.1.weight is unmatched with the shape [1, 288, 3, 3] in model yolo_head.pred_cls.1.weight. And the weight yolo_head.pred_cls.1.weight will not be loaded
[08/05 17:57:14] ppdet.utils.checkpoint INFO: The shape [80] in pretrained weight yolo_head.pred_cls.2.bias is unmatched with the shape [1] in model yolo_head.pred_cls.2.bias. And the weight yolo_head.pred_cls.2.bias will not be loaded
[08/05 17:57:14] ppdet.utils.checkpoint INFO: The shape [80, 144, 3, 3] in pretrained weight yolo_head.pred_cls.2.weight is unmatched with the shape [1, 144, 3, 3] in model yolo_head.pred_cls.2.weight. And the weight yolo_head.pred_cls.2.weight will not be loaded
[08/05 17:57:15] ppdet.utils.checkpoint INFO: Finish loading model weights: /home/general_human_detect/ppyoloe_a100_02/ppyoloe_crn_m_300e_coco.pdparams
[08/05 17:57:29] ppdet.engine INFO: Epoch: [0] [ 0/1916] learning_rate: 0.000000 loss: 8.302918 loss_cls: 6.040267 loss_iou: 0.518179 loss_dfl: 1.934408 loss_l1: 0.585996 eta: 94 days, 12:16:10 batch_cost: 14.2063 data_cost: 1.7656 ips: 0.8447 images/s
Traceback (most recent call last):
File "tools/train.py", line 177, in
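I am training on 8 A100 cards. At the start, each card used at most 30 GB of its 40 GB. Judging from the log, the error occurred after a little more than 3 hours of training, and no other program was using the GPUs during that time. I ran inference with the latest saved checkpoint, 9.pdparams, and it looks normal: the model detects targets.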

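Judging from the error message, this may be caused by insufficient system resources or a problem with the system environment. Could you resume the training and see whether it continues?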
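I resumed using the same command with --amp, and training continued. After roughly 10 more hours it failed with the same kind of error, and resuming again let training continue:
std::bad_alloc: cudaErrorMemoryAllocation: out of memory. (at /paddle/paddle/fluid/imperative/tracer.cc:221)
I then ran inference with the trained model, and the results also look correct. I noticed that GPU memory usage grows considerably some time after the training job starts. The comparison screenshots cover the period from shortly after 1 o'clock to shortly after 8 o'clock. I have exclusive use of all 8 cards and confirmed that no other program was using GPU memory.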
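To back up an observation like this with numbers rather than screenshots, per-GPU memory use can be logged at a fixed interval while training runs. The following is a minimal sketch, assuming nvidia-smi is on PATH; the function names are hypothetical and not part of PaddleDetection.

```python
import subprocess
import time

# nvidia-smi query that prints one "index, used_MiB" line per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_memory_used(csv_text):
    """Turn nvidia-smi CSV output into {gpu_index: used_MiB}."""
    usage = {}
    for line in csv_text.strip().splitlines():
        index, used = (field.strip() for field in line.split(","))
        usage[int(index)] = int(used)
    return usage


def sample_memory_used():
    """Run nvidia-smi once and return the parsed per-GPU usage."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory_used(result.stdout)


def monitor(interval_s=60, samples=5):
    """Print a timestamped usage sample every interval_s seconds."""
    for _ in range(samples):
        print(time.strftime("%Y-%m-%d %H:%M:%S"), sample_memory_used())
        time.sleep(interval_s)
```

Running monitor() in a separate terminal during training and keeping its output would show whether usage climbs steadily toward the 40 GB limit before the out-of-memory error appears.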
Try setting export FLAGS_allocator_strategy=naive_best_fit and then run the training again.
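The flag is an environment variable, so it has to be set in the environment of the training processes before they start. As a rough sketch, assuming the job is started from a Python wrapper rather than directly from the shell, the environment could be prepared as follows; the helper name is hypothetical, and the launch command in the comment uses placeholders for the config and checkpoint paths.

```python
import os


def env_with_allocator(strategy="naive_best_fit", base=None):
    """Return a copy of the environment with FLAGS_allocator_strategy set.

    base defaults to the current process environment; the original
    mapping is left unmodified.
    """
    env = dict(os.environ if base is None else base)
    env["FLAGS_allocator_strategy"] = strategy
    return env


# Hypothetical usage (paths are placeholders, not taken from the thread):
#   import subprocess
#   subprocess.run(
#       ["python", "-m", "paddle.distributed.launch", "tools/train.py",
#        "-c", "<config.yml>", "--amp", "-r", "<checkpoint>"],
#       env=env_with_allocator(),
#       check=True,
#   )
```

Running the export line in the shell before the usual launch command has the same effect; the wrapper only makes the setting explicit for each run.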