Training ran for 4 days and then suddenly crashed with an error

Open: Phoenix8215 opened this issue 2 years ago • 1 comment

Checklist:

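  1. Searched previous related issues for an answer
  2. Read through the FAQ summary and Q&A
  3. Confirmed whether the bug is still unfixed in the latest version
  4. Read through the PaddleX documentation

Problem description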

2023-02-12 11:33:49 [INFO]      [TRAIN] Epoch=44/300, Step=1365/2425, loss_xy=4.820615, loss_wh=5.574079, loss_iou=19.460461, loss_iou_aware=4.270789, loss_obj=34.582016, loss_cls=7.113638, loss=75.821602, lr=0.000417, time_each_step=2.14s, eta=369:17:28
2023-02-12 11:34:13 [INFO]      [TRAIN] Epoch=44/300, Step=1375/2425, loss_xy=2.955108, loss_wh=3.437961, loss_iou=12.110320, loss_iou_aware=2.348799, loss_obj=20.727703, loss_cls=3.568110, loss=45.148003, lr=0.000417, time_each_step=2.38s, eta=411:55:13
2023-02-12 11:34:36 [INFO]      [TRAIN] Epoch=44/300, Step=1385/2425, loss_xy=4.763587, loss_wh=8.038626, loss_iou=21.165606, loss_iou_aware=3.268857, loss_obj=26.561720, loss_cls=4.742907, loss=68.541298, lr=0.000417, time_each_step=2.29s, eta=395:54:13
Exception in thread Thread-45:
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.8/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 583, in _get_data
    data = self._data_queue.get(timeout=self._timeout)
  File "/root/miniconda3/lib/python3.8/multiprocessing/queues.py", line 108, in get
    raise Empty
_queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/root/miniconda3/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 505, in _thread_loop
    batch = self._get_data()
  File "/root/miniconda3/lib/python3.8/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 598, in _get_data
    raise RuntimeError("DataLoader {} workers exit unexpectedly, " \
RuntimeError: DataLoader 1 workers exit unexpectedly, pids: 145393
Traceback (most recent call last):
  File "demo.py", line 46, in <module>
    model.train(
  File "/root/miniconda3/lib/python3.8/site-packages/paddlex/cv/models/detector.py", line 323, in train
    self.train_loop(
  File "/root/miniconda3/lib/python3.8/site-packages/paddlex/cv/models/base.py", line 333, in train_loop
    for step, data in enumerate(self.train_data_loader()):
  File "/root/miniconda3/lib/python3.8/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 697, in __next__
    data = self._reader.read_next_var_list()
SystemError: (Fatal) Blocking queue is killed because the data reader raises an exception.
  [Hint: Expected killed_ != true, but received killed_:1 == true:1.] (at /paddle/paddle/fluid/operators/reader/blocking_queue.h:166)

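Reproduction

  1. Have you successfully run the tutorials we provide?

  2. Have you modified the code on top of the tutorial? If so, please provide the code you ran.

  3. Which dataset are you using?

  4. Please provide the error message and the relevant logs.

Environment

  1. Please provide the version numbers of PaddlePaddle and PaddleX that you are using.

  2. Please provide your operating system information, e.g. Linux/Windows/MacOS.

  3. Which Python version are you using?

  4. Which CUDA/cuDNN versions are you using?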

Phoenix8215, Feb 12 '23 10:02

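Judging from the error, reading the sample data failed. Check whether anything was abnormal with the files or the system (for example, disk space) at the time of the failure.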

lailuboy, Feb 24 '23 03:02
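
The check suggested in the reply above (files and disk space) could be sketched roughly as follows. This is only a minimal illustration using the Python standard library, not part of PaddleX or of the original thread. The helper names `free_space_gib` and `find_unreadable_files` and the `dataset` directory are hypothetical, and the script only detects empty or unreadable files, not corrupt image contents.

```python
import os
import shutil


def free_space_gib(path):
    """Return the free disk space at `path` in GiB."""
    usage = shutil.disk_usage(path)
    return usage.free / (1024 ** 3)


def find_unreadable_files(root):
    """Walk `root` and return paths that are empty or cannot be opened and read."""
    bad = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.getsize(path) == 0:
                    bad.append(path)
                    continue
                with open(path, "rb") as f:
                    f.read(1)  # reading the first byte surfaces basic I/O errors
            except OSError:
                bad.append(path)
    return sorted(bad)


if __name__ == "__main__":
    dataset_dir = "dataset"  # hypothetical path; replace with your own dataset root
    print(f"free space: {free_space_gib('.'):.1f} GiB")
    if os.path.isdir(dataset_dir):
        for path in find_unreadable_files(dataset_dir):
            print("problem file:", path)
```

Running this once against the dataset directory and the disk holding the training output is a cheap first step before restarting a multi-day run. It does not rule out other causes of a DataLoader worker exiting, such as the worker process being killed by the system.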