StyleGAN multi-GPU job hits a DataLoader segmentation fault

Open KernelErr opened this issue 4 years ago • 7 comments

Environment

  • PaddleGAN: Commit 97f96b9
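  • Default environment for AI Studio script tasks
  • NumPy downgraded to 1.15 (the scikit-image version in the AI Studio environment is too old and hard to upgrade, so NumPy has to be downgraded)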

Problem

When training StyleGAN2 on 4 GPUs as an AI Studio script task, a DataLoader segmentation fault is reported.

Log

W0627 16:19:49.616700   868 device_context.cc:404] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 10.1, Runtime API Version: 10.1
W0627 16:19:49.621624   868 device_context.cc:422] device: 0, cuDNN Version: 7.6.
W0627 16:19:59.556109   868 gen_comm_id_helper.cc:120] connect addr=127.0.0.1:54470 failed 1 times with reason: Connection refused retry after 0.5 seconds
I0627 16:20:00.056545   868 nccl_context.cc:74] init nccl context nranks: 4 local rank: 0 gpu id: 0 ring id: 0
ERROR: Unexpected segmentation fault encountered in DataLoader workers.
 ERROR:root:DataLoader reader thread raised an exception!Traceback (most recent call last):

  File "tools/main.py", line 56, in <module>
    main(args, cfg)
  File "tools/main.py", line 46, in main
    trainer.train()
  File "/root/paddlejob/workspace/code/PaddleGAN-develop/ppgan/engine/trainer.py", line 181, in train
    data = next(iter_loader)
  File "/root/paddlejob/workspace/code/PaddleGAN-develop/ppgan/engine/trainer.py", line 44, in __next__
    Exception in thread Thread-1:
Traceback (most recent call last):
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 482, in _get_data
    data = self._data_queue.get(timeout=self._timeout)
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/multiprocessing/queues.py", line 105, in get
    raise Empty
_queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/threading.py", line 865, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 411, in _thread_loop
    batch = self._get_data()
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 498, in _get_data
    "pids: {}".format(len(failed_workers), pids))
RuntimeError: DataLoader 1 workers exit unexpectedly, pids: 1063
data = next(self.iter_loader)

  File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 585, in __next__
    data = self._reader.read_next_var_list()
SystemError: (Fatal) Blocking queue is killed because the data reader raises an exception.
  [Hint: Expected killed_ != true, but received killed_:1 == true:1.] (at /paddle/paddle/fluid/operators/reader/blocking_queue.h:166)

KernelErr avatar Jun 27 '21 08:06 KernelErr

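After switching to a Notebook, single-GPU training hits the same problem. The training set has about 50k images; with fewer images the error does not occur.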

KernelErr avatar Jun 27 '21 08:06 KernelErr

Thanks for the feedback. We'll try to reproduce it first.

LielinJiang avatar Jun 29 '21 02:06 LielinJiang

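I checked again yesterday and suspect the image format is the problem. The original image file names contain Chinese and Japanese characters, so during preprocessing I exported the images with OpenCV like this: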

cv2.imencode('.png', image)[1].tofile(save_path)

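The whole dataset is 50k PNG images, 5 GB in total. Data exported this way triggers the segmentation fault.

If I rename the files to digits only, switch to JPG, and export with OpenCV's imwrite, the dataset is only 1 GB and the segmentation fault does not occur.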
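
Whether the files themselves cause the crash is not confirmed in this thread. One cheap thing to rule out is truncated or mislabeled files. The sketch below uses only the Python standard library and checks that each file starts and ends with the byte markers of a complete PNG or JPEG. `looks_complete` and `find_suspect_files` are hypothetical helper names, and the check is a heuristic: it does not decode the image, and a valid JPEG with trailing bytes after its end marker would be flagged.

```python
from pathlib import Path

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"   # first 8 bytes of every PNG file
PNG_IEND = b"IEND\xaeB`\x82"           # IEND chunk type plus its fixed CRC
JPEG_SOI = b"\xff\xd8"                 # JPEG start-of-image marker
JPEG_EOI = b"\xff\xd9"                 # JPEG end-of-image marker


def looks_complete(data: bytes) -> bool:
    """Heuristic: True if the bytes start and end like a whole PNG or JPEG."""
    if data.startswith(PNG_SIGNATURE):
        return data.endswith(PNG_IEND)
    if data.startswith(JPEG_SOI):
        return data.endswith(JPEG_EOI)
    return False  # neither PNG nor JPEG, whatever the extension says


def find_suspect_files(root):
    """Return image paths under root that fail the heuristic above."""
    suspects = []
    for path in sorted(Path(root).rglob("*")):
        if path.suffix.lower() in {".png", ".jpg", ".jpeg"}:
            if not looks_complete(path.read_bytes()):
                suspects.append(path)
    return suspects
```

Calling `find_suspect_files` on the dataset directory returns the paths worth inspecting by hand. An empty list only means no file is obviously truncated, not that every file decodes correctly.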

KernelErr avatar Jun 29 '21 02:06 KernelErr

OK. If you suspect image reading, you can iterate over the dataset right after this statement, https://github.com/PaddlePaddle/PaddleGAN/blob/develop/ppgan/datasets/builder.py#L36, to check for errors:

for data in dataset:
    pass

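Alternatively, set the number of DataLoader workers to 0, which also makes it easier to see whether the error comes from reading the data.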
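
A slightly more verbose version of that loop can also report which sample fails. This is a minimal sketch, assuming the dataset is map-style (it supports `len()` and integer indexing). `scan_dataset` and `FakeDataset` are hypothetical names used for illustration and are not part of PaddleGAN.

```python
import sys


def scan_dataset(dataset, log=sys.stderr):
    """Read every sample once; return a list of (index, repr(exception))."""
    failures = []
    for index in range(len(dataset)):
        # Print and flush before touching the sample: a segmentation fault
        # cannot be caught by try/except, but the last index printed shows
        # which sample was being read when the process died.
        print(f"reading sample {index}", file=log, flush=True)
        try:
            dataset[index]
        except Exception as exc:  # ordinary Python errors are collected
            failures.append((index, repr(exc)))
    return failures


class FakeDataset:
    """Stand-in dataset whose second sample raises, for demonstration only."""

    def __len__(self):
        return 3

    def __getitem__(self, index):
        if index == 1:
            raise ValueError("corrupt image")
        return index


bad = scan_dataset(FakeDataset())
```

Running the scan in the main process, rather than in DataLoader workers, means any crash or exception shows up directly instead of as "workers exit unexpectedly".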

LielinJiang avatar Jun 29 '21 02:06 LielinJiang

After switching to JPG, training hit an error at iteration 5000:

[06/28 22:56:17] ppgan.engine.trainer INFO: Iter: 4950/800000 lr: 1.600e-03 l_d: 0.904 real_score: 3.060 fake_score: 0.183 l_g: 4.119 l_g_path: 0.003 path_length: 0.287 l_d_r1: 1.740 batch_cost: 1.70931 sec reader_cost: 0.00025 sec ips: 0.58503 images/s eta: 15 days, 17:29:49
INFO 2021-06-28 22:57:49,182 launch_utils.py:327] terminate all the procs
ERROR 2021-06-28 22:57:49,182 launch_utils.py:584] ABORT!!! Out of all 4 trainers, the trainer process with rank=[0, 1, 2, 3] was aborted. Please check its log.
INFO 2021-06-28 22:57:52,185 launch_utils.py:327] terminate all the procs
[06/28 22:57:42] ppgan.engine.trainer INFO: Iter: 5000/800000 lr: 1.600e-03 l_d: 0.245 real_score: 1.648 fake_score: -2.812 l_g: 2.259 l_g_path: 0.000 path_length: 0.268 l_d_r1: 2.520 batch_cost: 1.71203 sec reader_cost: 0.00026 sec ips: 0.58410 images/s eta: 15 days, 18:04:23
Traceback (most recent call last):
  File "/root/paddlejob/workspace/code/PaddleGAN-develop/ppgan/utils/config.py", line 24, in __getattr__
    return self[key]
KeyError: 'test'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "tools/main.py", line 56, in <module>
    main(args, cfg)
  File "tools/main.py", line 46, in main
    trainer.train()
  File "/root/paddlejob/workspace/code/PaddleGAN-develop/ppgan/engine/trainer.py", line 209, in train
    self.test()
  File "/root/paddlejob/workspace/code/PaddleGAN-develop/ppgan/engine/trainer.py", line 219, in test
    self.test_dataloader = build_dataloader(self.cfg.dataset.test,
  File "/root/paddlejob/workspace/code/PaddleGAN-develop/ppgan/utils/config.py", line 26, in __getattr__
    raise AttributeError(key)
AttributeError: test
/mnt
[INFO]: train job failed! train_ret: 1

If I comment out the following in the config file:

validate:
  interval: 5000
  save_imig: False
  metrics:
    fid: # metric name, can be arbitrary
      name: FID
      batch_size: 4

the problem goes away. I don't know why.
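
Reading the traceback, the failing expression is `self.cfg.dataset.test`, so the `dataset` section of the config apparently has no `test` entry, while `validate.interval: 5000` seems to trigger `trainer.test()` at iteration 5000. The sketch below reproduces that failure with a stand-in `AttrDict` (modelled on the two lines of `ppgan/utils/config.py` visible in the traceback, not copied from it) and adds a hypothetical pre-flight check, `validation_problems`, that would surface the mismatch before training starts. The dataset name is made up.

```python
class AttrDict(dict):
    """Dict with attribute access, like the config object in the traceback."""

    def __getattr__(self, key):
        try:
            return self[key]
        except KeyError:
            raise AttributeError(key)


def to_attr(obj):
    """Recursively convert nested dicts to AttrDict."""
    if isinstance(obj, dict):
        return AttrDict({k: to_attr(v) for k, v in obj.items()})
    return obj


def validation_problems(cfg):
    """Return human-readable problems found before training starts."""
    problems = []
    if "validate" in cfg and "test" not in cfg.get("dataset", {}):
        problems.append("validate is set but dataset.test is missing")
    return problems


# Config shaped like the one in this report: validate without dataset.test.
cfg = to_attr({
    "dataset": {"train": {"name": "ExampleDataset"}},  # hypothetical name
    "validate": {"interval": 5000},
})

problems = validation_problems(cfg)
```

With this config, accessing `cfg.dataset.test` raises `AttributeError: test`, matching the log above. Either removing `validate` (as done here) or defining a `dataset.test` section would avoid that code path, assuming the rest of the trainer accepts it.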

KernelErr avatar Jun 29 '21 02:06 KernelErr

OK. When I have time I'll dig up the original images and pin down whether the problem lies in the images themselves or in the program.

KernelErr avatar Jun 29 '21 02:06 KernelErr

The error with validate enabled looks like a bug. We'll fix it as soon as possible.

LielinJiang avatar Jun 29 '21 02:06 LielinJiang

This issue is very old. If you need image or video generation, you can use the new cross-modal toolkit: https://github.com/PaddlePaddle/PaddleMIX/tree/develop

JunnYu avatar Feb 29 '24 03:02 JunnYu