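PaddleGAN StyleGAN multi-GPU job hits a DataLoader segmentation fault

Environment

PaddleGAN: Commit 97f96b9
AI Studio script task, default environment
NumPy downgraded to 1.15 (the scikit-image version in the AI Studio environment is too old and hard to upgrade, so NumPy has to be downgraded)

Problem

When training StyleGAN2 on 4 GPUs in an AI Studio script task, a DataLoader segmentation fault is reported.

Logs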

W0627 16:19:49.616700   868 device_context.cc:404] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 10.1, Runtime API Version: 10.1
W0627 16:19:49.621624   868 device_context.cc:422] device: 0, cuDNN Version: 7.6.
W0627 16:19:59.556109   868 gen_comm_id_helper.cc:120] connect addr=127.0.0.1:54470 failed 1 times with reason: Connection refused retry after 0.5 seconds
I0627 16:20:00.056545   868 nccl_context.cc:74] init nccl context nranks: 4 local rank: 0 gpu id: 0 ring id: 0
ERROR: Unexpected segmentation fault encountered in DataLoader workers.
 ERROR:root:DataLoader reader thread raised an exception!Traceback (most recent call last):

  File "tools/main.py", line 56, in <module>
    main(args, cfg)
  File "tools/main.py", line 46, in main
    trainer.train()
  File "/root/paddlejob/workspace/code/PaddleGAN-develop/ppgan/engine/trainer.py", line 181, in train
    data = next(iter_loader)
  File "/root/paddlejob/workspace/code/PaddleGAN-develop/ppgan/engine/trainer.py", line 44, in __next__
    Exception in thread Thread-1:
Traceback (most recent call last):
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 482, in _get_data
    data = self._data_queue.get(timeout=self._timeout)
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/multiprocessing/queues.py", line 105, in get
    raise Empty
_queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/threading.py", line 865, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 411, in _thread_loop
    batch = self._get_data()
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 498, in _get_data
    "pids: {}".format(len(failed_workers), pids))
RuntimeError: DataLoader 1 workers exit unexpectedly, pids: 1063
data = next(self.iter_loader)

  File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 585, in __next__
    data = self._reader.read_next_var_list()
SystemError: (Fatal) Blocking queue is killed because the data reader raises an exception.
  [Hint: Expected killed_ != true, but received killed_:1 == true:1.] (at /paddle/paddle/fluid/operators/reader/blocking_queue.h:166)

Jun 27 '21 08:06 KernelErr

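After switching to a Notebook, single-GPU training shows the same problem. The training set has about 50k images; with fewer images, the error does not occur.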

Jun 27 '21 08:06 KernelErr

Thanks for the feedback. We'll try to reproduce it first.

Jun 29 '21 02:06 LielinJiang

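I checked again yesterday and suspect the image format is the problem. The original image filenames contain Chinese and Japanese characters, so during preprocessing I exported the images with OpenCV as follows:

cv2.imencode('.png', image)[1].tofile(save_path)

The whole dataset is 50k PNG images, 5 GB in total. This data triggers the segmentation fault.

If I rename the files to purely numeric names, switch to JPG, and export with OpenCV's imwrite, the dataset is only 1 GB and no segmentation fault occurs.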
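
The renaming half of this workaround can be sketched with the standard library alone. This is a minimal sketch, not the script actually used in this thread: `plan_numeric_names`, its parameters, and the zero-padded naming scheme are all hypothetical, and the function only plans the mapping without touching any file.

```python
from pathlib import Path


def plan_numeric_names(src_dir, dst_dir, src_ext=".png", dst_ext=".jpg"):
    """Map every source image to a zero-padded, ASCII-only target name.

    Hypothetical helper: only builds (source, target) pairs, it does not
    read, convert, or move any file.
    """
    # Sort so the numbering is reproducible between runs.
    sources = sorted(
        p for p in Path(src_dir).iterdir() if p.suffix.lower() == src_ext
    )
    # Pad to a fixed width so the names also sort correctly as strings.
    width = max(len(str(len(sources))), 1)
    return [
        (src, Path(dst_dir) / f"{i:0{width}d}{dst_ext}")
        for i, src in enumerate(sources)
    ]
```

Each pair would then be re-encoded separately, for example by decoding the source and writing the target with OpenCV's imwrite as described above. That step is left out here because it depends on OpenCV being installed.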

Jun 29 '21 02:06 KernelErr

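OK. If you suspect an image-reading problem, you can iterate over the dataset right after the statement at https://github.com/PaddlePaddle/PaddleGAN/blob/develop/ppgan/datasets/builder.py#L36 to check for errors:

for data in dataset:
    pass

Alternatively, set the number of workers to 0, which makes it easier to see directly whether the error comes from reading the data.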
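
A slightly more verbose variant of the loop above can report which sample fails. This is a minimal sketch, assuming the dataset supports `len()` and integer indexing; `find_bad_samples` and `max_errors` are hypothetical names, not part of PaddleGAN.

```python
import traceback


def find_bad_samples(dataset, max_errors=10):
    """Load samples one by one and collect the indices that raise.

    Hypothetical helper: `dataset` is assumed to support len() and
    integer indexing.
    """
    bad = []
    for idx in range(len(dataset)):
        # Print before loading: a segfault kills the process and cannot be
        # caught by try/except, so the last printed index points at the sample.
        print(f"checking sample {idx}", flush=True)
        try:
            dataset[idx]
        except Exception as exc:
            traceback.print_exc()
            bad.append((idx, repr(exc)))
            if len(bad) >= max_errors:
                break
    return bad
```

Running this in the main process, without worker subprocesses, keeps any Python exception and its traceback in the same log as the index that triggered it.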

Jun 29 '21 02:06 LielinJiang

After switching to JPG, training hit an error at Iter 5000:

[06/28 22:56:17] ppgan.engine.trainer INFO: Iter: 4950/800000 lr: 1.600e-03 l_d: 0.904 real_score: 3.060 fake_score: 0.183 l_g: 4.119 l_g_path: 0.003 path_length: 0.287 l_d_r1: 1.740 batch_cost: 1.70931 sec reader_cost: 0.00025 sec ips: 0.58503 images/s eta: 15 days, 17:29:49
INFO 2021-06-28 22:57:49,182 launch_utils.py:327] terminate all the procs
ERROR 2021-06-28 22:57:49,182 launch_utils.py:584] ABORT!!! Out of all 4 trainers, the trainer process with rank=[0, 1, 2, 3] was aborted. Please check its log.
INFO 2021-06-28 22:57:52,185 launch_utils.py:327] terminate all the procs
[06/28 22:57:42] ppgan.engine.trainer INFO: Iter: 5000/800000 lr: 1.600e-03 l_d: 0.245 real_score: 1.648 fake_score: -2.812 l_g: 2.259 l_g_path: 0.000 path_length: 0.268 l_d_r1: 2.520 batch_cost: 1.71203 sec reader_cost: 0.00026 sec ips: 0.58410 images/s eta: 15 days, 18:04:23
Traceback (most recent call last):
  File "/root/paddlejob/workspace/code/PaddleGAN-develop/ppgan/utils/config.py", line 24, in __getattr__
    return self[key]
KeyError: 'test'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "tools/main.py", line 56, in <module>
    main(args, cfg)
  File "tools/main.py", line 46, in main
    trainer.train()
  File "/root/paddlejob/workspace/code/PaddleGAN-develop/ppgan/engine/trainer.py", line 209, in train
    self.test()
  File "/root/paddlejob/workspace/code/PaddleGAN-develop/ppgan/engine/trainer.py", line 219, in test
    self.test_dataloader = build_dataloader(self.cfg.dataset.test,
  File "/root/paddlejob/workspace/code/PaddleGAN-develop/ppgan/utils/config.py", line 26, in __getattr__
    raise AttributeError(key)
AttributeError: test
/mnt
[INFO]: train job failed! train_ret: 1

After commenting out the following section of the config file:

validate:
  interval: 5000
  save_imig: False
  metrics:
    fid: # metric name, can be arbitrary
      name: FID
      batch_size: 4

the problem no longer occurs. I don't know why.
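
The traceback above suggests an explanation: at Iter 5000, which matches the validate interval, trainer.train() calls self.test(), and test() reads self.cfg.dataset.test, which this config does not define. A pre-flight check along these lines could catch the mismatch before training starts. This is a minimal sketch on a plain nested dict; `check_validate_config` is a hypothetical helper, not a PaddleGAN function.

```python
def check_validate_config(cfg):
    """Return a list of problems that would break periodic validation.

    Hypothetical helper: `cfg` is a plain nested dict standing in for the
    parsed YAML config.
    """
    problems = []
    validate = cfg.get("validate")
    if not validate:
        # No periodic validation configured, so no test dataset is needed.
        return problems
    dataset = cfg.get("dataset") or {}
    if "test" not in dataset:
        problems.append(
            "validate is enabled (interval=%s) but dataset.test is missing"
            % validate.get("interval")
        )
    return problems


# A config shaped like the one in this report: validate is set, dataset.test is not.
cfg = {"dataset": {"train": {}}, "validate": {"interval": 5000}}
print(check_validate_config(cfg))
```

Under this reading, either adding a test dataset section or removing the validate block, as done above, keeps the trainer from reaching the failing lookup.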

Jun 29 '21 02:06 KernelErr

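OK. When I have time I'll dig up the original images and narrow down the cause, to see whether the images themselves or the program is at fault.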

Jun 29 '21 02:06 KernelErr

This looks like a bug. We'll fix it as soon as possible.

Jun 29 '21 02:06 LielinJiang

This issue is quite old. If you need image or video generation, you can use the new cross-modal toolkit: https://github.com/PaddlePaddle/PaddleMIX/tree/develop

Feb 29 '24 03:02 JunnYu