StyleGAN multi-GPU job hits a DataLoader segmentation fault
Environment
- PaddleGAN: Commit 97f96b9
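- AI Studio script task, default environment
- NumPy downgraded to 1.15 (the scikit-image version in the AI Studio environment is too old and hard to upgrade, so NumPy has to be downgraded)
Problem
Training StyleGAN2 on 4 GPUs in an AI Studio script task fails with a DataLoader segmentation fault.
Log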
W0627 16:19:49.616700 868 device_context.cc:404] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 10.1, Runtime API Version: 10.1
W0627 16:19:49.621624 868 device_context.cc:422] device: 0, cuDNN Version: 7.6.
W0627 16:19:59.556109 868 gen_comm_id_helper.cc:120] connect addr=127.0.0.1:54470 failed 1 times with reason: Connection refused retry after 0.5 seconds
I0627 16:20:00.056545 868 nccl_context.cc:74] init nccl context nranks: 4 local rank: 0 gpu id: 0 ring id: 0
ERROR: Unexpected segmentation fault encountered in DataLoader workers.
ERROR:root:DataLoader reader thread raised an exception!Traceback (most recent call last):
File "tools/main.py", line 56, in <module>
main(args, cfg)
File "tools/main.py", line 46, in main
trainer.train()
File "/root/paddlejob/workspace/code/PaddleGAN-develop/ppgan/engine/trainer.py", line 181, in train
data = next(iter_loader)
File "/root/paddlejob/workspace/code/PaddleGAN-develop/ppgan/engine/trainer.py", line 44, in __next__
Exception in thread Thread-1:
Traceback (most recent call last):
File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 482, in _get_data
data = self._data_queue.get(timeout=self._timeout)
File "/opt/_internal/cpython-3.7.0/lib/python3.7/multiprocessing/queues.py", line 105, in get
raise Empty
_queue.Empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/_internal/cpython-3.7.0/lib/python3.7/threading.py", line 917, in _bootstrap_inner
self.run()
File "/opt/_internal/cpython-3.7.0/lib/python3.7/threading.py", line 865, in run
self._target(*self._args, **self._kwargs)
File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 411, in _thread_loop
batch = self._get_data()
File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 498, in _get_data
"pids: {}".format(len(failed_workers), pids))
RuntimeError: DataLoader 1 workers exit unexpectedly, pids: 1063
data = next(self.iter_loader)
File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 585, in __next__
data = self._reader.read_next_var_list()
SystemError: (Fatal) Blocking queue is killed because the data reader raises an exception.
[Hint: Expected killed_ != true, but received killed_:1 == true:1.] (at /paddle/paddle/fluid/operators/reader/blocking_queue.h:166)
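The same problem occurs after switching to a Notebook and training on a single GPU. The training set has about 50k images; with fewer images the error does not occur.
Thanks for the feedback. We will try to reproduce it first.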
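I checked again yesterday and suspect the image format. The original image file names contained Chinese and Japanese characters, so the preprocessing step exported the images with OpenCV like this: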
cv2.imencode('.png', image)[1].tofile(save_path)
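The whole dataset has 50k PNG images, about 5 GB in total. This data triggers the segmentation fault.
If the files are renamed to purely numeric names and exported as JPEG with OpenCV's imwrite, the dataset is only about 1 GB and the segmentation fault does not occur.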
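If the PNG files themselves are under suspicion, one rough way to narrow things down is to check their container structure without going through OpenCV. The sketch below is not part of PaddleGAN; `check_png_bytes` and `check_png_file` are hypothetical helper names. It only verifies the PNG signature, chunk lengths, and chunk CRCs, so a file that passes can still fail to decode, and nothing in this thread confirms that file corruption is the actual cause.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def check_png_bytes(data):
    """Return None if the PNG chunk structure is intact, else a short reason."""
    if not data.startswith(PNG_SIGNATURE):
        return "missing PNG signature"
    pos = len(PNG_SIGNATURE)
    while pos < len(data):
        # Each chunk is: 4-byte big-endian length, 4-byte type, payload, 4-byte CRC.
        if pos + 8 > len(data):
            return "truncated chunk header"
        (length,) = struct.unpack(">I", data[pos:pos + 4])
        chunk_type = data[pos + 4:pos + 8]
        end = pos + 8 + length + 4
        if end > len(data):
            return "truncated chunk %r" % chunk_type
        payload = data[pos + 8:pos + 8 + length]
        (stored_crc,) = struct.unpack(">I", data[end - 4:end])
        # The CRC covers the chunk type and the payload, not the length field.
        if zlib.crc32(chunk_type + payload) & 0xFFFFFFFF != stored_crc:
            return "bad CRC in chunk %r" % chunk_type
        if chunk_type == b"IEND":
            return None
        pos = end
    return "no IEND chunk"


def check_png_file(path):
    # Python's open() accepts non-ASCII file names directly.
    with open(path, "rb") as f:
        return check_png_bytes(f.read())
```

Running `check_png_file` over every file in the training folder and printing the paths that return a reason would show whether any of the 50k exports were truncated or damaged on disk.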
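OK. If you suspect the image loading, you can iterate over the dataset right after this statement, https://github.com/PaddlePaddle/PaddleGAN/blob/develop/ppgan/datasets/builder.py#L36, to check for errors:
for data in dataset:
    pass
Alternatively, set the number of workers to 0, which also makes it easier to see whether the error comes from reading the data.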
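A slightly more detailed version of that loop can record which samples fail instead of stopping at the first one. This is a minimal sketch, assuming the dataset supports `len()` and integer indexing; `find_bad_samples` is a hypothetical helper, not a PaddleGAN function.

```python
import traceback


def find_bad_samples(dataset, max_report=20):
    """Index every sample once and collect (index, error text) for failures."""
    failures = []
    for index in range(len(dataset)):
        try:
            dataset[index]
        except Exception:
            # Keep only the last traceback line so the report stays short.
            last_line = traceback.format_exc().strip().splitlines()[-1]
            failures.append((index, last_line))
            if len(failures) >= max_report:
                break
    return failures
```

A true segmentation fault in native code kills the process and cannot be caught by `try`/`except`. In that case, calling `faulthandler.enable()` from the standard library before the loop makes Python print its stack at the moment of the crash, which helps identify the sample being loaded.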
After switching to JPEG, training hit an error at iteration 5000:
[06/28 22:56:17] ppgan.engine.trainer INFO: Iter: 4950/800000 lr: 1.600e-03 l_d: 0.904 real_score: 3.060 fake_score: 0.183 l_g: 4.119 l_g_path: 0.003 path_length: 0.287 l_d_r1: 1.740 batch_cost: 1.70931 sec reader_cost: 0.00025 sec ips: 0.58503 images/s eta: 15 days, 17:29:49
INFO 2021-06-28 22:57:49,182 launch_utils.py:327] terminate all the procs
ERROR 2021-06-28 22:57:49,182 launch_utils.py:584] ABORT!!! Out of all 4 trainers, the trainer process with rank=[0, 1, 2, 3] was aborted. Please check its log.
INFO 2021-06-28 22:57:52,185 launch_utils.py:327] terminate all the procs
[06/28 22:57:42] ppgan.engine.trainer INFO: Iter: 5000/800000 lr: 1.600e-03 l_d: 0.245 real_score: 1.648 fake_score: -2.812 l_g: 2.259 l_g_path: 0.000 path_length: 0.268 l_d_r1: 2.520 batch_cost: 1.71203 sec reader_cost: 0.00026 sec ips: 0.58410 images/s eta: 15 days, 18:04:23
Traceback (most recent call last):
File "/root/paddlejob/workspace/code/PaddleGAN-develop/ppgan/utils/config.py", line 24, in __getattr__
return self[key]
KeyError: 'test'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "tools/main.py", line 56, in <module>
main(args, cfg)
File "tools/main.py", line 46, in main
trainer.train()
File "/root/paddlejob/workspace/code/PaddleGAN-develop/ppgan/engine/trainer.py", line 209, in train
self.test()
File "/root/paddlejob/workspace/code/PaddleGAN-develop/ppgan/engine/trainer.py", line 219, in test
self.test_dataloader = build_dataloader(self.cfg.dataset.test,
File "/root/paddlejob/workspace/code/PaddleGAN-develop/ppgan/utils/config.py", line 26, in __getattr__
raise AttributeError(key)
AttributeError: test
/mnt
[INFO]: train job failed! train_ret: 1
After commenting out this part of the config file:
validate:
  interval: 5000
  save_imig: False
  metrics:
    fid: # metric name, can be arbitrary
      name: FID
      batch_size: 4
the problem goes away. I don't know why.
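Judging from the traceback, `trainer.train()` calls `self.test()` at iteration 5000, which matches `validate.interval`, and `test()` then reads `cfg.dataset.test`, which this config does not define. A pre-flight check along the following lines would surface the mismatch before training starts. It is only a sketch: the config is modeled as a plain dict, and `check_validate_config` is a hypothetical helper, not something PaddleGAN provides.

```python
def check_validate_config(cfg):
    """Return a list of problems that would only surface at the first validation step."""
    problems = []
    validate = cfg.get("validate")
    if validate is None:
        return problems  # validation disabled, nothing to check
    dataset = cfg.get("dataset") or {}
    if "test" not in dataset:
        problems.append(
            "validate is enabled (interval=%s) but dataset.test is not defined"
            % validate.get("interval")
        )
    return problems
```

With the config from this report, the check would return one message, since `validate` is present and `dataset` has no `test` entry. Commenting out `validate` makes the list empty, which is consistent with the workaround described above.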
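OK. When I have time I'll dig up the original images and narrow down the cause, to see whether the problem is in the images themselves or in the program.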
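This looks like a bug. We will fix it as soon as possible.
This issue is quite old. If you need image or video generation, you can use the new cross-modal toolkit: https://github.com/PaddlePaddle/PaddleMIX/tree/develop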