
Processing the DIV2K dataset yields 32,000 images, and training on them directly raises an error.

Open kongdebug opened this issue 3 years ago • 5 comments

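DIV2K has 800 images. After processing them with process_div2k_data.py, I get 32,000 images. Training starts fine, but after 5,000 iterations, once validation has run and the model has been saved, it fails with a DataLoader-related error. Resuming training with --resume works without problems. What is causing this?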

kongdebug avatar Jul 21 '21 07:07 kongdebug

Could you post your directory structure after processing, along with a screenshot of the error?

LielinJiang avatar Jul 22 '21 02:07 LielinJiang

Could you post your directory structure after processing, along with a screenshot of the error?

The processed directory matches the tutorial, and training runs normally at first. Partway through, though, it fails with this error:

ERROR: Unexpected BUS error encountered in DataLoader worker. This might be caused by insufficient shared memory (shm), please check whether use_shared_memory is set and storage space in /dev/shm is enough
ERROR:root:DataLoader reader thread raised an exception!
Traceback (most recent call last):
  File "tools/main.py", line 56, in <module>
    main(args, cfg)
  File "tools/main.py", line 46, in main
    trainer.train()
  File "/home/aistudio/PaddleGAN/ppgan/engine/trainer.py", line 170, in train
Exception in thread Thread-10:
Traceback (most recent call last):
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 482, in _get_data
    data = self._data_queue.get(timeout=self._timeout)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/multiprocessing/queues.py", line 105, in get
    raise Empty
_queue.Empty
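The error message suggests checking the free space in /dev/shm. A minimal sketch of such a check, using only the Python standard library, might look like the following. The helper names `shm_free_bytes` and `has_enough_shm` are hypothetical and are not part of Paddle or PaddleGAN.

```python
import os
import shutil


def shm_free_bytes(path="/dev/shm"):
    """Return the free bytes on the given mount, or None if the path is absent."""
    if not os.path.isdir(path):
        return None
    return shutil.disk_usage(path).free


def has_enough_shm(required_bytes, path="/dev/shm"):
    """True if the mount exists and has at least required_bytes free."""
    free = shm_free_bytes(path)
    return free is not None and free >= required_bytes


if __name__ == "__main__":
    free = shm_free_bytes()
    if free is None:
        print("/dev/shm is not available on this system")
    else:
        print(f"/dev/shm free space: {free / 2**20:.1f} MiB")
```

Running it before training and again after the failure would show whether shared memory is filling up over time, which is what a leak would look like.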

ZivKidd avatar Jul 22 '21 04:07 ZivKidd

Are you running this on the host machine or inside Docker?

LielinJiang avatar Jul 22 '21 04:07 LielinJiang

Are you running this on the host machine or inside Docker?

I cloned PaddleGAN directly on AI Studio and ran it there.

kongdebug avatar Jul 22 '21 06:07 kongdebug

Oh, okay. This looks like a known DataLoader shared-memory leak. https://github.com/PaddlePaddle/PaddleGAN/blob/develop/configs/realsr_bicubic_noise_x4_df2k.yaml#L41 Please try adding the following line right after that line and see whether it helps:

use_shared_memory: False

We will verify this on our side and look into fixing it in the next release.
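For anyone who builds the config in code instead of editing the YAML by hand, the same change can be sketched on a plain dictionary. This assumes the key belongs under the training dataset section; the `disable_shared_memory` helper and the sample values below are hypothetical stand-ins, not PaddleGAN's actual config.

```python
import copy


def disable_shared_memory(cfg, split="train"):
    """Return a copy of cfg with use_shared_memory set to False for one dataset split."""
    new_cfg = copy.deepcopy(cfg)
    # Assumption: dataset options live under cfg["dataset"][split].
    new_cfg.setdefault("dataset", {}).setdefault(split, {})["use_shared_memory"] = False
    return new_cfg


# Hypothetical stand-in for a parsed YAML config; the values are illustrative only.
cfg = {"dataset": {"train": {"name": "SRDataset", "num_workers": 4, "batch_size": 16}}}
patched = disable_shared_memory(cfg)
```

The helper returns a deep copy, so the original config stays unchanged and the two settings can be compared across runs.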

LielinJiang avatar Jul 22 '21 08:07 LielinJiang

This issue is quite old. If you need image or video generation, you can use the new cross-modal toolkit: https://github.com/PaddlePaddle/PaddleMIX/tree/develop

JunnYu avatar Feb 29 '24 03:02 JunnYu