OpenDiT

Data loading fails with an error and only becomes stable after setting num_workers to 0. What is the problem?

Open ersanliqiao opened this issue 1 year ago • 4 comments

Traceback (most recent call last):
  File "xx/OpenDiT-master/train.py", line 383, in <module>
    main(args)
  File "xx/OpenDiT-master/train.py", line 275, in main
    batch = next(dataloader_iter)
  File "/root/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
    data = self._next_data()
  File "/root/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1329, in _next_data
    idx, data = self._get_data()
  File "/root/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1285, in _get_data
    success, data = self._try_get_data()
  File "/root/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1146, in _try_get_data
    raise RuntimeError(f'DataLoader worker (pid(s) {pids_str}) exited unexpectedly') from e
RuntimeError: DataLoader worker (pid(s) 2365846) exited unexpectedly

ersanliqiao avatar Mar 08 '24 02:03 ersanliqiao

How did you launch your script?

KKZ20 avatar Mar 08 '24 07:03 KKZ20

I think there may be a memory leak: memory usage grew beyond 600 GB, so the process was killed.

ersanliqiao avatar Mar 08 '24 12:03 ersanliqiao

So far, what I have found is that when running on several hundred thousand samples, memory usage keeps growing.

ersanliqiao avatar Mar 11 '24 06:03 ersanliqiao

This is related to the torch DataLoader. You can call gc.collect() to avoid this problem.

oahzxl avatar Mar 15 '24 06:03 oahzxl
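
The gc.collect() suggestion above could be sketched roughly as follows. This is only an illustration, not code from OpenDiT: the helper name `batches_with_gc` and the `gc_every` parameter are hypothetical, and a plain generator stands in for the real DataLoader so the snippet runs without torch. Note that gc.collect() only reclaims unreachable reference cycles in the process where it is called, so it may not help if the growth happens inside the worker processes.

```python
import gc
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def batches_with_gc(batches: Iterable[T], gc_every: int = 100) -> Iterator[T]:
    """Yield each item from `batches`, forcing a collection every `gc_every` items.

    Hypothetical helper: `batches` can be any iterable, e.g. a DataLoader.
    """
    if gc_every < 1:
        raise ValueError("gc_every must be >= 1")
    for step, batch in enumerate(batches, start=1):
        yield batch
        # Runs when the consumer asks for the next batch, i.e. after it
        # has finished with the current one.
        if step % gc_every == 0:
            gc.collect()


# Stand-in for a real DataLoader: ten small "batches".
fake_loader = ([i] * 4 for i in range(10))

seen = []
for batch in batches_with_gc(fake_loader, gc_every=3):
    seen.append(batch)  # a real loop would run the training step here
```

In a training script, the same wrapper would go around the existing loader or iterator. How often to collect is a trade-off, since a full collection on every step can slow the loop down noticeably.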