GeneFace
Error while training the Posenet
While training the posenet on Marcon, I got the error below. What might be the cause, and how can I fix it? Thanks!
4104step [00:00, ?step/s]ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
4104step [00:05, ?step/s]
Traceback (most recent call last):
File "/root/miniconda3/envs/geneface/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1011, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/root/miniconda3/envs/geneface/lib/python3.9/queue.py", line 180, in get
self.not_empty.wait(remaining)
File "/root/miniconda3/envs/geneface/lib/python3.9/threading.py", line 316, in wait
gotit = waiter.acquire(True, timeout)
File "/root/miniconda3/envs/geneface/lib/python3.9/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 7162) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/root/autodl-tmp/GeneFace/tasks/run.py", line 19, in <module>
run_task()
File "/root/autodl-tmp/GeneFace/tasks/run.py", line 14, in run_task
task_cls.start()
File "/root/autodl-tmp/GeneFace/utils/commons/base_task.py", line 251, in start
trainer.fit(cls)
File "/root/autodl-tmp/GeneFace/utils/commons/trainer.py", line 122, in fit
self.run_single_process(self.task)
File "/root/autodl-tmp/GeneFace/utils/commons/trainer.py", line 186, in run_single_process
self.train()
File "/root/autodl-tmp/GeneFace/utils/commons/trainer.py", line 283, in train
for batch_idx, batch in enumerate(train_pbar):
File "/root/miniconda3/envs/geneface/lib/python3.9/site-packages/tqdm/std.py", line 1178, in __iter__
for obj in iterable:
File "/root/miniconda3/envs/geneface/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 530, in __next__
data = self._next_data()
File "/root/miniconda3/envs/geneface/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1207, in _next_data
idx, data = self._get_data()
File "/root/miniconda3/envs/geneface/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1163, in _get_data
success, data = self._try_get_data()
File "/root/miniconda3/envs/geneface/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1024, in _try_get_data
raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 7162) exited unexpectedly
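The traceback itself suggests the shared-memory limit is the culprit. Before changing any code, it may be worth checking how much `/dev/shm` the environment actually provides; containers (e.g. Docker, which many cloud GPU hosts use) often default to only 64 MB, which DataLoader workers can easily exhaust. A quick check, assuming a standard Linux environment:

```shell
# Show the size and usage of the shared-memory filesystem that
# DataLoader worker processes use to pass tensors to the main process.
df -h /dev/shm
```

If the size is small and you control the container launch, something like `docker run --shm-size=8g ...` (size here is just an example) raises the limit; on managed platforms this is usually a setting in the instance configuration.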
Reducing `--num-workers` for the DataLoader, or setting it to 0, should fix this problem.
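For reference, a minimal sketch of what that fix means at the PyTorch level (this is a toy dataset, not GeneFace's actual one): with `num_workers=0` all batches are loaded in the main process, so no worker subprocesses and no shared-memory segments are involved, at the cost of slower data loading.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in for the real dataset: 16 samples of 3 features each.
dataset = TensorDataset(torch.randn(16, 3), torch.randint(0, 2, (16,)))

# num_workers=0 disables worker subprocesses entirely, sidestepping
# the shm bus error at the cost of loading data in the main process.
loader = DataLoader(dataset, batch_size=4, num_workers=0)

for x, y in loader:
    print(x.shape, y.shape)  # each batch: [4, 3] and [4]
```

If some parallelism is still wanted, lowering `num_workers` to 1 or 2 instead of 0 reduces shm pressure while keeping background loading.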
Thank you very much for your reply! The error usually only shows up after about 4,000 steps, and the checkpoint from step 4000 already seems to work fine for me.
Indeed. In my run it only crashed at around 15K steps, but that has little effect on the final selected result.