
Train with shard mode breaks after every epoch

Open carcloudfly opened this issue 3 years ago • 2 comments

Describe the bug
While training my model in shard data mode with gloo, I get an error after every epoch and the task stops, like this:

/opt/conda/envs/wenet/lib/python3.8/multiprocessing/process.py:108: ResourceWarning: unclosed file <_io.BufferedReader name='/data/wenet_aishell2_shards/train/shards_000005738.tar'>
  self._target(*self._args, **self._kwargs)
ResourceWarning: Enable tracemalloc to get the object allocation traceback
Traceback (most recent call last):
  File "wenet/bin/train.py", line 262, in <module>
    main()
  File "wenet/bin/train.py", line 235, in main
    executor.train(model, optimizer, scheduler, train_data_loader, device,
  File "/workspace/wenet/examples/aishell2/s0/wenet/utils/executor.py", line 71, in train
    loss.backward()
  File "/opt/conda/envs/wenet/lib/python3.8/site-packages/torch/_tensor.py", line 255, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/conda/envs/wenet/lib/python3.8/site-packages/torch/autograd/__init__.py", line 147, in backward
    Variable._execution_engine.run_backward(
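The warning itself suggests the next debugging step: with tracemalloc enabled, Python appends an "Object allocated at" traceback to the ResourceWarning, showing where the leaked shard tar was opened. A minimal sketch of turning that on (nothing here is wenet-specific; the PYTHONTRACEMALLOC environment variable also works without touching code):

```python
# Sketch: make the ResourceWarning report where the unclosed shard tar was opened.
# Either export PYTHONTRACEMALLOC=10 before launching training, or (assumption:
# you are free to edit the entry point) add this near the top of wenet/bin/train.py
# before the DataLoader workers are created.
import tracemalloc
import warnings

tracemalloc.start(10)                              # keep 10 frames per allocation
warnings.simplefilter("always", ResourceWarning)   # make sure the warning is printed

# With tracing active, the "unclosed file <_io.BufferedReader ...>" warning is
# followed by "Object allocated at (most recent call last):", pointing at the
# open() call whose stream was never closed.
```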

To Reproduce
Steps to reproduce the behavior:

  1. Train a model in shard mode by setting --data_type "shard" (a sketch of the expected shard layout follows these steps).
  2. Wait for at least one epoch to complete; you may then hit the error.
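For reference, shard mode expects --train_data to be a data.list whose lines each point to one tar archive, with each tar grouping utterances as <key>.wav / <key>.txt pairs. A hedged sketch of writing one such shard (illustrative only; wenet has its own tooling for this, tools/make_shard_list.py):

```python
# Sketch (illustrative, not wenet's shard tool): write one shard tar in which
# every utterance contributes a <key>.wav and a <key>.txt member, then list the
# shard path in data.list for --data_type shard.
import io
import tarfile

def write_shard(shard_path, utts):
    """utts: iterable of (key, wav_bytes, text) tuples."""
    with tarfile.open(shard_path, "w") as tar:
        for key, wav_bytes, text in utts:
            for suffix, payload in ((".wav", wav_bytes), (".txt", text.encode("utf-8"))):
                info = tarfile.TarInfo(name=key + suffix)
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
```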

Expected behavior
Training in shard mode runs to completion across all epochs.

Screenshots
[screenshot attached]

Desktop (please complete the following information):

  • OS: Linux x86_64
  • Version: commit f972951275261ed14f3ba10f1b70716970f758ec (HEAD -> main)

carcloudfly avatar Nov 04 '21 01:11 carcloudfly

[screenshot]

carcloudfly avatar Nov 04 '21 01:11 carcloudfly

Is this the latest code? The stream should be closed by https://github.com/wenet-e2e/wenet/blob/main/wenet/dataset/processor.py#L109.

robin1001 avatar Nov 05 '21 02:11 robin1001
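For context, the close-up that link points at boils down to: once all members of a shard tar have been consumed, close the tarfile wrapper, reap the subprocess feeding it (if any), and close the underlying stream. A rough sketch of that pattern, not the exact processor.py code (the 'stream' and 'process' keys are assumptions based on the linked line):

```python
import tarfile

def tar_file_and_group(data):
    """Sketch: yield samples from each shard tar, then release its resources.

    `data` is assumed to yield dicts holding an open file-like 'stream' and,
    when the shard is piped from an external command, a 'process' handle.
    """
    for sample in data:
        stream = tarfile.open(fileobj=sample["stream"], mode="r|*")
        try:
            for tarinfo in stream:
                # ... read <key>.wav / <key>.txt members and yield grouped samples ...
                yield {"name": tarinfo.name}
        finally:
            stream.close()                        # close the tarfile wrapper
            if "process" in sample:
                sample["process"].communicate()   # wait for the feeding subprocess
            sample["stream"].close()              # close the BufferedReader from the warning
```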

Fixed, closing this issue.

xingchensong avatar Feb 21 '23 06:02 xingchensong