
Train with shard mode breaks after every epoch

Open carcloudfly opened this issue 3 years ago • 2 comments

Describe the bug
While training my model in shard data mode with gloo, I get an error after every epoch and the task stops, like this:

/opt/conda/envs/wenet/lib/python3.8/multiprocessing/process.py:108: ResourceWarning: unclosed file <_io.BufferedReader name='/data/wenet_aishell2_shards/train/shards_000005738.tar'>
  self._target(*self._args, **self._kwargs)
ResourceWarning: Enable tracemalloc to get the object allocation traceback
Traceback (most recent call last):
  File "wenet/bin/train.py", line 262, in <module>
    main()
  File "wenet/bin/train.py", line 235, in main
    executor.train(model, optimizer, scheduler, train_data_loader, device,
  File "/workspace/wenet/examples/aishell2/s0/wenet/utils/executor.py", line 71, in train
    loss.backward()
  File "/opt/conda/envs/wenet/lib/python3.8/site-packages/torch/_tensor.py", line 255, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/conda/envs/wenet/lib/python3.8/site-packages/torch/autograd/__init__.py", line 147, in backward
    Variable._execution_engine.run_backward(
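The warning itself suggests the next debugging step: with tracemalloc enabled, Python appends an "Object allocated at" traceback to the ResourceWarning, showing where the leaked shard tar was opened. A minimal sketch of turning that on (nothing here is wenet-specific; the PYTHONTRACEMALLOC environment variable also works without touching code):

```python
# Sketch: make the ResourceWarning report where the unclosed shard tar was opened.
# Either export PYTHONTRACEMALLOC=10 before launching training, or (assumption:
# you are free to edit the entry point) add this near the top of wenet/bin/train.py
# before the DataLoader workers are created.
import tracemalloc
import warnings

tracemalloc.start(10)                              # keep 10 frames per allocation
warnings.simplefilter("always", ResourceWarning)   # make sure the warning is printed

# With tracing active, the "unclosed file <_io.BufferedReader ...>" warning is
# followed by "Object allocated at (most recent call last):", pointing at the
# open() call whose stream was never closed.
```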

To Reproduce
Steps to reproduce the behavior:

  1. Train a model in shard mode by setting --data_type "shard" (a sketch of the expected shard layout follows these steps).
  2. Wait for at least one epoch to complete; you may then hit the error.
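For reference, shard mode expects --train_data to be a data.list whose lines each point to one tar archive, with each tar grouping utterances as <key>.wav / <key>.txt pairs. A hedged sketch of writing one such shard (illustrative only; wenet has its own tooling for this, tools/make_shard_list.py):

```python
# Sketch (illustrative, not wenet's shard tool): write one shard tar in which
# every utterance contributes a <key>.wav and a <key>.txt member, then list the
# shard path in data.list for --data_type shard.
import io
import tarfile

def write_shard(shard_path, utts):
    """utts: iterable of (key, wav_bytes, text) tuples."""
    with tarfile.open(shard_path, "w") as tar:
        for key, wav_bytes, text in utts:
            for suffix, payload in ((".wav", wav_bytes), (".txt", text.encode("utf-8"))):
                info = tarfile.TarInfo(name=key + suffix)
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
```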

Expected behavior
Training in shard mode runs to completion across all epochs.

Screenshots
[screenshot attached]

Desktop (please complete the following information):

  • OS: Linux x86_64
  • Version: commit f972951275261ed14f3ba10f1b70716970f758ec (HEAD -> main)

carcloudfly avatar Nov 04 '21 01:11 carcloudfly

[screenshot]

carcloudfly avatar Nov 04 '21 01:11 carcloudfly

Is this the latest code? The stream should be closed by https://github.com/wenet-e2e/wenet/blob/main/wenet/dataset/processor.py#L109.

robin1001 avatar Nov 05 '21 02:11 robin1001
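For context, the close-up that link points at boils down to: once all members of a shard tar have been consumed, close the tarfile wrapper, reap the subprocess feeding it (if any), and close the underlying stream. A rough sketch of that pattern, not the exact processor.py code (the 'stream' and 'process' keys are assumptions based on the linked line):

```python
import tarfile

def tar_file_and_group(data):
    """Sketch: yield samples from each shard tar, then release its resources.

    `data` is assumed to yield dicts holding an open file-like 'stream' and,
    when the shard is piped from an external command, a 'process' handle.
    """
    for sample in data:
        stream = tarfile.open(fileobj=sample["stream"], mode="r|*")
        try:
            for tarinfo in stream:
                # ... read <key>.wav / <key>.txt members and yield grouped samples ...
                yield {"name": tarinfo.name}
        finally:
            stream.close()                        # close the tarfile wrapper
            if "process" in sample:
                sample["process"].communicate()   # wait for the feeding subprocess
            sample["stream"].close()              # close the BufferedReader from the warning
```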

Fixed, closing this issue.

xingchensong avatar Feb 21 '23 06:02 xingchensong