panpan-wu


Training uses dynamic batching (batch_type: 'dynamic'). When dynamic batching, gradient accumulation, and multi-GPU training are used together, is the expectation of the resulting gradient still consistent with the true gradient?
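A minimal sketch of the concern (toy numbers, not WeNet code): when micro-batches (or per-GPU batches) have different sizes, averaging the per-batch mean losses weights small batches more heavily than large ones, so the accumulated objective is not the global mean over all samples, and the same mismatch carries over to the gradients.

```python
# Hypothetical per-sample losses grouped into dynamic batches of unequal size,
# as batch_type 'dynamic' can produce.
losses = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
batches = [losses[0:1], losses[1:3], losses[3:6]]  # sizes 1, 2, 3

# True objective: mean over all samples.
global_mean = sum(losses) / len(losses)  # 3.5

# Gradient-accumulation / per-rank averaging style: mean of per-batch means.
mean_of_means = sum(sum(b) / len(b) for b in batches) / len(batches)  # ~2.83

print(global_mean, mean_of_means)  # the two differ whenever batch sizes differ
```

Since the gradient of a mean loss is the mean of per-sample gradients, the same weighting discrepancy applies to the accumulated gradient itself.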

After switching to static batching, training no longer hangs.

For now I have found a workaround: control the epoch loop from the shell, so that train.py only trains a single epoch per invocation.

```bash
#!/bin/bash
# export CUDA_VISIBLE_DEVICES="0,1,2,3"
export CUDA_VISIBLE_DEVICES="0,1,2,3"
dir=exp/conformer
cmvn=true
# checkpoint=exp/20210815_unified_conformer_server/final.pt
# checkpoint=exp/conformer/init.pt
# bpemodel=conf/train_960_unigram5000.model
# The NCCL_SOCKET_IFNAME variable specifies which IP interface...
```

batch_conf configuration:

```yaml
batch_conf:
    # batch_type: 'static' # static or dynamic
    batch_type: 'dynamic'  # static or dynamic
    # batch_size: 4
    max_frames_in_batch: 20000
```
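A rough sketch of what 'dynamic' batching does (simplified; not the actual WeNet dataset code, and the utterance lengths below are made up): utterances are accumulated until their total frame count would exceed max_frames_in_batch, so both the number of utterances per batch and the number of batches per epoch vary. With DDP, different ranks can therefore end up with different batch counts, which is what sets up the hang.

```python
from typing import List, Tuple

def dynamic_batches(utts: List[Tuple[str, int]], max_frames_in_batch: int):
    """Group (key, num_frames) items until the frame budget would be exceeded.

    Simplified illustration of batch_type 'dynamic'; not the WeNet implementation.
    """
    batches, cur, cur_frames = [], [], 0
    for key, frames in utts:
        if cur and cur_frames + frames > max_frames_in_batch:
            batches.append(cur)
            cur, cur_frames = [], 0
        cur.append(key)
        cur_frames += frames
    if cur:
        batches.append(cur)
    return batches

# Hypothetical shards: each rank sees different utterances, so batch counts differ.
rank0 = dynamic_batches([("a", 900), ("b", 300), ("c", 800)], max_frames_in_batch=1000)
rank1 = dynamic_batches([("d", 200), ("e", 200), ("f", 300)], max_frames_in_batch=1000)
print(len(rank0), len(rank1))  # 3 vs 1: one rank finishes its epoch much earlier
```

When one rank runs out of batches before the others, collective operations no longer line up across ranks and the processes wait on each other indefinitely.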

I made some changes to the code and it no longer hangs.

wenet/bin/train.py

```python
for epoch in range(start_epoch, num_epochs):
    # unrelated code omitted
    if distributed:
        dist.barrier()
```

wenet/utils/executor.py

```python
def cv(self, model, data_loader, device, args):
    """Cross validation on"""
    if isinstance(model, torch.nn.parallel.DistributedDataParallel):
        model = ...
```
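The snippets above are truncated, so the following is only my reading of the idea behind them, not the exact diff: synchronize all ranks with a barrier at the end of each epoch, and run cross-validation on the unwrapped module rather than the DDP wrapper. The surrounding function signatures here are hypothetical.

```python
import torch
import torch.distributed as dist

# wenet/bin/train.py (sketch): make every rank wait at the end of each epoch,
# so a rank that produced fewer dynamic batches cannot run ahead.
def train_loop(executor, model, train_loader, cv_loader, device, args,
               start_epoch, num_epochs, distributed):
    for epoch in range(start_epoch, num_epochs):
        executor.train(model, train_loader, device, args)
        executor.cv(model, cv_loader, device, args)
        if distributed:
            dist.barrier()  # sync ranks before the next epoch

# wenet/utils/executor.py (sketch): evaluate the underlying module so CV does
# not go through DDP's collective machinery.
def cv(self, model, data_loader, device, args):
    """Cross validation."""
    if isinstance(model, torch.nn.parallel.DistributedDataParallel):
        model = model.module  # unwrap DDP for evaluation
    ...
```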

> @panpan-wu So with dynamic batching you have to keep changing the epoch by hand? (facepalm)

Pretty much, though I do it with a shell script: for epoch in `seq $start_epoch $end_epoch`; do ... done. Later I changed some code and it no longer hangs; see my comment above.