FunASR
Some warnings during training
When fine-tuning the downloaded speech_paraformer_asr_nat-zh-cn-8k-common-vocab8358-tensorflow1 model on my own dataset with finetune.sh, I get warnings like the following:

```
/usr/local/lib/python3.8/site-packages/torch/autograd/__init__.py:197: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.
grad.sizes() = [1, 320], strides() = [1, 1]
bucket_view.sizes() = [1, 320], strides() = [320, 1] (Triggered internally at ../torch/csrc/distributed/c10d/reducer.cpp:325.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
```
Question: how serious is this warning? Will it actually hurt training performance, and if so, how should it be fixed?
The detailed log is as follows (per-rank repeats of identical lines are elided):

```
[2024-03-20 11:52:34,002][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 3
    ... (the same line is logged for each of the 8 ranks)
[2024-03-20 11:52:34,148][torch.distributed.distributed_c10d][INFO] - Rank 4: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
    ... (the same line is logged for each of the 8 ranks)
[2024-03-20 11:52:34,511][root][INFO] - config.yaml is saved to: ./exp/offline_8k_paraformer/config.yaml
[2024-03-20 11:52:35,126][root][INFO] - init_param is not None: ('exp/speech_paraformer_asr_nat-zh-cn-8k-common-vocab8358-tensorflow1/model.pt',)
[2024-03-20 11:52:35,127][root][INFO] - Loading pretrained params from exp/speech_paraformer_asr_nat-zh-cn-8k-common-vocab8358-tensorflow1/model.pt
ckpt: exp/speech_paraformer_asr_nat-zh-cn-8k-common-vocab8358-tensorflow1/model.pt
    ... (the same three lines are logged once per rank)
[2024-03-20 11:52:52,484][root][INFO] - total_num of samplers across ranks: 2042300
[2024-03-20 11:52:52,516][root][INFO] - total_num of samplers across ranks: 5281
No checkpoint found at './exp/offline_8k_paraformer/model.pt', does not resume status!
    ... (the two total_num lines and the "No checkpoint found" line are logged once per rank)
rank: 2, Training Epoch: 1:   0%| | 0/63822 [00:00<?, ?it/s]
/usr/local/lib/python3.8/site-packages/torch/autograd/__init__.py:197: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.
grad.sizes() = [1, 320], strides() = [1, 1]
bucket_view.sizes() = [1, 320], strides() = [320, 1] (Triggered internally at ../torch/csrc/distributed/c10d/reducer.cpp:325.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    ... (the same UserWarning is emitted on every rank)
[2024-03-20 11:52:56,647][torch.nn.parallel.distributed][INFO] - Reducer buckets have been rebuilt in this iteration.
    ... (the same line is logged for each of the 8 ranks)
2024-03-20 11:53:32, rank: 6, epoch: 0/20, step: 50/63822, total step: 50, (loss: 0.763), (lr: 3.400e-07), [('loss_att', 0.696), ('acc', 0.615), ('loss_pre', 0.067), ('loss', 0.763)], {'data_load': '0.000', 'forward_time': '0.627', 'backward_time': '0.202', 'optim_time': '0.125', 'total_time': '0.955'}, GPU, memory: 1.186 GB, 7.515 GB, 4.203 GB, 8.023 GB
    ... (similar step-50 progress lines are printed by ranks 7, 4, 0, 1 and 2)
```
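For context, this warning is about performance only: DDP pre-allocates flat gradient buckets, and when a gradient arrives with strides that do not match the bucket view (here `[1, 1]` vs `[320, 1]` for a parameter of shape `(1, 320)`, per the log), it must do an extra strided copy on every backward pass; numerical results are unaffected. To locate which parameter triggers it, a small check along these lines can be run after a single backward pass. This is a plain-PyTorch sketch, not a FunASR utility, and `find_stride_mismatches` is a hypothetical name:

```python
import torch

def find_stride_mismatches(model: torch.nn.Module) -> None:
    # Heuristic check: DDP's gradient layout contract expects each grad
    # to have the same strides as its parameter, which is also how the
    # bucket views are laid out. Run this after one loss.backward() on
    # the bare (unwrapped) model and it prints the offending parameters.
    for name, param in model.named_parameters():
        if param.grad is not None and param.grad.stride() != param.stride():
            print(f"{name}: grad strides {param.grad.stride()} "
                  f"vs param strides {param.stride()}, "
                  f"shape {tuple(param.shape)}")
```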
❓ Help
The finetune.sh script:

```bash
########################
# exp output dir
output_dir="./exp/offline_8k_paraformer"
log_file="${output_dir}/log.txt"

mkdir -p ${output_dir}
echo "log_file: ${log_file}"

# gpu_num, train_data and val_data are set earlier in the script
torchrun \
    --nnodes 1 \
    --nproc_per_node ${gpu_num} \
    ../../../funasr/bin/train.py \
    ++model="exp/speech_paraformer_asr_nat-zh-cn-8k-common-vocab8358-tensorflow1" \
    ++model_revision="v2.0.4" \
    ++train_data_set_list="${train_data}" \
    ++valid_data_set_list="${val_data}" \
    ++dataset_conf.batch_size=32 \
    ++dataset_conf.batch_type="example" \
    ++dataset_conf.num_workers=4 \
    ++train_conf.max_epoch=20 \
    ++optim_conf.lr=0.0002 \
    ++output_dir="${output_dir}" &> ${log_file}
```
What's your environment?

- OS (e.g., Linux): Debian GNU/Linux 10
- FunASR Version (e.g., 1.0.0): 1.0.15
- ModelScope Version (e.g., 1.11.0): 1.110
- PyTorch Version (e.g., 2.0.0): 1.13.0+cu117
- How you installed funasr (pip, source): source
- Python version: 3.8
- GPU (e.g., V100M32): V100M32
We have updated the training code with a bugfix for the OOM issue. Please update funasr and try it again.
> We have updated the training code with a bugfix for the OOM issue. Please update funasr and try it again.

Thanks for the update; paraformer_streaming now trains normally. As for this problem, after updating the code it still reports the same warning (so I decided to train for a while and then test whether there are any obvious problems):

```
[2024-03-21 11:15:13,728][root][INFO] - Train epoch: 0, rank: 1
[2024-03-21 11:15:13,729][root][INFO] - Train epoch: 0, rank: 3
[2024-03-21 11:15:13,730][root][INFO] - Train epoch: 0, rank: 7
[2024-03-21 11:15:13,730][root][INFO] - Train epoch: 0, rank: 2
[2024-03-21 11:15:13,732][root][INFO] - Train epoch: 0, rank: 0
/usr/local/lib/python3.8/site-packages/torch/autograd/__init__.py:197: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.
grad.sizes() = [1, 320], strides() = [1, 1]
bucket_view.sizes() = [1, 320], strides() = [320, 1] (Triggered internally at ../torch/csrc/distributed/c10d/reducer.cpp:325.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    ... (the same UserWarning is emitted on each rank)
```
Please update torch
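If upgrading torch does not make the warning go away and profiling shows it actually costs time, one generic PyTorch-level workaround is to hand DDP gradients that already use the default dense layout. This is a hedged sketch, not part of FunASR, and `force_default_grad_layout` is a hypothetical helper name; note it trades the warning for one explicit copy per affected gradient per step, so it mainly silences the message rather than speeding anything up:

```python
import torch

def force_default_grad_layout(model: torch.nn.Module) -> None:
    """Re-lay-out every gradient with standard row-major strides.

    Hypothetical workaround: a tensor hook's return value replaces the
    gradient, so DDP's reducer receives a tensor whose strides match
    the bucket view it allocated (e.g. [320, 1] instead of [1, 1]).
    """
    for p in model.parameters():
        if p.requires_grad:
            p.register_hook(
                lambda g: g.clone(memory_format=torch.contiguous_format))

# Usage sketch: call on the bare model *before* wrapping it in
# torch.nn.parallel.DistributedDataParallel, e.g. in a local copy of
# funasr/bin/train.py.
```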
> Please update torch
Thanks again!