
Some warnings for training

dospeech opened this issue 1 year ago • 4 comments

When fine-tuning the downloaded speech_paraformer_asr_nat-zh-cn-8k-common-vocab8358-tensorflow1 model on my own dataset with finetune.sh, some warnings appear, as follows:

```
/usr/local/lib/python3.8/site-packages/torch/autograd/__init__.py:197: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.
grad.sizes() = [1, 320], strides() = [1, 1]
bucket_view.sizes() = [1, 320], strides() = [320, 1] (Triggered internally at ../torch/csrc/distributed/c10d/reducer.cpp:325.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
```

Question: how serious is this warning? Will it actually hurt training performance, and if so, how should it be fixed?
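For context, here is a minimal, self-contained sketch (an assumption about the likely cause, not FunASR internals) that reproduces the exact shapes and strides in the warning: a `[1, 320]` weight used through a transpose in the forward pass receives a gradient with strides `[1, 1]` instead of the parameter's row-major `[320, 1]`, which is precisely the mismatch DDP complains about when copying the grad into its flattened bucket. The sketch runs outside DDP, so it shows only the stride mismatch, not the warning itself:

```python
import torch
import torch.nn as nn

# Minimal sketch (assumption, not FunASR code): a [1, 320] weight used via a
# transpose gets a grad with strides (1, 1) rather than the parameter's (320, 1).
lin = nn.Linear(320, 1)            # weight shape [1, 320], as in the warning
x = torch.randn(4, 320)

(x @ lin.weight.t()).sum().backward()

print(lin.weight.stride())         # (320, 1) -- row-major parameter layout
print(lin.weight.grad.stride())    # (1, 1)   -- mismatched, as the warning reports
```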

**The detailed log output is as follows** (per-rank repeats of identical lines are condensed with `...` for readability):

```
[2024-03-20 11:52:34,002][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 3
... (the same "Added key" line is printed for each of ranks 0-7)
[2024-03-20 11:52:34,148][torch.distributed.distributed_c10d][INFO] - Rank 4: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
... (the same "Completed store-based barrier" line is printed for each of ranks 0-7)
[2024-03-20 11:52:34,511][root][INFO] - config.yaml is saved to: ./exp/offline_8k_paraformer/config.yaml
[2024-03-20 11:52:35,126][root][INFO] - init_param is not None: ('exp/speech_paraformer_asr_nat-zh-cn-8k-common-vocab8358-tensorflow1/model.pt',)
[2024-03-20 11:52:35,127][root][INFO] - Loading pretrained params from exp/speech_paraformer_asr_nat-zh-cn-8k-common-vocab8358-tensorflow1/model.pt
ckpt: exp/speech_paraformer_asr_nat-zh-cn-8k-common-vocab8358-tensorflow1/model.pt
... (the same init_param / Loading / ckpt lines are printed once per rank)
[2024-03-20 11:52:52,484][root][INFO] - total_num of samplers across ranks: 2042300
[2024-03-20 11:52:52,516][root][INFO] - total_num of samplers across ranks: 5281
No checkpoint found at './exp/offline_8k_paraformer/model.pt', does not resume status!
... (the train/valid sampler counts and the "No checkpoint found" notice are printed once per rank)
rank: 2, Training Epoch: 1:   0%| | 0/63822 [00:00<?, ?it/s]
/usr/local/lib/python3.8/site-packages/torch/autograd/__init__.py:197: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.
grad.sizes() = [1, 320], strides() = [1, 1]
bucket_view.sizes() = [1, 320], strides() = [320, 1] (Triggered internally at ../torch/csrc/distributed/c10d/reducer.cpp:325.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
... (the same UserWarning is printed once per rank)
[2024-03-20 11:52:56,647][torch.nn.parallel.distributed][INFO] - Reducer buckets have been rebuilt in this iteration.
... (the same "Reducer buckets" line is printed for all 8 ranks)
2024-03-20 11:53:32, rank: 6, epoch: 0/20, step: 50/63822, total step: 50, (loss: 0.763), (lr: 3.400e-07), [('loss_att', 0.696), ('acc', 0.615), ('loss_pre', 0.067), ('loss', 0.763)], {'data_load': '0.000', 'forward_time': '0.627', 'backward_time': '0.202', 'optim_time': '0.125', 'total_time': '0.955'}, GPU, memory: 1.186 GB, 7.515 GB, 4.203 GB, 8.023 GB
... (a similar step-50 progress line is printed for each of the other ranks)
```
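If the warning turns out to measurably slow training, one mitigation that may be worth trying (my own assumption, not something the FunASR scripts configure) is constructing DDP with `gradient_as_bucket_view=True`, which makes each `.grad` a view into the communication bucket, so the grad-to-bucket copy that triggers the warning is avoided. A minimal single-process sketch:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn

# Single-process sketch (assumption, not FunASR's training loop): with
# gradient_as_bucket_view=True, grads alias DDP's flattened buckets directly.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = nn.Linear(320, 1)
ddp = nn.parallel.DistributedDataParallel(model, gradient_as_bucket_view=True)
ddp(torch.randn(4, 320)).sum().backward()

dist.destroy_process_group()
```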

❓ Help

Code:

```shell
########################
# exp output dir
output_dir="./exp/offline_8k_paraformer"
log_file="${output_dir}/log.txt"

mkdir -p ${output_dir}
echo "log_file: ${log_file}"

# gpu_num, train_data and val_data are set earlier in finetune.sh
torchrun \
  --nnodes 1 \
  --nproc_per_node ${gpu_num} \
  ../../../funasr/bin/train.py \
  ++model="exp/speech_paraformer_asr_nat-zh-cn-8k-common-vocab8358-tensorflow1" \
  ++model_revision="v2.0.4" \
  ++train_data_set_list="${train_data}" \
  ++valid_data_set_list="${val_data}" \
  ++dataset_conf.batch_size=32 \
  ++dataset_conf.batch_type="example" \
  ++dataset_conf.num_workers=4 \
  ++train_conf.max_epoch=20 \
  ++optim_conf.lr=0.0002 \
  ++output_dir="${output_dir}" &> ${log_file}
```

What's your environment?

  • OS (e.g., Linux): Debian GNU/Linux 10
  • FunASR Version (e.g., 1.0.0): 1.0.15
  • ModelScope Version (e.g., 1.11.0): 1.110
  • PyTorch Version (e.g., 2.0.0): 1.13.0+cu117
  • How you installed funasr (pip, source): source
  • Python version: 3.8
  • GPU (e.g., V100M32): V100M32

dospeech · Mar 20 '24 12:03

We have updated the training code with a bugfix for an OOM issue. Please update FunASR and try again.

LauraGPT · Mar 21 '24 06:03

> We have updated the training code with a bugfix for an OOM issue. Please update FunASR and try again.

Thanks for the update; paraformer_streaming now trains normally. As for this problem, after updating the code it still reports the same warning (so I decided to train for a while and then test whether there are any obvious problems):

```
[2024-03-21 11:15:13,728][root][INFO] - Train epoch: 0, rank: 1
[2024-03-21 11:15:13,729][root][INFO] - Train epoch: 0, rank: 3
[2024-03-21 11:15:13,730][root][INFO] - Train epoch: 0, rank: 7
[2024-03-21 11:15:13,730][root][INFO] - Train epoch: 0, rank: 2
[2024-03-21 11:15:13,732][root][INFO] - Train epoch: 0, rank: 0
/usr/local/lib/python3.8/site-packages/torch/autograd/__init__.py:197: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.
grad.sizes() = [1, 320], strides() = [1, 1]
bucket_view.sizes() = [1, 320], strides() = [320, 1] (Triggered internally at ../torch/csrc/distributed/c10d/reducer.cpp:325.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
... (the same UserWarning is printed once per rank)
```
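Since the warning text itself says "This is not an error, but may impair performance", another option while testing (a workaround sketch of my own, not project guidance) is to time a few hundred steps and, if throughput looks normal, simply silence the message:

```python
import warnings

# Workaround sketch (assumption, not FunASR guidance): the message argument is
# a regex matched against the start of the warning text.
warnings.filterwarnings(
    "ignore",
    message="Grad strides do not match bucket view strides",
)
```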

dospeech · Mar 21 '24 11:03

Please update torch

LauraGPT · Mar 21 '24 12:03

> Please update torch

Thanks again!

dospeech · Mar 21 '24 12:03