
Some warnings for training

dospeech opened this issue 1 year ago • 4 comments

When fine-tuning the downloaded speech_paraformer_asr_nat-zh-cn-8k-common-vocab8358-tensorflow1 model on my own dataset with finetune.sh, some warnings appear, as follows:

```
/usr/local/lib/python3.8/site-packages/torch/autograd/__init__.py:197: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.
grad.sizes() = [1, 320], strides() = [1, 1]
bucket_view.sizes() = [1, 320], strides() = [320, 1] (Triggered internally at ../torch/csrc/distributed/c10d/reducer.cpp:325.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
```

Question: how serious is this warning? Will it actually hurt training performance, and if so, how should it be fixed?
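For context, here is a minimal, self-contained sketch (an assumption about the likely cause, not FunASR internals) that reproduces the exact shapes and strides in the warning: a `[1, 320]` weight used through a transpose in the forward pass receives a gradient with strides `[1, 1]` instead of the parameter's row-major `[320, 1]`, which is precisely the mismatch DDP complains about when copying the grad into its flattened bucket. The sketch runs outside DDP, so it shows only the stride mismatch, not the warning itself:

```python
import torch
import torch.nn as nn

# Minimal sketch (assumption, not FunASR code): a [1, 320] weight used via a
# transpose gets a grad with strides (1, 1) rather than the parameter's (320, 1).
lin = nn.Linear(320, 1)            # weight shape [1, 320], as in the warning
x = torch.randn(4, 320)

(x @ lin.weight.t()).sum().backward()

print(lin.weight.stride())         # (320, 1) -- row-major parameter layout
print(lin.weight.grad.stride())    # (1, 1)   -- mismatched, as the warning reports
```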

**The detailed log output is as follows** (per-rank repeats of identical lines are condensed with `...` for readability):

```
[2024-03-20 11:52:34,002][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 3
... (the same "Added key" line is printed for each of ranks 0-7)
[2024-03-20 11:52:34,148][torch.distributed.distributed_c10d][INFO] - Rank 4: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
... (the same "Completed store-based barrier" line is printed for each of ranks 0-7)
[2024-03-20 11:52:34,511][root][INFO] - config.yaml is saved to: ./exp/offline_8k_paraformer/config.yaml
[2024-03-20 11:52:35,126][root][INFO] - init_param is not None: ('exp/speech_paraformer_asr_nat-zh-cn-8k-common-vocab8358-tensorflow1/model.pt',)
[2024-03-20 11:52:35,127][root][INFO] - Loading pretrained params from exp/speech_paraformer_asr_nat-zh-cn-8k-common-vocab8358-tensorflow1/model.pt
ckpt: exp/speech_paraformer_asr_nat-zh-cn-8k-common-vocab8358-tensorflow1/model.pt
... (the same init_param / Loading / ckpt lines are printed once per rank)
[2024-03-20 11:52:52,484][root][INFO] - total_num of samplers across ranks: 2042300
[2024-03-20 11:52:52,516][root][INFO] - total_num of samplers across ranks: 5281
No checkpoint found at './exp/offline_8k_paraformer/model.pt', does not resume status!
... (the train/valid sampler counts and the "No checkpoint found" notice are printed once per rank)
rank: 2, Training Epoch: 1:   0%| | 0/63822 [00:00<?, ?it/s]
/usr/local/lib/python3.8/site-packages/torch/autograd/__init__.py:197: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.
grad.sizes() = [1, 320], strides() = [1, 1]
bucket_view.sizes() = [1, 320], strides() = [320, 1] (Triggered internally at ../torch/csrc/distributed/c10d/reducer.cpp:325.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
... (the same UserWarning is printed once per rank)
[2024-03-20 11:52:56,647][torch.nn.parallel.distributed][INFO] - Reducer buckets have been rebuilt in this iteration.
... (the same "Reducer buckets" line is printed for all 8 ranks)
2024-03-20 11:53:32, rank: 6, epoch: 0/20, step: 50/63822, total step: 50, (loss: 0.763), (lr: 3.400e-07), [('loss_att', 0.696), ('acc', 0.615), ('loss_pre', 0.067), ('loss', 0.763)], {'data_load': '0.000', 'forward_time': '0.627', 'backward_time': '0.202', 'optim_time': '0.125', 'total_time': '0.955'}, GPU, memory: 1.186 GB, 7.515 GB, 4.203 GB, 8.023 GB
... (a similar step-50 progress line is printed for each of the other ranks)
```
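If the warning turns out to measurably slow training, one mitigation that may be worth trying (my own assumption, not something the FunASR scripts configure) is constructing DDP with `gradient_as_bucket_view=True`, which makes each `.grad` a view into the communication bucket, so the grad-to-bucket copy that triggers the warning is avoided. A minimal single-process sketch:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn

# Single-process sketch (assumption, not FunASR's training loop): with
# gradient_as_bucket_view=True, grads alias DDP's flattened buckets directly.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = nn.Linear(320, 1)
ddp = nn.parallel.DistributedDataParallel(model, gradient_as_bucket_view=True)
ddp(torch.randn(4, 320)).sum().backward()

dist.destroy_process_group()
```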

❓ Help

Code:

```shell
########################
# exp output dir
output_dir="./exp/offline_8k_paraformer"
log_file="${output_dir}/log.txt"

mkdir -p ${output_dir}
echo "log_file: ${log_file}"

# gpu_num, train_data and val_data are set earlier in finetune.sh
torchrun \
  --nnodes 1 \
  --nproc_per_node ${gpu_num} \
  ../../../funasr/bin/train.py \
  ++model="exp/speech_paraformer_asr_nat-zh-cn-8k-common-vocab8358-tensorflow1" \
  ++model_revision="v2.0.4" \
  ++train_data_set_list="${train_data}" \
  ++valid_data_set_list="${val_data}" \
  ++dataset_conf.batch_size=32 \
  ++dataset_conf.batch_type="example" \
  ++dataset_conf.num_workers=4 \
  ++train_conf.max_epoch=20 \
  ++optim_conf.lr=0.0002 \
  ++output_dir="${output_dir}" &> ${log_file}
```

What's your environment?

  • OS (e.g., Linux): Debian GNU/Linux 10
  • FunASR Version (e.g., 1.0.0): 1.0.15
  • ModelScope Version (e.g., 1.11.0): 1.110
  • PyTorch Version (e.g., 2.0.0): 1.13.0+cu117
  • How you installed funasr (pip, source): source
  • Python version: 3.8
  • GPU (e.g., V100M32): V100M32

dospeech · Mar 20 '24 12:03

We have updated the training code with a bugfix for an OOM issue. Please update FunASR and try again.

LauraGPT · Mar 21 '24 06:03

> We have updated the training code with a bugfix for an OOM issue. Please update FunASR and try again.

Thanks for the update; paraformer_streaming now trains normally. As for this problem, after updating the code it still reports the same warning (so I decided to train for a while and then test whether there are any obvious problems):

```
[2024-03-21 11:15:13,728][root][INFO] - Train epoch: 0, rank: 1
[2024-03-21 11:15:13,729][root][INFO] - Train epoch: 0, rank: 3
[2024-03-21 11:15:13,730][root][INFO] - Train epoch: 0, rank: 7
[2024-03-21 11:15:13,730][root][INFO] - Train epoch: 0, rank: 2
[2024-03-21 11:15:13,732][root][INFO] - Train epoch: 0, rank: 0
/usr/local/lib/python3.8/site-packages/torch/autograd/__init__.py:197: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.
grad.sizes() = [1, 320], strides() = [1, 1]
bucket_view.sizes() = [1, 320], strides() = [320, 1] (Triggered internally at ../torch/csrc/distributed/c10d/reducer.cpp:325.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
... (the same UserWarning is printed once per rank)
```
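Since the warning text itself says "This is not an error, but may impair performance", another option while testing (a workaround sketch of my own, not project guidance) is to time a few hundred steps and, if throughput looks normal, simply silence the message:

```python
import warnings

# Workaround sketch (assumption, not FunASR guidance): the message argument is
# a regex matched against the start of the warning text.
warnings.filterwarnings(
    "ignore",
    message="Grad strides do not match bucket view strides",
)
```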

dospeech · Mar 21 '24 11:03

Please update torch

LauraGPT · Mar 21 '24 12:03

> Please update torch

Thanks again!

dospeech · Mar 21 '24 12:03