
Fine-tuning automatically deletes ep checkpoint files, so the required ep files cannot be found after fine-tuning finishes

Open bird-9 opened this issue 3 months ago • 3 comments

🐛 Bug

Fine-tuning automatically deletes ep checkpoint files, so the ep files needed after fine-tuning finishes cannot be found.

Code sample

Training parameters:

torchrun \
--nnodes 1 \
--node_rank 0 \
--nproc_per_node ${gpu_num} \
../../../funasr/bin/train.py \
++model="${model_name_or_model_dir}" \
++train_data_set_list="${train_data}" \
++valid_data_set_list="${val_data}" \
++dataset_conf.batch_size=40000 \
++dataset_conf.batch_type="token" \
++dataset_conf.num_workers=8 \
++train_conf.max_epoch=100 \
++train_conf.log_interval=1 \
++train_conf.resume=false \
++train_conf.validate_interval=2000 \
++train_conf.save_checkpoint_interval=2000 \
++train_conf.keep_nbest_models=20 \
++train_conf.avg_nbest_model=10 \
++optim_conf.lr=0.0002 \
++output_dir="${output_dir}" &> ${log_file}

Error log:

Looking at the outputs directory, only 20 ep files are kept at the end, which leads to "Checkpoint file not found".

[2024-04-26 07:54:21,218][root][INFO] - Update best acc: 0.1071, /diskb/code/dev/FunASR/examples/industrial_data_pretraining/paraformer/outputs/model.pt.best
[2024-04-26 07:54:21,220][root][INFO] - Delete: /diskb/code/dev/FunASR/examples/industrial_data_pretraining/paraformer/outputs/model.pt.ep80  (note: during training it deletes some ep files)
[2024-04-26 07:54:21,367][root][INFO] - rank: 0, time_escaped_epoch: 0.014 hours, estimated to finish 100 epoch: 0.000 hours

average_checkpoints: ['/diskb/code/dev/FunASR/examples/industrial_data_pretraining/paraformer/outputs/model.pt.ep0', '/diskb/code/dev/FunASR/examples/industrial_data_pretraining/paraformer/outputs/model.pt.ep1', '/diskb/code/dev/FunASR/examples/industrial_data_pretraining/paraformer/outputs/model.pt.ep2', '/diskb/code/dev/FunASR/examples/industrial_data_pretraining/paraformer/outputs/model.pt.ep3', '/diskb/code/dev/FunASR/examples/industrial_data_pretraining/paraformer/outputs/model.pt.ep4', '/diskb/code/dev/FunASR/examples/industrial_data_pretraining/paraformer/outputs/model.pt.ep5', '/diskb/code/dev/FunASR/examples/industrial_data_pretraining/paraformer/outputs/model.pt.ep6', '/diskb/code/dev/FunASR/examples/industrial_data_pretraining/paraformer/outputs/model.pt.ep7', '/diskb/code/dev/FunASR/examples/industrial_data_pretraining/paraformer/outputs/model.pt.ep8', '/diskb/code/dev/FunASR/examples/industrial_data_pretraining/paraformer/outputs/model.pt.ep9']
Checkpoint file /diskb/code/dev/FunASR/examples/industrial_data_pretraining/paraformer/outputs/model.pt.ep0 not found.
Checkpoint file /diskb/code/dev/FunASR/examples/industrial_data_pretraining/paraformer/outputs/model.pt.ep1 not found.
Checkpoint file /diskb/code/dev/FunASR/examples/industrial_data_pretraining/paraformer/outputs/model.pt.ep2 not found.
Checkpoint file /diskb/code/dev/FunASR/examples/industrial_data_pretraining/paraformer/outputs/model.pt.ep3 not found.
Checkpoint file /diskb/code/dev/FunASR/examples/industrial_data_pretraining/paraformer/outputs/model.pt.ep4 not found.
Checkpoint file /diskb/code/dev/FunASR/examples/industrial_data_pretraining/paraformer/outputs/model.pt.ep5 not found.
Checkpoint file /diskb/code/dev/FunASR/examples/industrial_data_pretraining/paraformer/outputs/model.pt.ep6 not found.
Checkpoint file /diskb/code/dev/FunASR/examples/industrial_data_pretraining/paraformer/outputs/model.pt.ep7 not found.
Checkpoint file /diskb/code/dev/FunASR/examples/industrial_data_pretraining/paraformer/outputs/model.pt.ep8 not found.
Checkpoint file /diskb/code/dev/FunASR/examples/industrial_data_pretraining/paraformer/outputs/model.pt.ep9 not found.
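To check whether the missing files were removed during training, the `Delete:` lines in the training log can be collected. The following is a minimal sketch, assuming the log lines keep the format shown above; the function name `deleted_epochs` and the regular expression are illustrative and are not part of FunASR.

```python
import re
from typing import Iterable, List

# Pattern inferred from the sample log line above
# ("... - Delete: <path>/model.pt.ep80"); it is an assumption about the
# log format, not taken from FunASR's logging code.
DELETE_RE = re.compile(r"- Delete: (\S+\.ep(\d+))")


def deleted_epochs(log_lines: Iterable[str]) -> List[int]:
    """Return the sorted epoch numbers of checkpoints the log reports as deleted."""
    epochs = []
    for line in log_lines:
        match = DELETE_RE.search(line)
        if match:
            epochs.append(int(match.group(2)))
    return sorted(epochs)
```

Passing the opened log file (for example the file written to `${log_file}`) to `deleted_epochs` gives the epochs whose checkpoints were deleted, which can then be compared with the `average_checkpoints` list.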

Expected behavior

Environment

  • OS (e.g., Linux): ubuntu
  • FunASR Version (e.g., 1.0.0): 1.0.25
  • ModelScope Version (e.g., 1.11.0):
  • PyTorch Version (e.g., 2.0.0): 2.3.0
  • How you installed funasr (pip, source): pip
  • Python version: 3.10.14
  • GPU (e.g., V100M32) 3096
  • CUDA/cuDNN version (e.g., cuda11.7):
  • Docker version (e.g., funasr-runtime-sdk-cpu-0.4.1)
  • Any other relevant information:


bird-9 Apr 26 '24 08:04

I'm seeing this too. The files were there originally but got deleted. Have you solved it on your end?

Checkpoint file ./outputs/model.pt.ep1 not found.
Checkpoint file ./outputs/model.pt.ep2 not found.
Checkpoint file ./outputs/model.pt.ep3 not found.
Checkpoint file ./outputs/model.pt.ep4 not found.
Checkpoint file ./outputs/model.pt.ep5 not found.
Checkpoint file ./outputs/model.pt.ep6 not found.
Checkpoint file ./outputs/model.pt.ep7 not found.
Checkpoint file ./outputs/model.pt.ep8 not found.
Checkpoint file ./outputs/model.pt.ep9 not found.
Checkpoint file ./outputs/model.pt.ep10 not found.
Error executing job with overrides: ['++model=iic/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch', '++train_data_set_list=data/train.jsonl', '++valid_data_set_list=data/val.jsonl', '++dataset_conf.batch_size=20000', '++dataset_conf.batch_type=token', '++dataset_conf.num_workers=4', '++train_conf.max_epoch=50', '++train_conf.log_interval=1', '++train_conf.resume=false', '++train_conf.validate_interval=2000', '++train_conf.save_checkpoint_interval=2000', '++train_conf.keep_nbest_models=20', '++train_conf.avg_nbest_model=10', '++optim_conf.lr=0.0002', '++output_dir=./outputs']
Traceback (most recent call last):
  File "/mnt/workspace/FunASR/funasr/bin/train.py", line 250, in <module>
    main_hydra()
  File "/opt/conda/lib/python3.10/site-packages/hydra/main.py", line 94, in decorated_main
    _run_hydra(
  File "/opt/conda/lib/python3.10/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/opt/conda/lib/python3.10/site-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/opt/conda/lib/python3.10/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/opt/conda/lib/python3.10/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
  File "/opt/conda/lib/python3.10/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
  File "/opt/conda/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/opt/conda/lib/python3.10/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/opt/conda/lib/python3.10/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/mnt/workspace/FunASR/funasr/bin/train.py", line 51, in main_hydra
    main(**kwargs)
  File "/mnt/workspace/FunASR/funasr/bin/train.py", line 244, in main
    average_checkpoints(trainer.output_dir, trainer.avg_nbest_model)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/mnt/workspace/FunASR/funasr/train_utils/average_nbest_models.py", line 65, in average_checkpoints
    raise RuntimeError("No checkpoints found for averaging.")
RuntimeError: No checkpoints found for averaging.
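For anyone hitting the same error, it may help to list which `model.pt.ep<N>` files actually remain in the output directory before the averaging step runs. This is a rough diagnostic sketch, assuming checkpoints are named `model.pt.ep<N>` as in the logs above; `surviving_checkpoints` is a hypothetical helper, not a FunASR function, and it ignores any other file naming scheme.

```python
import re
from pathlib import Path
from typing import List, Tuple

# File name pattern assumed from the paths in the logs (model.pt.ep0, model.pt.ep1, ...).
EPOCH_RE = re.compile(r"^model\.pt\.ep(\d+)$")


def surviving_checkpoints(output_dir: str) -> List[Tuple[int, Path]]:
    """Return (epoch, path) pairs for model.pt.ep<N> files still on disk, sorted by epoch."""
    found = []
    for path in Path(output_dir).iterdir():
        match = EPOCH_RE.match(path.name)
        if match and path.is_file():
            found.append((int(match.group(1)), path))
    return sorted(found)
```

Calling `surviving_checkpoints("./outputs")` and comparing the result with the paths in the "not found" messages shows whether the epochs requested for averaging are among the files that were kept.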

chenmiaotian Apr 29 '24 05:04