
Multi-node, multi-GPU training of a Paraformer model fails with an error

Open chenpaopao opened this issue 1 year ago • 3 comments

Notice: In order to resolve issues more efficiently, please raise your issue following the template.

🐛 Bug

Command executed:

torchrun --nnodes 2 --node_rank 0 --nproc_per_node ${gpu_num} --master_addr ******* --master_port 1234 \
../../../funasr/bin/train_ds.py \
++model="${model_name_or_model_dir}" \
++train_data_set_list="${train_data}" \
++valid_data_set_list="${val_data}" \
++dataset="AudioDataset" \
++dataset_conf.index_ds="IndexDSJsonl" \
++dataset_conf.data_split_num=1 \
++dataset_conf.batch_sampler="BatchSampler" \
++dataset_conf.batch_size=6000 \
++dataset_conf.sort_size=1024 \
++dataset_conf.batch_type="token" \
++dataset_conf.num_workers=12 \
++train_conf.max_epoch=200 \
++train_conf.log_interval=100 \
++train_conf.resume=true \
++train_conf.validate_interval=5000 \
++train_conf.save_checkpoint_interval=5000 \
++train_conf.keep_nbest_models=50 \
++train_conf.avg_nbest_model=10 \
++train_conf.use_deepspeed=true \
++train_conf.deepspeed_config=${deepspeed_config} \
++optim_conf.lr=0.0008 \
++output_dir="${output_dir}" &> ${log_file}

After a few hundred training steps, the model fails with the error below:

liangxianchen-asr-2wh-pretrain1-m-0:174396:175238 [0] misc/socket.cc:538 NCCL WARN Net : Connection closed by remote peer liangxianchen-asr-2wh-pretrain1-w-0.liangxianchen-asr-2wh-pretrain1.prdsafe.svc.hbox2-zzzc2-prd.local<48836>
liangxianchen-asr-2wh-pretrain1-m-0:174396:175238 [0] NCCL INFO transport/net_socket.cc:493 -> 6
liangxianchen-asr-2wh-pretrain1-m-0:174396:175238 [0] NCCL INFO include/net.h:35 -> 6
liangxianchen-asr-2wh-pretrain1-m-0:174396:175238 [0] NCCL INFO transport/net.cc:1034 -> 6
liangxianchen-asr-2wh-pretrain1-m-0:174396:175238 [0] NCCL INFO proxy.cc:520 -> 6
liangxianchen-asr-2wh-pretrain1-m-0:174396:175238 [0] NCCL INFO proxy.cc:684 -> 6 [Proxy Thread]
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what(): NCCL error: remote process exited or there was a network error, NCCL version 2.14.3
ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely. Last error: Net : Connection closed by remote peer liangxianchen-asr-2wh-pretrain1-w-0.liangxianchen-asr-2wh-pretrain1.prdsafe.svc.hbox2-zzzc2-prd.local<48836>
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 174397 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 174398 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 174399 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 174400 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 174401 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 174402 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 174403 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 174396) of binary: /mnt/liangxianchen/anaconda3/envs/python38/bin/python
Traceback (most recent call last):
  File "/mnt/liangxianchen/anaconda3/envs/python38/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/mnt/liangxianchen/anaconda3/envs/python38/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/mnt/liangxianchen/anaconda3/envs/python38/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/mnt/liangxianchen/anaconda3/envs/python38/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/mnt/liangxianchen/anaconda3/envs/python38/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/mnt/liangxianchen/anaconda3/envs/python38/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

Also: when I initially set batch_size=7000, the error above appeared at around step 300. With batch_size=6000, it appeared at around step 1000. After I increased the timeout argument in torch.distributed.init_process_group(), the batch_size=6000 run did not fail until around step 5000.
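For reference, a minimal sketch of the timeout workaround described above. The `timeout` argument of `torch.distributed.init_process_group` is standard PyTorch API (default 30 minutes); the two-hour value and the `NCCL_DEBUG` setting are illustrative assumptions, not recommendations. Note that a longer timeout only postpones the failure if a remote rank has actually crashed; the batch-size dependence reported here is consistent with one worker dying (e.g. under memory pressure) and closing its sockets:

```python
import datetime
import os

import torch.distributed as dist

# Optional: verbose NCCL logging helps identify which rank drops out first.
os.environ.setdefault("NCCL_DEBUG", "INFO")

# Raise the collective-operation timeout from the 30-minute default.
# This helps when one rank is merely slow; it cannot fix a rank that
# has already exited, as in the "Connection closed by remote peer" log above.
dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(hours=2),  # illustrative value
)
```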

chenpaopao · Jul 19 '24 01:07

Show me the full logfile.

LauraGPT · Jul 19 '24 02:07

Same problem; here is the full logfile:

tail: log.txt: file truncated
W1105 15:01:33.051000 140432080086848 torch/distributed/run.py:779]
W1105 15:01:33.051000 140432080086848 torch/distributed/run.py:779] *****************************************
W1105 15:01:33.051000 140432080086848 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W1105 15:01:33.051000 140432080086848 torch/distributed/run.py:779] *****************************************
Key Conformer already exists in model_classes, re-register
Key Conformer already exists in model_classes, re-register
Key Conformer already exists in model_classes, re-register
Key Conformer already exists in model_classes, re-register
Key Conformer already exists in model_classes, re-register
Key Conformer already exists in model_classes, re-register
Key Conformer already exists in model_classes, re-register
Key Linear already exists in adaptor_classes, re-register
Key Linear already exists in adaptor_classes, re-register
Key Linear already exists in adaptor_classes, re-register
Key Linear already exists in adaptor_classes, re-register
Key Linear already exists in adaptor_classes, re-register
Key Linear already exists in adaptor_classes, re-register
Key Linear already exists in adaptor_classes, re-register
Key TransformerDecoder already exists in decoder_classes, re-register
Key LightweightConvolutionTransformerDecoder already exists in decoder_classes, re-register
Key TransformerDecoder already exists in decoder_classes, re-register
Key LightweightConvolution2DTransformerDecoder already exists in decoder_classes, re-register
Key DynamicConvolutionTransformerDecoder already exists in decoder_classes, re-register
Key LightweightConvolutionTransformerDecoder already exists in decoder_classes, re-register
Key DynamicConvolution2DTransformerDecoder already exists in decoder_classes, re-register
Key LightweightConvolution2DTransformerDecoder already exists in decoder_classes, re-register
Key TransformerDecoder already exists in decoder_classes, re-register
Key DynamicConvolutionTransformerDecoder already exists in decoder_classes, re-register
Key DynamicConvolution2DTransformerDecoder already exists in decoder_classes, re-register
Key LightweightConvolutionTransformerDecoder already exists in decoder_classes, re-register
Key LightweightConvolution2DTransformerDecoder already exists in decoder_classes, re-register
Key TransformerDecoder already exists in decoder_classes, re-register
Key DynamicConvolutionTransformerDecoder already exists in decoder_classes, re-register
Key TransformerDecoder already exists in decoder_classes, re-register
Key DynamicConvolution2DTransformerDecoder already exists in decoder_classes, re-register
Key LightweightConvolutionTransformerDecoder already exists in decoder_classes, re-register
Key LightweightConvolutionTransformerDecoder already exists in decoder_classes, re-register
Key LightweightConvolution2DTransformerDecoder already exists in decoder_classes, re-register
Key LightweightConvolution2DTransformerDecoder already exists in decoder_classes, re-register
Key DynamicConvolutionTransformerDecoder already exists in decoder_classes, re-register
Key TransformerDecoder already exists in decoder_classes, re-register
Key TransformerDecoder already exists in decoder_classes, re-register
Key DynamicConvolutionTransformerDecoder already exists in decoder_classes, re-register
Key DynamicConvolution2DTransformerDecoder already exists in decoder_classes, re-register
Key DynamicConvolution2DTransformerDecoder already exists in decoder_classes, re-register
Key LightweightConvolutionTransformerDecoder already exists in decoder_classes, re-register
Key LightweightConvolutionTransformerDecoder already exists in decoder_classes, re-register
Key LightweightConvolution2DTransformerDecoder already exists in decoder_classes, re-register
Key LightweightConvolution2DTransformerDecoder already exists in decoder_classes, re-register
Key DynamicConvolutionTransformerDecoder already exists in decoder_classes, re-register
Key DynamicConvolutionTransformerDecoder already exists in decoder_classes, re-register
Key DynamicConvolution2DTransformerDecoder already exists in decoder_classes, re-register
Key DynamicConvolution2DTransformerDecoder already exists in decoder_classes, re-register
[2024-11-05 15:01:38,324][root][INFO] - download models from model hub: ms
[2024-11-05 15:01:38,357][root][INFO] - use_ddp: True, use_fsdp: False
[2024-11-05 15:01:38,357][root][INFO] - download models from model hub: ms
[2024-11-05 15:01:38,372][root][INFO] - download models from model hub: ms
[2024-11-05 15:01:38,388][root][INFO] - use_ddp: True, use_fsdp: False
[2024-11-05 15:01:38,399][root][INFO] - download models from model hub: ms
[2024-11-05 15:01:38,403][root][INFO] - use_ddp: True, use_fsdp: False
[2024-11-05 15:01:38,420][root][INFO] - download models from model hub: ms
[2024-11-05 15:01:38,427][root][INFO] - download models from model hub: ms
[2024-11-05 15:01:38,429][root][INFO] - use_ddp: True, use_fsdp: False
[2024-11-05 15:01:38,433][root][INFO] - download models from model hub: ms
[2024-11-05 15:01:38,458][root][INFO] - use_ddp: True, use_fsdp: False

tables:

----------- ** dataset_classes ** --------------
| register name | class name | class location |
| AudioDataset | AudioDataset | funasr/datasets/audio_datasets/datasets.py:9 |
| AudioDatasetHotword | AudioDatasetHotword | funasr/datasets/audio_datasets/datasets.py:121 |
| AudioLLMARDataset | AudioLLMARDataset | funasr/datasets/llm_datasets/datasets.py:302 |
| AudioLLMDataset | AudioLLMDataset | funasr/datasets/llm_datasets/datasets.py:167 |
| AudioLLMNARDataset | AudioLLMNARDataset | funasr/datasets/llm_datasets/datasets.py:8 |
| AudioLLMQwenAudioDataset | AudioLLMQwenAudioDataset | funasr/datasets/llm_datasets_qwenaudio/datasets.py:8 |
| AudioLLMVicunaDataset | AudioLLMVicunaDataset | funasr/datasets/llm_datasets_vicuna/datasets.py:8 |
| KwsMTDataset | KwsMTDataset | funasr/datasets/kws_datasets/datasets.py:9 |
| OpenAIDataset | OpenAIDataset | funasr/datasets/openai_datasets/datasets.py:10 |
| OpenAIDatasetMultiTurn | OpenAIDatasetMultiTurn | funasr/datasets/openai_datasets/datasets.py:232 |
| SenseVoiceCTCDataset | SenseVoiceCTCDataset | funasr/datasets/sense_voice_datasets/datasets.py:234 |
| SenseVoiceDataset | SenseVoiceDataset | funasr/datasets/sense_voice_datasets/datasets.py:11 |
----------- ** batch_sampler_classes ** --------------
| register name | class name | class location |
| BatchSampler | CustomDistributedBatchSampler_fn | funasr/datasets/audio_datasets/samplers.py:14 |
| CustomDistributedBatchSampler | CustomDistributedBatchSampler_fn | funasr/datasets/audio_datasets/samplers.py:14 |
| CustomDistributedDynamicBatchSampler | CustomDistributedBatchSampler_fn | funasr/datasets/audio_datasets/samplers.py:14 |
| DynamicBatchLocalShuffleSampler | CustomDistributedBatchSampler_fn | funasr/datasets/audio_datasets/samplers.py:14 |
| EspnetStyleBatchSampler | EspnetStyleBatchSampler_fn | funasr/datasets/audio_datasets/espnet_samplers.py:13 |
| RankFullLocalShuffleBatchSampler | CustomDistributedBatchSampler_fn | funasr/datasets/audio_datasets/samplers.py:14 |
| RankFullLocalShuffleDynamicBatchSampler | CustomDistributedBatchSampler_fn | funasr/datasets/audio_datasets/samplers.py:14 |
----------- ** index_ds_classes ** --------------
| register name | class name | class location |
| IndexDSJsonl | IndexDSJsonlRankFull | funasr/datasets/audio_datasets/index_ds.py:13 |
| IndexDSJsonlRankFull | IndexDSJsonlRankFull | funasr/datasets/audio_datasets/index_ds.py:13 |
| IndexDSJsonlRankSplit | IndexDSJsonlRankFull | funasr/datasets/audio_datasets/index_ds.py:13 |
| OpenAIIndexDSJsonl | OpenAIIndexDSJsonl | funasr/datasets/openai_datasets/index_ds.py:13 |
----------- ** preprocessor_classes ** --------------
| register name | class name | class location |
| SpeechPreprocessSpeedPerturb | SpeechPreprocessSpeedPerturb | funasr/datasets/audio_datasets/preprocessor.py:18 |
| TextPreprocessRemovePunctuation | TextPreprocessRemovePunctuation | funasr/datasets/llm_datasets/preprocessor.py:19 |
| TextPreprocessSegDict | TextPreprocessSegDict | funasr/datasets/audio_datasets/preprocessor.py:39 |
----------- ** dataloader_classes ** --------------
| register name | class name | class location |
| DataloaderIterable | DataloaderIterable | funasr/datasets/dataloader_entry.py:120 |
| DataloaderMapStyle | DataloaderMapStyle | funasr/datasets/dataloader_entry.py:47 |
----------- ** frontend_classes ** --------------
| register name | class name | class location |
| DefaultFrontend | DefaultFrontend | funasr/frontends/default.py:22 |
| EspnetFrontend | DefaultFrontend | funasr/frontends/default.py:22 |
| WavFrontend | WavFrontend | funasr/frontends/wav_frontend.py:78 |
| WavFrontendOnline | WavFrontendOnline | funasr/frontends/wav_frontend.py:212 |
| WhisperFrontend | WhisperFrontend | funasr/frontends/whisper_frontend.py:10 |
| wav_frontend | WavFrontend | funasr/frontends/wav_frontend.py:78 |
----------- ** joint_network_classes ** --------------
| register name | class name | class location |
| joint_network | JointNetwork | funasr/models/transducer/joint_network.py:12 |
----------- ** model_classes ** --------------
| register name | class name | class location |
| BAT | BAT | funasr/models/bat/model.py:35 |
| BiCifParaformer | BiCifParaformer | funasr/models/bicif_paraformer/model.py:37 |
| Branchformer | Branchformer | funasr/models/branchformer/model.py:7 |
| CAMPPlus | CAMPPlus | funasr/models/campplus/model.py:37 |
| CTC | Transformer | funasr/models/ctc/model.py:17 |
| CTTransformer | CTTransformer | funasr/models/ct_transformer/model.py:34 |
| CTTransformerStreaming | CTTransformerStreaming | funasr/models/ct_transformer_streaming/model.py:27 |
| Conformer | Conformer | funasr/models/conformer_rwkv/model.py:9 |
| ContextualParaformer | ContextualParaformer | funasr/models/contextual_paraformer/model.py:40 |
| EBranchformer | EBranchformer | funasr/models/e_branchformer/model.py:7 |
| Emotion2vec | Emotion2vec | funasr/models/emotion2vec/model.py:34 |
| FsmnKWS | FsmnKWS | funasr/models/fsmn_kws/model.py:26 |
| FsmnKWSConvert | FsmnKWSConvert | funasr/models/fsmn_kws/model.py:240 |
| FsmnKWSMT | FsmnKWSMT | funasr/models/fsmn_kws_mt/model.py:26 |
| FsmnKWSMTConvert | FsmnKWSMTConvert | funasr/models/fsmn_kws_mt/model.py:302 |
| FsmnVADStreaming | FsmnVADStreaming | funasr/models/fsmn_vad_streaming/model.py:280 |
| LCBNet | LCBNet | funasr/models/lcbnet/model.py:27 |
| LLMASR | LLMASR | funasr/models/llm_asr/model.py:27 |
| LLMASR2 | LLMASR2 | funasr/models/llm_asr/model.py:348 |
| LLMASR3 | LLMASR3 | funasr/models/llm_asr/model.py:829 |
| LLMASR4 | LLMASR4 | funasr/models/llm_asr/model.py:847 |
| LLMASRNAR | LLMASRNAR | funasr/models/llm_asr_nar/model.py:25 |
| LLMASRNARPrompt | LLMASRNARPrompt | funasr/models/llm_asr_nar/model.py:370 |
| MonotonicAligner | MonotonicAligner | funasr/models/monotonic_aligner/model.py:24 |
| OpenAIWhisperLIDModel | OpenAIWhisperLIDModel | funasr/models/whisper_lid/model.py:457 |
| OpenAIWhisperModel | OpenAIWhisperModel | funasr/models/whisper_lid/model.py:21 |
| Paraformer | Paraformer | funasr/models/paraformer/model.py:29 |
| ParaformerStreaming | ParaformerStreaming | funasr/models/paraformer_streaming/model.py:37 |
| Qwen-Audio | QwenAudioWarp | funasr/models/qwen_audio/model.py:17 |
| Qwen-Audio-Chat | QwenAudioChatWarp | funasr/models/qwen_audio/model.py:82 |
| Qwen/Qwen-Audio | QwenAudioWarp | funasr/models/qwen_audio/model.py:17 |
| Qwen/Qwen-Audio-Chat | QwenAudioChatWarp | funasr/models/qwen_audio/model.py:82 |
| Qwen/QwenAudio | QwenAudioWarp | funasr/models/qwen_audio/model.py:17 |
| Qwen/QwenAudioChat | QwenAudioChatWarp | funasr/models/qwen_audio/model.py:82 |
| QwenAudio | QwenAudioWarp | funasr/models/qwen_audio/model.py:17 |
| QwenAudioChat | QwenAudioChatWarp | funasr/models/qwen_audio/model.py:82 |
| QwenAudioChatWarp | QwenAudioChatWarp | funasr/models/qwen_audio/model.py:82 |
| QwenAudioWarp | QwenAudioWarp | funasr/models/qwen_audio/model.py:17 |
| SANM | SANM | funasr/models/sanm/model.py:14 |
| SCAMA | SCAMA | funasr/models/scama/model.py:39 |
| SanmKWS | SanmKWS | funasr/models/sanm_kws/model.py:27 |
| SanmKWSStreaming | SanmKWSStreaming | funasr/models/sanm_kws_streaming/model.py:37 |
| SeacoParaformer | SeacoParaformer | funasr/models/seaco_paraformer/model.py:43 |
| SenseVoiceSmall | SenseVoiceSmall | funasr/models/sense_voice/model.py:587 |
| Transducer | Transducer | funasr/models/transducer/model.py:34 |
| Transformer | Transformer | funasr/models/transformer/model.py:22 |
| UniASR | UniASR | funasr/models/uniasr/model.py:26 |
| Whisper-base | WhisperWarp | funasr/models/whisper/model.py:20 |
| Whisper-base.en | WhisperWarp | funasr/models/whisper/model.py:20 |
| Whisper-large-v1 | WhisperWarp | funasr/models/whisper/model.py:20 |
| Whisper-large-v2 | WhisperWarp | funasr/models/whisper/model.py:20 |
| Whisper-large-v3 | WhisperWarp | funasr/models/whisper/model.py:20 |
| Whisper-large-v3-turbo | WhisperWarp | funasr/models/whisper/model.py:20 |
| Whisper-medium | WhisperWarp | funasr/models/whisper/model.py:20 |
| Whisper-medium.en | WhisperWarp | funasr/models/whisper/model.py:20 |
| Whisper-small | WhisperWarp | funasr/models/whisper/model.py:20 |
| Whisper-small.en | WhisperWarp | funasr/models/whisper/model.py:20 |
| Whisper-tiny | WhisperWarp | funasr/models/whisper/model.py:20 |
| Whisper-tiny.en | WhisperWarp | funasr/models/whisper/model.py:20 |
| WhisperWarp | WhisperWarp | funasr/models/whisper/model.py:20 |
----------- ** predictor_classes ** --------------
| register name | class name | class location |
| CifPredictor | CifPredictor | funasr/models/paraformer/cif_predictor.py:16 |
| CifPredictorV2 | CifPredictorV2 | funasr/models/paraformer/cif_predictor.py:172 |
| CifPredictorV2Export | CifPredictorV2Export | funasr/models/paraformer/cif_predictor.py:430 |
| CifPredictorV3 | CifPredictorV3 | funasr/models/bicif_paraformer/cif_predictor.py:96 |
| CifPredictorV3Export | CifPredictorV3Export | funasr/models/bicif_paraformer/cif_predictor.py:374 |
----------- ** encoder_classes ** --------------
| register name | class name | class location |
| BranchformerEncoder | BranchformerEncoder | funasr/models/branchformer/encoder.py:278 |
| ChunkConformerEncoder | ConformerChunkEncoder | funasr/models/conformer/encoder.py:884 |
| ConformerEncoder | ConformerEncoder | funasr/models/conformer/encoder.py:286 |
| ConvBiasPredictor | ConvPredictor | funasr/models/lcbnet/encoder.py:357 |
| EBranchformerEncoder | EBranchformerEncoder | funasr/models/e_branchformer/encoder.py:179 |
| FSMN | FSMN | funasr/models/fsmn_vad_streaming/encoder.py:199 |
| FSMNConvert | FSMNConvert | funasr/models/fsmn_kws/encoder.py:422 |
| FSMNExport | FSMNExport | funasr/models/fsmn_vad_streaming/encoder.py:274 |
| FSMNMT | FSMNMT | funasr/models/fsmn_kws_mt/encoder.py:27 |
| FSMNMTConvert | FSMNMTConvert | funasr/models/fsmn_kws_mt/encoder.py:106 |
| FusionSANEncoder | SelfSrcAttention | funasr/models/lcbnet/encoder.py:228 |
| OpenAIWhisperEncoderWarp | OpenAIWhisperEncoderWarp | funasr/models/whisper_lid/encoder.py:17 |
| QwenAudioEncoder | QwenAudioEncoder | funasr/models/qwen_audio/audio.py:333 |
| RWKVEncoder | RWKVEncoder | funasr/models/rwkv_bat/rwkv_encoder.py:16 |
| SANMEncoder | SANMEncoder | funasr/models/sanm/encoder.py:187 |
| SANMEncoderChunkOpt | SANMEncoderChunkOpt | funasr/models/scama/encoder.py:187 |
| SANMEncoderChunkOptExport | SANMEncoderExport | funasr/models/sanm/encoder.py:516 |
| SANMEncoderExport | SANMEncoderExport | funasr/models/sanm/encoder.py:516 |
| SANMVadEncoder | SANMVadEncoder | funasr/models/ct_transformer_streaming/encoder.py:174 |
| SANMVadEncoderExport | SANMVadEncoderExport | funasr/models/ct_transformer_streaming/encoder.py:436 |
| SenseVoiceEncoderSmall | SenseVoiceEncoderSmall | funasr/models/sense_voice/model.py:443 |
| TransformerEncoder | TransformerEncoder | funasr/models/transformer/encoder.py:139 |
| TransformerTextEncoder | TransformerTextEncoder | funasr/models/lcbnet/encoder.py:130 |
----------- ** decoder_classes ** --------------
| register name | class name | class location |
| ContextualParaformerDecoder | ContextualParaformerDecoder | funasr/models/contextual_paraformer/decoder.py:114 |
| ContextualParaformerDecoderExport | ContextualParaformerDecoderExport | funasr/models/contextual_paraformer/decoder.py:315 |
| DynamicConvolution2DTransformerDecoder | DynamicConvolution2DTransformerDecoder | funasr/models/sa_asr/transformer_decoder.py:674 |
| DynamicConvolutionTransformerDecoder | DynamicConvolutionTransformerDecoder | funasr/models/sa_asr/transformer_decoder.py:614 |
| FsmnDecoder | FsmnDecoder | funasr/models/sanm/decoder.py:203 |
| FsmnDecoderSCAMAOpt | FsmnDecoderSCAMAOpt | funasr/models/scama/decoder.py:203 |
| LightweightConvolution2DTransformerDecoder | LightweightConvolution2DTransformerDecoder | funasr/models/sa_asr/transformer_decoder.py:554 |
| LightweightConvolutionTransformerDecoder | LightweightConvolutionTransformerDecoder | funasr/models/sa_asr/transformer_decoder.py:494 |
| OpenAIWhisperDecoderWarp | OpenAIWhisperDecoderWarp | funasr/models/whisper_lid/decoder.py:15 |
| ParaformerDecoderSAN | ParaformerDecoderSAN | funasr/models/sa_asr/transformer_decoder.py:388 |
| ParaformerDecoderSANExport | ParaformerDecoderSANExport | funasr/models/paraformer/decoder.py:1087 |
| ParaformerSANDecoder | ParaformerSANDecoder | funasr/models/paraformer/decoder.py:981 |
| ParaformerSANMDecoder | ParaformerSANMDecoder | funasr/models/paraformer/decoder.py:224 |
| ParaformerSANMDecoderExport | ParaformerSANMDecoderExport | funasr/models/paraformer/decoder.py:640 |
| ParaformerSANMDecoderOnlineExport | ParaformerSANMDecoderOnlineExport | funasr/models/paraformer/decoder.py:829 |
| TransformerDecoder | TransformerDecoder | funasr/models/sa_asr/transformer_decoder.py:343 |
| TransformerRWKVDecoder | TransformerRWKVDecoder | funasr/models/conformer_rwkv/decoder.py:378 |
| rnn_decoder | RNNDecoder | funasr/models/transducer/rnn_decoder.py:85 |
| rnnt_decoder | RNNTDecoder | funasr/models/transducer/rnnt_decoder.py:14 |
----------- ** adaptor_classes ** --------------
| register name | class name | class location |
| Linear | Linear | funasr/models/llm_asr_nar/adaptor.py:7 |
| QFormer | EncoderProjectorQFormer | funasr/models/llm_asr/adaptor.py:35 |
| Transformer | Transformer | funasr/models/llm_asr/adaptor.py:92 |
----------- ** normalize_classes ** --------------
| register name | class name | class location |
| GlobalMVN | GlobalMVN | funasr/models/normalize/global_mvn.py:12 |
| UtteranceMVN | UtteranceMVN | funasr/models/normalize/utterance_mvn.py:9 |
----------- ** specaug_classes ** --------------
| register name | class name | class location |
| SpecAug | SpecAug | funasr/models/specaug/specaug.py:16 |
| SpecAugLFR | SpecAugLFR | funasr/models/specaug/specaug.py:105 |
----------- ** lid_predictor_classes ** --------------
| register name | class name | class location |
| LidPredictor | LidPredictor | funasr/models/whisper_lid/lid_predictor.py:9 |
----------- ** tokenizer_classes ** --------------
| register name | class name | class location |
| CharTokenizer | CharTokenizer | funasr/tokenizer/char_tokenizer.py:12 |
| HuggingfaceTokenizer | HuggingfaceTokenizer | funasr/tokenizer/hf_tokenizer.py:4 |
| SenseVoiceTokenizer | SenseVoiceTokenizer | funasr/tokenizer/whisper_tokenizer.py:25 |
| SentencepiecesTokenizer | SentencepiecesTokenizer | funasr/tokenizer/sentencepiece_tokenizer.py:12 |
| WhisperTokenizer | WhisperTokenizer | funasr/tokenizer/whisper_tokenizer.py:4 |

[2024-11-05 15:01:38,459][root][INFO] - use_ddp: True, use_fsdp: False
[2024-11-05 15:01:38,464][root][INFO] - use_ddp: True, use_fsdp: False
[2024-11-05 15:01:38,759][root][INFO] - Build model, frontend, tokenizer
funasr version: 1.1.14.
Check update of funasr, and it would cost few times. You may disable it by set disable_update=True in AutoModel
[2024-11-05 15:01:38,885][root][INFO] - Build model, frontend, tokenizer
funasr version: 1.1.14.
Check update of funasr, and it would cost few times. You may disable it by set disable_update=True in AutoModel
[2024-11-05 15:01:38,958][root][INFO] - Build model, frontend, tokenizer
funasr version: 1.1.14.
Check update of funasr, and it would cost few times. You may disable it by set disable_update=True in AutoModel
[2024-11-05 15:01:39,074][root][INFO] - Build model, frontend, tokenizer
funasr version: 1.1.14.
Check update of funasr, and it would cost few times. You may disable it by set disable_update=True in AutoModel
[2024-11-05 15:01:39,091][root][INFO] - Build model, frontend, tokenizer
funasr version: 1.1.14.
Check update of funasr, and it would cost few times. You may disable it by set disable_update=True in AutoModel
[2024-11-05 15:01:39,123][root][INFO] - Build model, frontend, tokenizer
funasr version: 1.1.14.
Check update of funasr, and it would cost few times. You may disable it by set disable_update=True in AutoModel
[2024-11-05 15:01:39,148][root][INFO] - Build model, frontend, tokenizer
funasr version: 1.1.14.
Check update of funasr, and it would cost few times. You may disable it by set disable_update=True in AutoModel
You are using the latest version of funasr-1.1.14
You are using the latest version of funasr-1.1.14
You are using the latest version of funasr-1.1.14
You are using the latest version of funasr-1.1.14
You are using the latest version of funasr-1.1.14
You are using the latest version of funasr-1.1.14
[2024-11-05 15:01:41,831][root][INFO] - Loading pretrained params from /nvme2/chaoshan/FunASR/examples/industrial_data_pretraining/paraformer/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/model.pt
[2024-11-05 15:01:41,837][root][INFO] - ckpt: /nvme2/chaoshan/FunASR/examples/industrial_data_pretraining/paraformer/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/model.pt
/nvme2/chaoshan/FunASR/funasr/train_utils/load_pretrained_model.py:39: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  ori_state = torch.load(path, map_location=map_location)
Error executing job with overrides: ['++model=/nvme2/chaoshan/FunASR/examples/industrial_data_pretraining/paraformer/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch', '++train_data_set_list=/nvme2/chaoshanData/1//train_new.jsonl', '++valid_data_set_list=/nvme2/chaoshanData/1//val.jsonl', '++dataset=AudioDataset', '++dataset_conf.index_ds=IndexDSJsonl', '++dataset_conf.data_split_num=1', '++dataset_conf.batch_sampler=BatchSampler', '++dataset_conf.batch_size=2000', '++dataset_conf.sort_size=1024', '++dataset_conf.batch_type=token', '++dataset_conf.num_workers=4', '++train_conf.max_epoch=50', '++train_conf.log_interval=1', '++train_conf.resume=true', '++train_conf.validate_interval=2000', '++train_conf.save_checkpoint_interval=2000', '++train_conf.keep_nbest_models=20', '++train_conf.avg_nbest_model=10', '++train_conf.use_deepspeed=false', '++train_conf.deepspeed_config=/nvme2/chaoshan/FunASR/examples/industrial_data_pretraining/paraformer/../../ds_stage1.json', '++optim_conf.lr=0.0002', '++output_dir=./output']
[rank0]: Traceback (most recent call last):
[rank0]:   File "/nvme2/chaoshan/FunASR/funasr/bin/train_ds.py", line 225, in <module>
[rank0]:     main_hydra()
[rank0]:   File "/home/tx/anaconda3/envs/mt/lib/python3.10/site-packages/hydra/main.py", line 94, in decorated_main
[rank0]:     _run_hydra(
[rank0]:   File "/home/tx/anaconda3/envs/mt/lib/python3.10/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
[rank0]:     _run_app(
[rank0]:   File "/home/tx/anaconda3/envs/mt/lib/python3.10/site-packages/hydra/_internal/utils.py", line 457, in _run_app
[rank0]:     run_and_report(
[rank0]:   File "/home/tx/anaconda3/envs/mt/lib/python3.10/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
[rank0]:     raise ex
[rank0]:   File "/home/tx/anaconda3/envs/mt/lib/python3.10/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
[rank0]:     return func()
[rank0]:   File "/home/tx/anaconda3/envs/mt/lib/python3.10/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
[rank0]:     lambda: hydra.run(
[rank0]:   File "/home/tx/anaconda3/envs/mt/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 132, in run
[rank0]:     _ = ret.return_value
[rank0]:   File "/home/tx/anaconda3/envs/mt/lib/python3.10/site-packages/hydra/core/utils.py", line 260, in return_value
[rank0]:     raise self._return_value
[rank0]:   File "/home/tx/anaconda3/envs/mt/lib/python3.10/site-packages/hydra/core/utils.py", line 186, in run_job
[rank0]:     ret.return_value = task_function(task_cfg)
[rank0]:   File "/nvme2/chaoshan/FunASR/funasr/bin/train_ds.py", line 56, in main_hydra
[rank0]:     main(**kwargs)
[rank0]:   File "/nvme2/chaoshan/FunASR/funasr/bin/train_ds.py", line 95, in main
[rank0]:     model = AutoModel(**kwargs)
[rank0]:   File "/nvme2/chaoshan/FunASR/funasr/auto/auto_model.py", line 125, in __init__
[rank0]:     model, kwargs = self.build_model(**kwargs)
[rank0]:   File "/nvme2/chaoshan/FunASR/funasr/auto/auto_model.py", line 270, in build_model
[rank0]:     load_pretrained_model(
[rank0]:   File "/nvme2/chaoshan/FunASR/funasr/train_utils/load_pretrained_model.py", line 39, in load_pretrained_model
[rank0]:     ori_state = torch.load(path, map_location=map_location)
[rank0]:   File "/home/tx/anaconda3/envs/mt/lib/python3.10/site-packages/torch/serialization.py", line 1114, in load
[rank0]:     return _legacy_load(
[rank0]:   File "/home/tx/anaconda3/envs/mt/lib/python3.10/site-packages/torch/serialization.py", line 1338, in _legacy_load
[rank0]:     magic_number = pickle_module.load(f, **pickle_load_args)
[rank0]: _pickle.UnpicklingError: invalid load key, 'v'.
[rank0]:[W1105 15:01:42.300788901 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
[2024-11-05 15:01:42,408][root][INFO] - Loading pretrained params from /nvme2/chaoshan/FunASR/examples/industrial_data_pretraining/paraformer/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/model.pt
[2024-11-05 15:01:42,414][root][INFO] - ckpt: /nvme2/chaoshan/FunASR/examples/industrial_data_pretraining/paraformer/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/model.pt
/nvme2/chaoshan/FunASR/funasr/train_utils/load_pretrained_model.py:39: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  ori_state = torch.load(path, map_location=map_location)
Error executing job with overrides: ['++model=/nvme2/chaoshan/FunASR/examples/industrial_data_pretraining/paraformer/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch', '++train_data_set_list=/nvme2/chaoshanData/1//train_new.jsonl', '++valid_data_set_list=/nvme2/chaoshanData/1//val.jsonl', '++dataset=AudioDataset', '++dataset_conf.index_ds=IndexDSJsonl', '++dataset_conf.data_split_num=1', '++dataset_conf.batch_sampler=BatchSampler', '++dataset_conf.batch_size=2000', '++dataset_conf.sort_size=1024', '++dataset_conf.batch_type=token', '++dataset_conf.num_workers=4', '++train_conf.max_epoch=50', '++train_conf.log_interval=1', '++train_conf.resume=true', '++train_conf.validate_interval=2000', '++train_conf.save_checkpoint_interval=2000', '++train_conf.keep_nbest_models=20', '++train_conf.avg_nbest_model=10', '++train_conf.use_deepspeed=false', '++train_conf.deepspeed_config=/nvme2/chaoshan/FunASR/examples/industrial_data_pretraining/paraformer/../../ds_stage1.json', '++optim_conf.lr=0.0002', '++output_dir=./output']
[2024-11-05 15:01:42,415][root][INFO] - Loading pretrained params from /nvme2/chaoshan/FunASR/examples/industrial_data_pretraining/paraformer/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/model.pt
[rank6]: Traceback (most recent call last):
[rank6]:   File "/nvme2/chaoshan/FunASR/funasr/bin/train_ds.py", line 225, in <module>
[rank6]:     main_hydra()
[rank6]:   File "/home/tx/anaconda3/envs/mt/lib/python3.10/site-packages/hydra/main.py", line 94, in decorated_main
[rank6]:     _run_hydra(
[rank6]:   File "/home/tx/anaconda3/envs/mt/lib/python3.10/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
[rank6]:     _run_app(
[rank6]:   File "/home/tx/anaconda3/envs/mt/lib/python3.10/site-packages/hydra/_internal/utils.py", line 457, in _run_app
[rank6]:     run_and_report(
[rank6]:   File "/home/tx/anaconda3/envs/mt/lib/python3.10/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
[rank6]:     raise ex
[rank6]:   File "/home/tx/anaconda3/envs/mt/lib/python3.10/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
[rank6]:     return func()
[rank6]:   File "/home/tx/anaconda3/envs/mt/lib/python3.10/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
[rank6]:     lambda: hydra.run(
[rank6]:   File "/home/tx/anaconda3/envs/mt/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 132, in run
[rank6]:     _ = ret.return_value
[rank6]:   File "/home/tx/anaconda3/envs/mt/lib/python3.10/site-packages/hydra/core/utils.py", line 260, in return_value
[rank6]:     raise self._return_value
[rank6]:   File "/home/tx/anaconda3/envs/mt/lib/python3.10/site-packages/hydra/core/utils.py", line 186, in run_job
[rank6]:     ret.return_value = task_function(task_cfg)
[rank6]:   File "/nvme2/chaoshan/FunASR/funasr/bin/train_ds.py", line 56, in main_hydra
[rank6]:     main(**kwargs)
[rank6]:   File "/nvme2/chaoshan/FunASR/funasr/bin/train_ds.py", line 95, in main
[rank6]:     model = AutoModel(**kwargs)
[rank6]:   File "/nvme2/chaoshan/FunASR/funasr/auto/auto_model.py", line 125, in __init__
[rank6]:     model, kwargs = self.build_model(**kwargs)
[rank6]:   File "/nvme2/chaoshan/FunASR/funasr/auto/auto_model.py", line 270, in build_model
[rank6]:     load_pretrained_model(
[rank6]:   File "/nvme2/chaoshan/FunASR/funasr/train_utils/load_pretrained_model.py", line 39, in load_pretrained_model
[rank6]:     ori_state = torch.load(path, map_location=map_location)
[rank6]:   File "/home/tx/anaconda3/envs/mt/lib/python3.10/site-packages/torch/serialization.py", line 1114, in load
[rank6]:     return _legacy_load(
[rank6]:   File "/home/tx/anaconda3/envs/mt/lib/python3.10/site-packages/torch/serialization.py", line 1338, in _legacy_load
[rank6]:     magic_number = pickle_module.load(f, **pickle_load_args)
[rank6]: _pickle.UnpicklingError: invalid load key, 'v'.
[2024-11-05 15:01:42,421][root][INFO] - ckpt: /nvme2/chaoshan/FunASR/examples/industrial_data_pretraining/paraformer/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/model.pt
/nvme2/chaoshan/FunASR/funasr/train_utils/load_pretrained_model.py:39: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  ori_state = torch.load(path, map_location=map_location)
Error executing job with overrides: ['++model=/nvme2/chaoshan/FunASR/examples/industrial_data_pretraining/paraformer/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch', '++train_data_set_list=/nvme2/chaoshanData/1//train_new.jsonl', '++valid_data_set_list=/nvme2/chaoshanData/1//val.jsonl', '++dataset=AudioDataset', '++dataset_conf.index_ds=IndexDSJsonl', '++dataset_conf.data_split_num=1', '++dataset_conf.batch_sampler=BatchSampler', '++dataset_conf.batch_size=2000', '++dataset_conf.sort_size=1024', '++dataset_conf.batch_type=token', '++dataset_conf.num_workers=4', '++train_conf.max_epoch=50', '++train_conf.log_interval=1', '++train_conf.resume=true', '++train_conf.validate_interval=2000', '++train_conf.save_checkpoint_interval=2000', '++train_conf.keep_nbest_models=20', '++train_conf.avg_nbest_model=10', '++train_conf.use_deepspeed=false', '++train_conf.deepspeed_config=/nvme2/chaoshan/FunASR/examples/industrial_data_pretraining/paraformer/../../ds_stage1.json', '++optim_conf.lr=0.0002', '++output_dir=./output']
[rank3]: Traceback (most recent call last):
[rank3]:   File "/nvme2/chaoshan/FunASR/funasr/bin/train_ds.py", line 225, in <module>
[rank3]:     main_hydra()
[rank3]:   File "/home/tx/anaconda3/envs/mt/lib/python3.10/site-packages/hydra/main.py", line 94, in decorated_main
[rank3]:     _run_hydra(
[rank3]:   File "/home/tx/anaconda3/envs/mt/lib/python3.10/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
[rank3]:     _run_app(
[rank3]:   File "/home/tx/anaconda3/envs/mt/lib/python3.10/site-packages/hydra/_internal/utils.py", line 457, in _run_app
[rank3]:     run_and_report(
[rank3]:   File "/home/tx/anaconda3/envs/mt/lib/python3.10/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
[rank3]:     raise ex
[rank3]:   File "/home/tx/anaconda3/envs/mt/lib/python3.10/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
[rank3]:     return func()
[rank3]:   File "/home/tx/anaconda3/envs/mt/lib/python3.10/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
[rank3]:     lambda: hydra.run(
[rank3]:   File "/home/tx/anaconda3/envs/mt/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 132, in run
[rank3]:     _ = ret.return_value
[rank3]:   File "/home/tx/anaconda3/envs/mt/lib/python3.10/site-packages/hydra/core/utils.py", line 260, in return_value
[rank3]:     raise self._return_value
[rank3]:   File "/home/tx/anaconda3/envs/mt/lib/python3.10/site-packages/hydra/core/utils.py", line 186, in run_job
[rank3]:     ret.return_value = task_function(task_cfg)
[rank3]:   File "/nvme2/chaoshan/FunASR/funasr/bin/train_ds.py", line 56, in main_hydra
[rank3]:     main(**kwargs)
[rank3]:   File "/nvme2/chaoshan/FunASR/funasr/bin/train_ds.py", line 95, in main
[rank3]:     model = AutoModel(**kwargs)
[rank3]:   File "/nvme2/chaoshan/FunASR/funasr/auto/auto_model.py", line 125, in __init__
[rank3]:     model, kwargs = self.build_model(**kwargs)
[rank3]:   File "/nvme2/chaoshan/FunASR/funasr/auto/auto_model.py", line 270, in build_model
[rank3]:     load_pretrained_model(
[rank3]:   File "/nvme2/chaoshan/FunASR/funasr/train_utils/load_pretrained_model.py", line 39, in load_pretrained_model
[rank3]:     ori_state = torch.load(path, map_location=map_location)
[rank3]:   File "/home/tx/anaconda3/envs/mt/lib/python3.10/site-packages/torch/serialization.py", line 1114, in load
[rank3]:     return _legacy_load(
[rank3]:   File "/home/tx/anaconda3/envs/mt/lib/python3.10/site-packages/torch/serialization.py", line 1338, in _legacy_load
[rank3]:     magic_number = pickle_module.load(f, **pickle_load_args)
[rank3]: _pickle.UnpicklingError: invalid load key, 'v'.
W1105 15:01:42.683000 140432080086848 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2610403 closing signal SIGTERM
W1105 15:01:42.684000 140432080086848 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2610404 closing signal SIGTERM
W1105 15:01:42.684000 140432080086848 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2610405 closing signal SIGTERM
W1105 15:01:42.684000 140432080086848 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2610406 closing signal SIGTERM
W1105 15:01:42.684000 140432080086848 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2610407 closing signal SIGTERM
W1105 15:01:42.685000 140432080086848 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2610408 closing signal SIGTERM
E1105 15:01:43.013000 140432080086848 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 2610402) of binary: /home/tx/anaconda3/envs/mt/bin/python
Traceback (most recent call last):
  File "/home/tx/anaconda3/envs/mt/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.4.1', 'console_scripts', 'torchrun')())
  File "/home/tx/anaconda3/envs/mt/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/home/tx/anaconda3/envs/mt/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/home/tx/anaconda3/envs/mt/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/tx/anaconda3/envs/mt/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/tx/anaconda3/envs/mt/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/nvme2/chaoshan/FunASR/funasr/bin/train_ds.py FAILED

Failures:
  <NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
  time      : 2024-11-05_15:01:42
  host      : ubuntu
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2610402)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
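Note that `_pickle.UnpicklingError: invalid load key, 'v'` means `torch.load` is reading something that is not a pickled checkpoint at all. A file whose first byte is `v` is very often a git-lfs pointer (text beginning with `version https://git-lfs.github.com/spec/v1`) or an error page saved in place of the real model.pt. A minimal sketch to check, assuming `path` points at the checkpoint from the log above:

```python
# Inspect the first bytes of the supposed checkpoint. A git-lfs pointer or an
# HTML page would explain the "invalid load key, 'v'" raised by torch.load.
path = "model.pt"  # e.g. the speech_paraformer-large_.../model.pt path from the log

with open(path, "rb") as f:
    head = f.read(64)
print(head)

if head.startswith(b"version https://git-lfs"):
    print("This is a git-lfs pointer, not a model: run `git lfs pull` or re-download.")
```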

chengligen · Nov 05 '24 07:11

I'm also getting an error:

$ bash extract_features.sh

Namespace(checkpoint_dir='/home/ajay/speech_emotion/funasr/emotion2vec_base/emotion2vec_base.pt', granularity='utterance', model_dir='/home/ajay/speech_emotion/funasr/emotion2vec_base', source_file='/home/ajay/speech_emotion/funasr/emotion2vec/scripts/test.wav', target_file='/home/ajay/speech_emotion/funasr/emotion2vec/code/emotion2vec/scripts/test.npy')
/home/ajay/speech_emotion/emotion-recognition-using-speech/venv/lib/python3.8/site-packages/fairseq/checkpoint_utils.py:315: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  state = torch.load(f, map_location=torch.device("cpu"))
Traceback (most recent call last):
  File "extract_features.py", line 70, in <module>
    main()
  File "extract_features.py", line 39, in main
    model, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task([checkpoint_dir])
  File "/home/ajay/speech_emotion/emotion-recognition-using-speech/venv/lib/python3.8/site-packages/fairseq/checkpoint_utils.py", line 425, in load_model_ensemble_and_task
    state = load_checkpoint_to_cpu(filename, arg_overrides)
  File "/home/ajay/speech_emotion/emotion-recognition-using-speech/venv/lib/python3.8/site-packages/fairseq/checkpoint_utils.py", line 315, in load_checkpoint_to_cpu
    state = torch.load(f, map_location=torch.device("cpu"))
  File "/home/ajay/speech_emotion/emotion-recognition-using-speech/venv/lib/python3.8/site-packages/torch/serialization.py", line 1114, in load
    return _legacy_load(
  File "/home/ajay/speech_emotion/emotion-recognition-using-speech/venv/lib/python3.8/site-packages/torch/serialization.py", line 1338, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: invalid load key, 'v'.

Can someone let me know how to do inference?? I am a newbie

Thanks in advance Ajay
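This is the same `invalid load key, 'v'` failure as above: emotion2vec_base.pt on disk is likely not a real checkpoint (the byte check sketched earlier applies here too). As for running inference, a minimal sketch through FunASR's AutoModel, following the usage shown in the emotion2vec documentation; the model id and keyword arguments below are taken from that README as assumptions and should be verified against your installed FunASR version:

```python
from funasr import AutoModel

# Assumption: "iic/emotion2vec_base" is the ModelScope model id from the
# emotion2vec README; a local model directory path can be passed instead.
model = AutoModel(model="iic/emotion2vec_base")

# granularity="utterance" yields one embedding per input file;
# extract_embedding=True returns the feature vector with the result.
res = model.generate(
    "test.wav",
    granularity="utterance",
    extract_embedding=True,
)
print(res)
```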

nairajay2k · Feb 08 '25 08:02