UER-py
Multi-GPU run fails with TypeError: can't pickle _thread.RLock objects
Hi, after updating to the latest code, pre-training with multiple GPUs fails. The error is raised at the mp.spawn step; the message is as follows:
Traceback (most recent call last):
File "pretrain.py", line 133, in
Training on a single GPU (world_size=1) works without any problem.
Here is my training script:
python pretrain.py \
    --dataset_path /data/leo/Projects/uer-py-1/corpora/medical_zh_albert_512.pt \
    --vocab_path /data/leo/Projects/UER-py/models/google_zh_vocab.txt \
    --pretrained_model_path /data/leo/Projects/UER-py/output_pre/r_512_419/r_512_mlm_from_base_100gpus_110w.bin \
    --output_model_path outputs/pretrin/pretrain_r_512_medical_zh_albert_512.bin \
    --config_path /data/leo/Projects/uer-py-1/models/bert_base_config.json \
    --total_steps 5000000 \
    --save_checkpoint_steps 1000 \
    --report_steps 100 \
    --accumulation_steps 1 \
    --batch_size 25 \
    --tokenizer bert \
    --embedding word_pos_seg \
    --encoder transformer \
    --whole_word_masking \
    --target albert \
    --learning_rate 2e-5 \
    --warmup 0.1 \
    --world_size 3 \
    --gpu_ranks 0 1 2 \
    --fp16
Could you please help with this? Thanks!
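For context on where this comes from: torch.multiprocessing.spawn uses the "spawn" start method, so everything passed through `args` has to be pickled before it can be shipped to the child processes, while the single-process path never pickles anything, which would explain why world_size=1 works. Below is a minimal sketch (not UER-py's actual code; the attached logger is only an assumed example of an object that holds a _thread.RLock) that reproduces the same TypeError on Python 3.6:

```python
# Minimal repro sketch (assumptions: Python 3.6, an unpicklable object attached
# to args; this is NOT UER-py's actual code).
# torch.multiprocessing.spawn pickles everything in `args` before handing it to
# the spawned child processes.
import argparse
import logging

import torch.multiprocessing as mp


def worker(proc_id, args):
    print(proc_id, args.batch_size)


if __name__ == "__main__":
    args = argparse.Namespace(batch_size=32)
    # A logger with a handler carries a _thread.RLock, so on Python 3.6 the
    # namespace is no longer picklable:
    args.logger = logging.getLogger("uer")
    args.logger.addHandler(logging.StreamHandler())
    # On Python 3.6 this raises: TypeError: can't pickle _thread.RLock objects
    mp.spawn(worker, nprocs=2, args=(args,))
```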
After rolling the code back (roughly to the October version) the problem went away. Having searched around for related issues, I suspect it has something to do with the logger that was recently added to the code. Please take a look!
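If the new logger really is the culprit, one possible workaround (a rough sketch with hypothetical names, mirroring only the mp.spawn call in trainer.py from the traceback and omitting the model argument for brevity; this is not the actual UER-py fix) is to keep the logger off the args namespace that mp.spawn has to pickle and re-create it inside each worker process:

```python
# Workaround sketch (hypothetical; not the actual UER-py code).
import logging

import torch.multiprocessing as mp


def build_logger():
    # Re-create the logger inside each child process instead of pickling it.
    logger = logging.getLogger("uer")
    if not logger.handlers:
        logger.addHandler(logging.StreamHandler())
    logger.setLevel(logging.INFO)
    return logger


def worker(proc_id, gpu_ranks, args):
    logger = build_logger()
    logger.info("worker %d running on GPU rank %d", proc_id, gpu_ranks[proc_id])
    # ... training loop would go here ...


def train_and_validate(args):
    # Hypothetical: if a logger was attached to args, detach it before spawning
    # so the namespace stays picklable (the model argument is omitted here).
    if hasattr(args, "logger"):
        del args.logger
    mp.spawn(worker, nprocs=args.ranks_num, args=(args.gpu_ranks, args), daemon=False)
```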
OK, thanks!
I rolled the code back to the September version, but multi-GPU training still fails while single-GPU training works fine. Could you help look into the cause?
Error message:
Traceback (most recent call last):
  File "pretrain.py", line 133, in <module>
    main()
  File "pretrain.py", line 129, in main
    trainer.train_and_validate(args)
  File "/home/ssd5/liuweile/UER-py/uer/trainer.py", line 54, in train_and_validate
    mp.spawn(worker, nprocs=args.ranks_num, args=(args.gpu_ranks, args, model), daemon=False)
  File "/home/liuweile/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/liuweile/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 179, in start_processes
    process.start()
  File "/home/ssd5/liuweile/Python-3.6.5/lib/python3.6/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/home/ssd5/liuweile/Python-3.6.5/lib/python3.6/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/home/ssd5/liuweile/Python-3.6.5/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/home/ssd5/liuweile/Python-3.6.5/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/home/ssd5/liuweile/Python-3.6.5/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/home/ssd5/liuweile/Python-3.6.5/lib/python3.6/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
TypeError: can't pickle _thread.RLock objects
Training command:
python3 pretrain.py --dataset_path dataset.pt --vocab_path models/google_zh_vocab.txt \
    --pretrained_model_path models/google_zh_model.bin \
    --output_model_path models/book_review_model.bin \
    --world_size 5 --gpu_ranks 2 4 5 6 7 \
    --total_steps 5000 --save_checkpoint_steps 1000 --batch_size 32 \
    --embedding word_pos_seg --encoder transformer --mask fully_visible --target bert
Upgrading your Python version should fix it.
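That matches my understanding: as far as I know, from Python 3.7 onward logging.Logger objects are pickled by name, so their handlers (and the _thread.RLock inside them) are never serialized, whereas the Python 3.6.5 shown in the traceback tries to pickle the whole object. A quick check sketch:

```python
# Quick check sketch: on newer Python (3.7+, as far as I know) a Logger is
# pickled by name, so the RLock held by its handlers is never touched.
import logging
import pickle

logger = logging.getLogger("uer")
logger.addHandler(logging.StreamHandler())

# Python 3.6: raises TypeError: can't pickle _thread.RLock objects
# Python 3.7+: succeeds, and unpickling returns the same named logger
restored = pickle.loads(pickle.dumps(logger))
print(restored is logger)  # True on 3.7+
```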
Thanks, I'll give it a try.