UER-py

Multi-GPU run fails with TypeError: can't pickle _thread.RLock objects

Open LeoWood opened this issue 3 years ago • 5 comments

Hello, after updating to the latest code, multi-GPU pre-training fails. The error is raised at the mp.spawn step, with the following message:

Traceback (most recent call last):
  File "pretrain.py", line 133, in <module>
    main()
  File "pretrain.py", line 129, in main
    trainer.train_and_validate(args)
  File "/data/leo/Projects/uer-py-1/uer/trainer.py", line 56, in train_and_validate
    mp.spawn(worker, nprocs=args.ranks_num, args=(args.gpu_ranks, args, model), daemon=False)
  File "/home/leo/anaconda3/envs/py36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/leo/anaconda3/envs/py36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 149, in start_processes
    process.start()
  File "/home/leo/anaconda3/envs/py36/lib/python3.6/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/home/leo/anaconda3/envs/py36/lib/python3.6/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/home/leo/anaconda3/envs/py36/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/home/leo/anaconda3/envs/py36/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/home/leo/anaconda3/envs/py36/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/home/leo/anaconda3/envs/py36/lib/python3.6/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
TypeError: can't pickle _thread.RLock objects

Single-GPU training with world_size=1 works without any problem.
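For anyone trying to narrow this down: the traceback shows that the `spawn` start method serializes the process object, including everything passed through `args=(...)`, with `ForkingPickler`. The snippet below is only a minimal sketch of that failure mode using the standard library. The `Namespace` with a `lock` attribute is a hypothetical stand-in for whatever object inside `args` or `model` holds the lock; it is not taken from the UER-py code.

```python
import argparse
import pickle
import threading

# Hypothetical stand-in for the `args` object handed to mp.spawn.
# Assumption: something reachable from it owns a threading.RLock.
args = argparse.Namespace(world_size=3, lock=threading.RLock())


def try_pickle(obj):
    """Return None if obj can be pickled, else the TypeError message."""
    try:
        pickle.dumps(obj)
        return None
    except TypeError as exc:
        return str(exc)


error = try_pickle(args)
print(error)  # the message mentions _thread.RLock; exact wording varies by Python version
```

Running `try_pickle` on each attribute of the real `args` (and on the model) in the parent process is one way to find out which object carries the lock.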

Here is my training script:

python pretrain.py \
  --dataset_path /data/leo/Projects/uer-py-1/corpora/medical_zh_albert_512.pt \
  --vocab_path /data/leo/Projects/UER-py/models/google_zh_vocab.txt \
  --pretrained_model_path /data/leo/Projects/UER-py/output_pre/r_512_419/r_512_mlm_from_base_100gpus_110w.bin \
  --output_model_path outputs/pretrin/pretrain_r_512_medical_zh_albert_512.bin \
  --config_path /data/leo/Projects/uer-py-1/models/bert_base_config.json \
  --total_steps 5000000 \
  --save_checkpoint_steps 1000 \
  --report_steps 100 \
  --accumulation_steps 1 \
  --batch_size 25 \
  --tokenizer bert \
  --embedding word_pos_seg \
  --encoder transformer \
  --whole_word_masking \
  --target albert \
  --learning_rate 2e-5 \
  --warmup 0.1 \
  --world_size 3 \
  --gpu_ranks 0 1 2 \
  --fp16

Could you please help with this? Thanks!

LeoWood avatar Jan 06 '22 03:01 LeoWood

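After rolling back to an earlier version (roughly the October one), the problem went away. From googling related issues, I suspect it has something to do with the logger that was recently added to the code. Please take a look!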
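If the logger really is the culprit, one possible workaround is to avoid sending it across the process boundary at all and to rebuild it inside each worker. The sketch below assumes the logger is stored as an attribute on `args`, which this thread does not confirm. `make_logger`, `picklable_copy`, and this `worker` are hypothetical names for illustration, and the pickle round trip only simulates what `mp.spawn` does with its arguments.

```python
import argparse
import logging
import pickle


def make_logger(name="uer_sketch"):
    """Build (or fetch) a logger with a single stream handler."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:
        logger.addHandler(logging.StreamHandler())
    return logger


def picklable_copy(args):
    """Shallow-copy args and drop the logger before it is sent to a child."""
    clean = argparse.Namespace(**vars(args))
    clean.logger = None
    return clean


def worker(rank, args):
    """Hypothetical worker: re-create the logger inside the child process."""
    args.logger = make_logger()
    args.logger.info("worker %d started", rank)
    return rank


# Parent side: args carries a logger (assumption, see the note above).
args = argparse.Namespace(world_size=3, logger=make_logger())

# Simulate the spawn boundary: only the cleaned copy is pickled.
restored = pickle.loads(pickle.dumps(picklable_copy(args)))
rank = worker(0, restored)
```

The parent keeps its own `args.logger` untouched, since only the shallow copy has the attribute cleared.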

LeoWood avatar Jan 06 '22 07:01 LeoWood

OK, thanks!


ydli-ai avatar Jan 06 '22 09:01 ydli-ai

I rolled back to the September version, but multi-GPU training still fails, while single-GPU training works fine. Could you please help look into the cause? Error message:

Traceback (most recent call last):
  File "pretrain.py", line 133, in <module>
    main()
  File "pretrain.py", line 129, in main
    trainer.train_and_validate(args)
  File "/home/ssd5/liuweile/UER-py/uer/trainer.py", line 54, in train_and_validate
    mp.spawn(worker, nprocs=args.ranks_num, args=(args.gpu_ranks, args, model), daemon=False)
  File "/home/liuweile/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/liuweile/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 179, in start_processes
    process.start()
  File "/home/ssd5/liuweile/Python-3.6.5/lib/python3.6/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/home/ssd5/liuweile/Python-3.6.5/lib/python3.6/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/home/ssd5/liuweile/Python-3.6.5/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/home/ssd5/liuweile/Python-3.6.5/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/home/ssd5/liuweile/Python-3.6.5/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/home/ssd5/liuweile/Python-3.6.5/lib/python3.6/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
TypeError: can't pickle _thread.RLock objects

Training command:

python3 pretrain.py --dataset_path dataset.pt --vocab_path models/google_zh_vocab.txt \
  --pretrained_model_path models/google_zh_model.bin \
  --output_model_path models/book_review_model.bin \
  --world_size 5 --gpu_ranks 2 4 5 6 7 \
  --total_steps 5000 --save_checkpoint_steps 1000 --batch_size 32 \
  --embedding word_pos_seg --encoder transformer --mask fully_visible --target bert

hfutweile avatar Jan 28 '22 07:01 hfutweile

Upgrading your Python version should fix this.
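A likely reason the upgrade helps, given that both tracebacks come from Python 3.6: since Python 3.7, `logging.Logger` objects can be pickled and are restored by name, so their handlers and locks are never serialized. Whether a logger is what actually breaks in UER-py is an assumption here. The sketch below only checks the standard-library behaviour, and the logger name `uer_sketch_logger` is made up.

```python
import logging
import pickle
import sys

logger = logging.getLogger("uer_sketch_logger")
logger.addHandler(logging.StreamHandler())  # the handler owns an RLock

if sys.version_info >= (3, 7):
    # Loggers are pickled by name; unpickling calls logging.getLogger(name).
    restored = pickle.loads(pickle.dumps(logger))
    same_object = restored is logger
else:
    # On 3.6 and earlier this round trip is expected to fail instead.
    same_object = None

print(sys.version_info[:2], same_object)
```

Handlers themselves are still not picklable on newer versions, so passing a handler or a raw lock through `mp.spawn` would keep failing even after the upgrade.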

cdd1993 avatar Feb 05 '22 15:02 cdd1993

Thanks, I'll give it a try.

hfutweile avatar Feb 08 '22 12:02 hfutweile