
About pre-training

Open laikaiting opened this issue 3 years ago • 7 comments

Is there open-sourced code for pre-training?

laikaiting · Oct 14 '21

Same question.

Dioxideme · Oct 20 '21

Could you release the pre-training code? I am having difficulty masking the pinyin ids together with the original token ids. With best regards, Yunpeng Tai
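(A hedged sketch of the joint-masking concern raised here, assuming PyTorch tensors and ChineseBERT's eight-slot pinyin encoding per token; all function and variable names are illustrative, not from the repo:)

    import torch

    def mask_for_mlm(input_ids, pinyin_ids, mask_token_id, mlm_prob=0.15):
        """Mask token ids for MLM and blank the aligned pinyin ids.

        input_ids: (batch, seq_len); pinyin_ids: (batch, seq_len, 8),
        following ChineseBERT's eight-slot pinyin convention (assumed here).
        Simplified: no 80/10/10 replacement split, no special-token protection.
        """
        labels = input_ids.clone()
        masked = torch.bernoulli(torch.full(input_ids.shape, mlm_prob)).bool()
        labels[~masked] = -100             # compute loss only on masked positions
        input_ids = input_ids.clone()
        input_ids[masked] = mask_token_id  # hide the character itself
        pinyin_ids = pinyin_ids.clone()
        pinyin_ids[masked] = 0             # also hide its pinyin, or it leaks the answer
        return input_ids, pinyin_ids, labels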

sherlcok314159 · Oct 20 '21

We followed Hugging Face's pre-training scripts; you can easily replace the original BertModel with our GlyceBertModel. Here is the link: https://github.com/huggingface/transformers/tree/master/examples/pytorch/language-modeling
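(For readers following along, a minimal sketch of that swap inside run_mlm.py. The import path follows the ChineseBert repo layout but is an assumption, not a verified API:)

    # In run_mlm.py, replace the AutoModelForMaskedLM load with the
    # ChineseBert MLM head. The module path is assumed from the repo layout.
    from models.modeling_glycebert import GlyceBertForMaskedLM

    model = GlyceBertForMaskedLM.from_pretrained(model_args.model_name_or_path)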

zijunsun · Oct 21 '21

We followed Hugging Face's pre-training scripts; you can easily replace the original BertModel with our GlyceBertModel. Here is the link: https://github.com/huggingface/transformers/tree/master/examples/pytorch/language-modeling

I have been trying for a long time, and it is not that easy. Please open-source the pre-training code.

jw8023wh · Nov 30 '21

Is there open-sourced code for pre-training?

Did you manage to get the pre-training process to run? I am trying to pre-train on my own data and have been hitting pitfalls for a long time.

jw8023wh · Dec 01 '21

Our released model was itself produced by this pre-training. What problems do you hit when you run the language-model pre-training above? You can post a screenshot of the error.

zijunsun · Jan 11 '22

Our released model was itself produced by this pre-training. What problems do you hit when you run the language-model pre-training above? You can post a screenshot of the error.

During pre-training with run_mlm.py, should

tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path, **tokenizer_kwargs)

model = AutoModelForMaskedLM.from_pretrained(
    model_args.model_name_or_path,
    from_tf=bool(".ckpt" in model_args.model_name_or_path),
    config=config,
    cache_dir=model_args.cache_dir,
    revision=model_args.model_revision,
    use_auth_token=True if model_args.use_auth_token else None,
)

be replaced, respectively, with

tokenizer = BertMaskDataset(vocab_file, config_path)

model = GlyceBertForMaskedLM.from_pretrained(model_args.model_name_or_path)

? Thank you!
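(A hedged sketch of the direction this swap might take. The GlyceBert forward pass also consumes pinyin_ids, so the data pipeline needs a pinyin-aware dataset, not just a tokenizer swap. The BertDataset import path and constructor below are assumptions based on the repo layout, not a verified recipe; note also that the stock DataCollatorForLanguageModeling only masks input_ids, so the pinyin channel would still need the joint masking sketched earlier in this thread:)

    # All import paths and signatures below are assumptions, not a verified recipe.
    from datasets.bert_dataset import BertDataset                 # assumed repo path
    from models.modeling_glycebert import GlyceBertForMaskedLM    # assumed repo path

    # Assumed: BertDataset reads vocab.txt and the pinyin config from the
    # pretrained checkpoint directory and yields input_ids plus pinyin_ids.
    tokenizer = BertDataset(model_args.model_name_or_path)
    model = GlyceBertForMaskedLM.from_pretrained(model_args.model_name_or_path)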

cxyccc · Aug 03 '22