
Fine-tune on English Corpus

106753004 opened this issue 4 years ago · 15 comments

I used the English BERT (model and tokenizer) to change K-BERT into an English version of K-BERT. However, I got poor scores on the classification tasks. If you have K-BERT code for fine-tuning on an English corpus, could you please release it?

106753004 commented on Apr 21, 2020

For English, please use:

Model: https://share.weiyun.com/5hWivED
Vocab: https://share.weiyun.com/5gBxBYD

However, there is no English KG file suitable for K-BERT. What KG do you use?

autoliuweijie commented on Apr 22, 2020

Hello, @106753004 @autoliuweijie. I also want to implement K-BERT on an English corpus. @autoliuweijie, is the model you mentioned Google's BERT pre-trained on Wikipedia, or have you already done some fine-tuning on it? I used Google's English BERT as the base model and a Wikidata KG (Download link) to fine-tune a new K-BERT for classification tasks, but failed to get good performance.

Actually, I referred to ERNIE and wondered whether K-BERT can incorporate the Wikidata KG and be fine-tuned on datasets from different domains, such as TACRED and Open Entity. I extracted triples from the KG, tokenized them with the BERT tokenizer, and inserted them into the sentence in the same way, then followed the same procedure as in the paper. Is there any problem with my implementation?

yushengsu-thu commented on Apr 22, 2020
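As an aside for other readers, here is a minimal sketch of the injection step described above, assuming a toy entity-to-triples dict and the Hugging Face BertTokenizer; it illustrates the idea rather than the K-BERT repository code, and it leaves out the visible matrix:

```python
# Illustrative sketch only, not K-BERT's add_knowledge_with_vm: append English KG
# triples right after their entity mentions, with K-BERT-style soft positions.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# hypothetical KG: entity surface form -> list of (relation, object) pairs
kg = {"apple": [("is_a", "fruit")]}

def inject_triples(sentence, kg, tokenizer):
    tokens, soft_pos = [], []
    pos = 0
    for word in sentence.split():
        pieces = tokenizer.tokenize(word)
        tokens.extend(pieces)
        soft_pos.extend(range(pos, pos + len(pieces)))
        pos += len(pieces)
        for rel, obj in kg.get(word.lower(), []):
            branch = tokenizer.tokenize(f"{rel} {obj}")
            tokens.extend(branch)
            # the branch numbers its positions from the entity, so they overlap
            # with the soft positions of the rest of the trunk sentence
            soft_pos.extend(range(pos, pos + len(branch)))
    return tokens, soft_pos

print(inject_triples("I ate an apple today", kg, tokenizer))
```

In the paper, the same sentence-tree structure also drives the visible matrix, so branch tokens attend only to the entity they hang off rather than to the whole sentence.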

(screenshot) Hello, it seems that the vocab file cannot be downloaded.

WenTingTseng commented on Apr 25, 2020

Hello, it is difficult to download the models if you don't have a WeChat or QQ account. Can you make them accessible without a login? Thanks.

inezvl commented on Jun 12, 2020

Hello,

Thanks for sharing! The model file can be successfully downloaded. Any chance you could upload the corresponding vocab file?

Thank you so much!

ankechiang commented on Jun 27, 2020

> Thanks for sharing! The model file can be successfully downloaded. Any chance you could upload the corresponding vocab file?

Sorry, for some reason the vocab file we uploaded was flagged as illegal content and deleted by the administrator. We are dealing with it and will release the file again as soon as possible.

autoliuweijie commented on Jun 27, 2020

> Hello, it is difficult to download the models if you don't have a WeChat or QQ account. Can you make them accessible without a login? Thanks.

Sorry, we are looking for another free file-hosting service.

autoliuweijie commented on Jun 27, 2020

> Thanks for sharing! The model file can be successfully downloaded. Any chance you could upload the corresponding vocab file?

You can get the corresponding vocab file from the UER-py project:

https://github.com/dbiir/UER-py/blob/master/models/google_uncased_en_vocab.txt

autoliuweijie commented on Jun 27, 2020
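For anyone who just wants to sanity-check that file, a quick sketch of loading it with the Hugging Face WordPiece tokenizer (an assumed workflow on my part; K-BERT/UER use their own tokenizer classes internally, and the local path below is a placeholder):

```python
# Sketch: verify the downloaded UER vocab file by loading it into a WordPiece
# tokenizer; the file path is a placeholder for wherever you saved the vocab.
from transformers import BertTokenizer

tokenizer = BertTokenizer(vocab_file="models/google_uncased_en_vocab.txt",
                          do_lower_case=True)
print(tokenizer.tokenize("Fine-tuning K-BERT on an English corpus"))
```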

It works. Thanks for the clarification!

ankechiang commented on Jun 27, 2020

Hey, with regards to English: I extracted some domain-specific triples from the English DBpedia, so that aspect is covered. I used a PyTorch script to convert cased BERT-base to the .bin file required by UER. The model loss doesn't decrease, however. I see that the add_knowledge_with_vm method starts at the word level and then breaks words down into individual characters. Presumably this is for Chinese character-level embeddings; is there a version for English WordPiece encoding, perhaps byte-pair encoding, or even whole words? Many thanks and great work!

EdwardBurgin commented on Jul 2, 2020
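On the character-splitting question above, one plausible adaptation for English is to delegate the split to WordPiece instead of individual characters. A hedged sketch follows; the helper name is hypothetical and this is not the repository's add_knowledge_with_vm:

```python
# Illustration of replacing the Chinese character-level split with WordPiece
# sub-words for English; not code from the K-BERT repository.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

def split_token_for_english(token):
    # The Chinese version effectively does list(token), one unit per character.
    # Delegating to WordPiece keeps every piece inside the English BERT vocab.
    return tokenizer.tokenize(token)

print(split_token_for_english("hyperparameter"))  # sub-word pieces, not characters
```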

> Hey, with regards to English: I extracted some domain-specific triples from the English DBpedia, so that aspect is covered. I used a PyTorch script to convert cased BERT-base to the .bin file required by UER. The model loss doesn't decrease, however. I see that the add_knowledge_with_vm method starts at the word level and then breaks words down into individual characters. Presumably this is for Chinese character-level embeddings; is there a version for English WordPiece encoding, perhaps byte-pair encoding, or even whole words? Many thanks and great work!

Hello, I am a new student in this domain, and I also want to apply this model to an English corpus. I hope you have time to give me some advice on a few questions:

1. Have you solved the problem of using English WordPiece encoding?
2. I don't know how to extract domain-specific triples from the English DBpedia (for example, for the computer science domain). Could you give me some advice?

Thank you in advance! I am waiting for your reply.

Jiaxin-Liu-96 commented on Nov 26, 2020
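On question 2 above: one common way to pull domain-specific triples (not necessarily what anyone in this thread used) is DBpedia's public SPARQL endpoint. A sketch with SPARQLWrapper, with the category and LIMIT chosen arbitrarily:

```python
# Sketch: fetch triples whose subject belongs to a chosen DBpedia category.
# Requires `pip install SPARQLWrapper`; category and LIMIT are arbitrary examples.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dct: <http://purl.org/dc/terms/>
    PREFIX dbc: <http://dbpedia.org/resource/Category:>
    SELECT ?s ?p ?o WHERE {
      ?s dct:subject dbc:Machine_learning .
      ?s ?p ?o .
    } LIMIT 100
""")
sparql.setReturnFormat(JSON)

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["s"]["value"], row["p"]["value"], row["o"]["value"])
```

Swapping the category (or walking the SKOS category hierarchy) is one way to target a domain such as computer science.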

> English DBpedia

Hello, can you share the triples (English) and the BERT model for testing purposes? Did it finally work?

vsrana-ai commented on Jan 26, 2021

> I used the English BERT (model and tokenizer) to change K-BERT into an English version of K-BERT. However, I got poor scores on the classification tasks. If you have K-BERT code for fine-tuning on an English corpus, could you please release it?

Does the English dataset finally work? Thanks very much.

zhuchenxi commented on Feb 22, 2021

Hello, I am a student working on a text classification task, and I'm trying to use K-BERT on a dataset that is purely in English. Though I understand the implementation strategies in K-BERT, I am a little lost on how to apply them to a corpus that is purely in English. I see that the vocab file shared by @autoliuweijie is somehow not accessible. It would be great if you could give me a sense of direction on where to start.

Thank you

vishprivenkat commented on Sep 19, 2022

(Automatic reply) Hello, I have received your email and will reply as soon as possible. Have a nice day!

Jiaxin-Liu-96 commented on Sep 19, 2022