PreSumm
Add Japanese BERT pretrained model
In order to use the bert-base-japanese-whole-word-masking model, I installed transformers independently and fixed a few parts of the code.
Hi guys, can you tell me how to get the new BERT pre-trained model? Thank you.
Hello, thanks for your contribution. I notice that you didn't change some functions designed to preprocess the English datasets in the data_builder file, and that you substituted the multilingual model for the old one, so I guess that you used the English datasets to train your Japanese model. Is my guess correct? Looking forward to your reply. Thank you!
@congdoanit98 Sorry for the very late reply. I downloaded it from this link: https://huggingface.co/cl-tohoku/bert-base-japanese-whole-word-masking
@beanandrew Yeah, actually the work was not finished, and I had to close this PR... Anyway, I will try to answer your questions.
you didn't change some functions designed to preprocess the english datasets
You're correct, but I used a Japanese dataset.
I wrote some code to do the same thing in Japanese as steps 1 to 4 (it simply generates the JSON files), and I preprocessed my dataset with it. After that, I fed the result into step 5. Note that you have to tokenize the Japanese text with the JapaneseTokenizer from huggingface/transformers, as sketched below.
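A minimal sketch of that tokenization step (assuming the class name BertJapaneseTokenizer, which is how current huggingface/transformers exposes it, and assuming the MeCab-based extras such as fugashi are installed; my own preprocessing script is not part of this PR):

    from transformers import BertJapaneseTokenizer

    # Sketch only: load the tokenizer that matches the Japanese checkpoint.
    tokenizer = BertJapaneseTokenizer.from_pretrained(
        'cl-tohoku/bert-base-japanese-whole-word-masking')

    tokens = tokenizer.tokenize('日本語の文章をトークン化する。')
    ids = tokenizer.convert_tokens_to_ids(tokens)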
After that, I changed the part that loads the pretrained model from bert-base-uncased to cl-tohoku/bert-base-japanese-whole-word-masking. Then I ran the training.
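The swap itself is essentially a one-line change; a minimal sketch, with placeholder surroundings rather than PreSumm's exact model-building code:

    from transformers import BertModel

    # Before (English checkpoint), shown only as an illustration:
    # model = BertModel.from_pretrained('bert-base-uncased')
    # After (Japanese checkpoint):
    model = BertModel.from_pretrained('cl-tohoku/bert-base-japanese-whole-word-masking')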
You also have to add these tokens ("[unused0]", "[unused1]", "[unused2]", "[unused3]", "[unused4]", "[unused5]", "[unused6]") to token.txt.
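A minimal sketch of that step, assuming you simply append the reserved tokens to the end of the vocabulary file (the name token.txt follows the wording above; the stock checkpoint usually ships it as vocab.txt):

    # Append the reserved tokens used by PreSumm to the vocabulary file.
    # Note: if this grows the vocabulary beyond the checkpoint's size, the
    # model's token embedding matrix has to be resized to match.
    unused = ['[unused{}]'.format(i) for i in range(7)]  # [unused0] .. [unused6]
    with open('token.txt', 'a', encoding='utf-8') as f:
        for tok in unused:
            f.write(tok + '\n')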
I think those were all the changes. If the above doesn't work, don't hesitate to ask me.
Thanks.
Thanks for your reply.
I am also trying to transfer this work to another language. Before seeing your work, I had noticed that even if I finish steps 1 to 4 with my own code and only use the format_to_bert function for step 5, some functions still need to be changed.
For example, the _rouge_clean() function in the file data_builder.py under src, shown below, is used in step 5 to clean up the punctuation in each sentence. But it actually does this by removing every character outside a-z, A-Z, and 0-9, which means that Japanese words are removed as well, and an empty list is returned as sent_labels.
def _rouge_clean(s):
    return re.sub(r'[^a-zA-Z0-9 ]', '', s)
I am still in the middle of changing the code and haven't run experiments on this yet, so I want to know how you solved problems like this, or whether, in your experiments, these problems turned out not to affect the results. Hoping for your reply, thank you!
Well, you are correct again.
I didn't pay attention to the _rouge_clean function or some of the others inside format_to_bert, but it does seem to be critical for the generated dataset.
I ran my experiments with the original step 5 code, and I could generate some summaries with reasonable results. However, the code you mentioned seems to have a huge impact on the EXT result, which can in turn lead to a bad EXTABS result. I will have to modify it and re-run the experiments.
I hope this information will support your work.
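One possible modification (just a sketch, I haven't validated it experimentally) is to keep Unicode word characters instead of only a-z/0-9:

    import re

    def _rouge_clean(s):
        # Keep Unicode word characters and spaces (Python 3 \w is Unicode-aware),
        # so Japanese text is no longer stripped down to an empty string.
        return re.sub(r'[^\w ]', '', s)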
Thanks for your reply, your answer helps me a lot!
By the way, I want to ask you a question about the convert_tokens_to_ids() function of the BertTokenizer class in the tokenization.py file under src/others.
def convert_tokens_to_ids(self, tokens):
    """Converts a sequence of tokens into ids using the vocab."""
    ids = []
    for token in tokens:
        # Tokens listed in never_split (e.g. [CLS], [SEP]) are skipped here.
        if(token in self.never_split):
            continue
        else:
            ids.append(self.vocab[token])
    return ids
When I tried to debug the code in step 5, I noticed that, because of this code, tokens like [CLS] and [SEP] are skipped, and that causes a "CUDAType Error" when I use the preprocessed dataset. So I changed the code as follows, and the error no longer appears.
def convert_tokens_to_ids(self, tokens):
    """Converts a sequence of tokens into ids using the vocab."""
    ids = []
    for token in tokens:
        ids.append(self.vocab[token])
    return ids
I want to know: did you run into errors like this when you used the original code, or is this just a problem on my side? Also, if I want to test my model in another language with 'mode -test', do I need to make any other changes to the code? Hoping for your reply. Thank you!
Hmm, I couldn't find the code you mentioned on the master branch. Anyway, your suggestion seems to work beautifully. Please check it again: https://github.com/nlpyang/PreSumm/blame/master/src/others/tokenization.py#L108
I remember that I commented out the code below (the ROUGE score calculator) because I couldn't resolve errors from the pyrouge library. https://github.com/nlpyang/PreSumm/blame/master/src/models/predictor.py#L188
Instead of the pyrouge library, I used this one. I changed the code to export the summarized text, and after all the results were written, I evaluated the ROUGE score separately. If you can install the pyrouge library correctly, I don't think you need to worry about this.
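If you cannot, a rough sanity check is easy to hand-roll; the sketch below (not the library I actually used, just an illustration) computes a simple ROUGE-N F1 over character-level tokens, since English-oriented ROUGE tokenizers tend to drop non-Latin characters:

    from collections import Counter

    def rouge_n_f1(candidate_tokens, reference_tokens, n=1):
        # Simplified ROUGE-N F1 from n-gram overlap counts (single reference).
        def ngrams(tokens):
            return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        cand, ref = ngrams(candidate_tokens), ngrams(reference_tokens)
        overlap = sum((cand & ref).values())
        if overlap == 0:
            return 0.0
        precision = overlap / sum(cand.values())
        recall = overlap / sum(ref.values())
        return 2 * precision * recall / (precision + recall)

    # Character-level tokens sidestep tokenizers that only keep a-z/0-9.
    print(rouge_n_f1(list('今日は良い天気です'), list('今日は天気が良いです'), n=2))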
Additionally, the size of the token vocabulary is very important. In the Japanese token.txt I couldn't find anything like [unused0], so I expanded token.txt to support those tokens. I think you already know what I mean, but just to make sure :)
Thank you very much for your quick reply!
I followed your advice and checked the master branch, and found that this code was not there. I also checked other branches and my own repo, only to find it nowhere... Maybe I copied a wrong version of the project and that led to this confusing error.
Your addition of the tokens [unused0] through [unused6] is very useful. Before I saw your work, I didn't know how to solve this problem and had only crudely tried to add [unused0] to the vocab myself... Thanks for your work!
Now I can finally run experiments on my datasets with your help. I will contact you if I have any new findings~