PreSumm
Add Japanese BERT pretrained model
In order to use the bert-base-japanese-whole-word-masking model, I installed transformers independently and fixed a few parts of the code.
Hi guys, can you tell me how to get the new BERT pre-trained model? Thank you.
Hello, thanks for your contribution. I notice that you didn't change some functions designed to preprocess the English datasets in the data_builder file, and that you substituted the multilingual model for the old one, so I guess that you used the English datasets to train your Japanese model. Is my guess correct? Looking forward to your reply. Thank you!
@congdoanit98 Sorry for the very late reply. I downloaded it from this link: https://huggingface.co/cl-tohoku/bert-base-japanese-whole-word-masking
@beanandrew Yeah, actually the work was not finished, and I had to close this PR... Anyway, I will try to answer your questions.
you didn't change some functions designed to preprocess the english datasets
You're correct, but I used a Japanese dataset.
I wrote some code to do the same thing in Japanese as steps 1 to 4 (it simply generates the JSON files), and I preprocessed my dataset with it. After that, I fed the result into step 5. Note that you have to tokenize the Japanese text with the JapaneseTokenizer from huggingface/transformers, as sketched below.
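A minimal sketch of that tokenization step (assuming the class name BertJapaneseTokenizer, which is how current huggingface/transformers exposes it, and assuming the MeCab-based extras such as fugashi are installed; my own preprocessing script is not part of this PR):

    from transformers import BertJapaneseTokenizer

    # Sketch only: load the tokenizer that matches the Japanese checkpoint.
    tokenizer = BertJapaneseTokenizer.from_pretrained(
        'cl-tohoku/bert-base-japanese-whole-word-masking')

    tokens = tokenizer.tokenize('日本語の文章をトークン化する。')
    ids = tokenizer.convert_tokens_to_ids(tokens)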
After that, I changed the part that loads the pretrained model from bert-base-uncased to cl-tohoku/bert-base-japanese-whole-word-masking. Then I ran the training.
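The swap itself is essentially a one-line change; a minimal sketch, with placeholder surroundings rather than PreSumm's exact model-building code:

    from transformers import BertModel

    # Before (English checkpoint), shown only as an illustration:
    # model = BertModel.from_pretrained('bert-base-uncased')
    # After (Japanese checkpoint):
    model = BertModel.from_pretrained('cl-tohoku/bert-base-japanese-whole-word-masking')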
You also have to add these tokens ("[unused0]", "[unused1]", "[unused2]", "[unused3]", "[unused4]", "[unused5]", "[unused6]") to token.txt.
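A minimal sketch of that step, assuming you simply append the reserved tokens to the end of the vocabulary file (the name token.txt follows the wording above; the stock checkpoint usually ships it as vocab.txt):

    # Append the reserved tokens used by PreSumm to the vocabulary file.
    # Note: if this grows the vocabulary beyond the checkpoint's size, the
    # model's token embedding matrix has to be resized to match.
    unused = ['[unused{}]'.format(i) for i in range(7)]  # [unused0] .. [unused6]
    with open('token.txt', 'a', encoding='utf-8') as f:
        for tok in unused:
            f.write(tok + '\n')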
I think those were all the changes. If the above doesn't work, don't hesitate to ask me.
Thanks.
Thanks for your reply.
I am also trying to transfer this work to another language. Before seeing your work, I had noticed that even if I finish steps 1 to 4 with my own code and only use the format_to_bert function for step 5, some functions still need to be changed.
For example, the _rouge_clean() function in the file data_builder.py under src, shown below, is used in step 5 to clean up the punctuation in each sentence. But it actually does this by removing every character outside a-z, A-Z, and 0-9, which means that Japanese words are removed as well, and an empty list is returned as sent_labels.
def _rouge_clean(s):
    return re.sub(r'[^a-zA-Z0-9 ]', '', s)
I am still in the middle of changing the code and haven't run experiments on this yet, so I want to know how you solved problems like this, or whether, in your experiments, these problems turned out not to affect the results. Hoping for your reply, thank you!
Well, you are correct again.
I didn't pay attention to the _rouge_clean function or some of the others inside format_to_bert, but it does seem to be critical for the generated dataset.
I ran my experiments with the original step 5 code, and I could generate some summaries with reasonable results. However, the code you mentioned seems to have a huge impact on the EXT result, which can in turn lead to a bad EXTABS result. I will have to modify it and re-run the experiments.
I hope this information will support your work.
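One possible modification (just a sketch, I haven't validated it experimentally) is to keep Unicode word characters instead of only a-z/0-9:

    import re

    def _rouge_clean(s):
        # Keep Unicode word characters and spaces (Python 3 \w is Unicode-aware),
        # so Japanese text is no longer stripped down to an empty string.
        return re.sub(r'[^\w ]', '', s)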
Thanks for your reply, your answer helps me a lot!
By the way, I want to ask you a question about the convert_tokens_to_ids() function of the BertTokenizer class in the tokenization.py file under src/others.
def convert_tokens_to_ids(self, tokens):
    """Converts a sequence of tokens into ids using the vocab."""
    ids = []
    for token in tokens:
        # Tokens listed in never_split (e.g. [CLS], [SEP]) are skipped here.
        if(token in self.never_split):
            continue
        else:
            ids.append(self.vocab[token])
    return ids
When I tried to debug the code in step 5, I noticed that, because of this code, tokens like [CLS] and [SEP] are skipped, and that causes a "CUDAType Error" when I use the preprocessed dataset. So I changed the code as follows, and the error no longer appears.
def convert_tokens_to_ids(self, tokens):
    """Converts a sequence of tokens into ids using the vocab."""
    ids = []
    for token in tokens:
        ids.append(self.vocab[token])
    return ids
I want to know: did you run into errors like this when you used the original code, or is this just a problem on my side? Also, if I want to test my model in another language with 'mode -test', do I need to make any other changes to the code? Hoping for your reply. Thank you!
Hmm, I couldn't find the code you mentioned on the master branch. Anyway, your suggestion seems to work beautifully. Please check it again: https://github.com/nlpyang/PreSumm/blame/master/src/others/tokenization.py#L108
I remember that I commented out the code below (the ROUGE score calculator) because I couldn't resolve errors from the pyrouge library. https://github.com/nlpyang/PreSumm/blame/master/src/models/predictor.py#L188
Instead of the pyrouge library, I used this one. I changed the code to export the summarized text, and after all the results were written, I evaluated the ROUGE score separately. If you can install the pyrouge library correctly, I don't think you need to worry about this.
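If you cannot, a rough sanity check is easy to hand-roll; the sketch below (not the library I actually used, just an illustration) computes a simple ROUGE-N F1 over character-level tokens, since English-oriented ROUGE tokenizers tend to drop non-Latin characters:

    from collections import Counter

    def rouge_n_f1(candidate_tokens, reference_tokens, n=1):
        # Simplified ROUGE-N F1 from n-gram overlap counts (single reference).
        def ngrams(tokens):
            return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        cand, ref = ngrams(candidate_tokens), ngrams(reference_tokens)
        overlap = sum((cand & ref).values())
        if overlap == 0:
            return 0.0
        precision = overlap / sum(cand.values())
        recall = overlap / sum(ref.values())
        return 2 * precision * recall / (precision + recall)

    # Character-level tokens sidestep tokenizers that only keep a-z/0-9.
    print(rouge_n_f1(list('今日は良い天気です'), list('今日は天気が良いです'), n=2))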
Additionally, the size of the token vocabulary is very important. In the Japanese token.txt I couldn't find anything like [unused0], so I expanded token.txt to support those tokens. I think you already know what I mean, but just to make sure :)
Thank you very much for your quick reply!
I followed your advice and checked the master branch, and found that this code was not there. I also checked other branches and my own repo, only to find it nowhere... Maybe I copied a wrong version of the project and that led to this confusing error.
Your addition of the tokens [unused0] through [unused6] is very useful. Before I saw your work, I didn't know how to solve this problem and had only crudely tried to add [unused0] to the vocab myself... Thanks for your work!
Now I can finally run experiments on my datasets with your help. I will contact you if I have any new findings~