
How to use my own additional vocabulary dictionary?

Open EnteLee opened this issue 5 years ago • 52 comments

Hello! We are Korean students. We would like to implement a Korean slang filtering system using your BERT model.

A test is in progress: we are fine-tuning on the CoLA task with run_classifier.py from the existing multilingual model. However, the existing vocabulary seems to be missing many words, so we want to use the BERT model with its pre-trained weights after adding words to vocab.txt. However, once we modify vocab.txt and bert_config.json, the shapes no longer match what is stored in the original bert_model.ckpt.

We would like to use the pre-trained weights with our own additional vocabulary. Is there a way to modify this specification, or are we forced to pre-train from scratch?

Thank you.

cf. we get this error message:

INFO:tensorflow:Error recorded from evaluation_loop: indices[6,33,0] = 105932 is not in [0, 105879) [[node bert/embeddings/embedding_lookup (defined at /home/user01/code/bert/modeling.py:421) = GatherV2[Taxis=DT_INT32, Tindices=DT_INT32, Tparams=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](bert/embeddings/word_embeddings/read, bert/embeddings/ExpandDims, bert/embeddings/embedding_lookup/axis)]]
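The id in the error (105932) lies outside the checkpoint's 105,879-row embedding table, which is what happens when tokens are appended to vocab.txt. A minimal sanity check, assuming the standard file names:

```python
# Quick sanity check (assuming the standard file names): the number of lines in
# vocab.txt must equal "vocab_size" in bert_config.json, which in turn must
# match the word-embedding table stored in bert_model.ckpt (105879 rows for the
# multilingual checkpoint). Appending tokens beyond that triggers the GatherV2
# error above.
import json

with open("vocab.txt", encoding="utf-8") as f:
    vocab_lines = sum(1 for _ in f)

with open("bert_config.json", encoding="utf-8") as f:
    config_vocab_size = json.load(f)["vocab_size"]

print("vocab.txt lines:       ", vocab_lines)
print("bert_config vocab_size:", config_vocab_size)
```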

EnteLee avatar Jan 24 '19 08:01 EnteLee

Hi!

I think the information you are looking for is in the readme file: https://github.com/google-research/bert#learning-a-new-wordpiece-vocabulary

rodgzilla avatar Jan 24 '19 09:01 rodgzilla

Pretrain from scratch or modify the first ~1000 lines of the vocab.txt file with the vocab you'd like to add.
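For illustration, a rough sketch of that edit; add_terms_to_vocab and the file names are hypothetical, and the key point is that the file length and the position of every other token stay unchanged:

```python
# Hypothetical helper (not part of this repo): write new domain terms into the
# [unusedN] placeholder lines of vocab.txt, keeping the file length and the
# position of every other token unchanged.
def add_terms_to_vocab(vocab_path, new_terms, out_path):
    with open(vocab_path, encoding="utf-8") as f:
        vocab = [line.rstrip("\n") for line in f]

    terms = iter(new_terms)
    for i, token in enumerate(vocab):
        if token.startswith("[unused"):
            try:
                vocab[i] = next(terms)
            except StopIteration:
                break

    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(vocab) + "\n")

add_terms_to_vocab("vocab.txt", ["anfield", "nephrectomy"], "vocab_custom.txt")
```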

bradfox2 avatar Feb 02 '19 16:02 bradfox2

I noticed that there are a LOT of single foreign characters in the vocab.txt file. I'm wondering whether one could also remove these and replace them with words for fine tuning?

bsugerman avatar Feb 06 '19 12:02 bsugerman

@bsugerman It seems you really can't avoid all the foreign characters when you use a very large corpus to train the model, and you also can't replace them (you could delete them somehow) without training the model all over again, because the parameters in the embedding layer are part of the pre-trained model.

yzho0907 avatar Feb 19 '19 06:02 yzho0907

I have a domain-specific (medical) English corpus that I want to do some additional pre-training on from the BERT checkpoint. However, quite a lot of words in the medical vocabulary are not present in the vocab.txt file.

Let's say I want to add the top 500 words in the corpus that are not already in the vocabulary. Is this as easy as just replacing the [unused#] entries in vocab.txt? No additional changes to bert_config.json?

peregilk avatar Apr 02 '19 17:04 peregilk

I have a similar question to @peregilk above: how do I add a domain-specific vocab.txt in a language other than English? The official repo says: "This repository does not include code for learning a new WordPiece vocabulary; there are a number of open source options available. However, keep in mind that these are not compatible with our tokenization.py library." So how do we learn a domain-specific vocab? With the available multilingual pretrained weights, BERT didn't perform well on a downstream classification task on an Urdu corpus. Or can we work around this for a domain-specific vocab by modifying the first ~1000 lines of vocab.txt, as suggested by @bradfox2?

samreenkazi avatar Apr 14 '19 15:04 samreenkazi

@samreenkazi I ended up using spaCy to make a list of all the words in a portion of the corpus. There are easy built-in functions for listing, for instance, the 10,000 most common words in the text. I then checked this against the BERT vocab file, and ended up adding roughly 400 words in the empty spots in the vocab file.

I did a few tests, and on my very specific medical language it seemed to have a good effect. However, I noticed that it needs quite a lot of pre-training to outperform the standard vocabulary. I trained for a couple of days on a 2080 Ti until it was better (logical, since the weights for the new vocab are initialised from scratch).

I am not sure if this answers your question about the Urdu corpus. However, if you would like to have a look at the script I used for building the vocab file, just send me a message.
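A rough reconstruction of that workflow (not the author's actual script; corpus.txt and the 10,000 / 400 cut-offs are just placeholders):

```python
# Count word frequencies with spaCy and keep the most common words that are
# missing from the BERT vocab file.
from collections import Counter
import spacy

nlp = spacy.blank("en")  # tokenizer only, no pretrained model needed
counter = Counter()

with open("corpus.txt", encoding="utf-8") as f:
    for line in f:
        counter.update(t.text.lower() for t in nlp(line) if t.is_alpha)

with open("vocab.txt", encoding="utf-8") as f:
    bert_vocab = {line.strip() for line in f}

candidates = [w for w, _ in counter.most_common(10000) if w not in bert_vocab]
print(candidates[:400])  # roughly the number of empty [unused] spots you can fill
```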

peregilk avatar Apr 30 '19 07:04 peregilk

Yes, I would like to have a look at the script for building the vocab.

samreenkazi avatar Apr 30 '19 08:04 samreenkazi

Please contact me at "per at capia dot no". I'll send you the code.


peregilk avatar Apr 30 '19 09:04 peregilk

@peregilk Are you able to push that code up to a repo and link back here? It would be useful for many.

bradfox2 avatar May 02 '19 04:05 bradfox2

@bradfox2 , @peregilk You can use a modified version of Tensor2Tensor/text_encoder_build_subword.py code to generate BERT compatible vocab. https://github.com/kwonmha/bert-vocab-builder

Dhanachandra avatar May 02 '19 12:05 Dhanachandra

@peregilk Are you able to push that code up to a repo and link back here? It would be useful for many.

or perhaps post the code on https://gist.github.com/ - it's free of cost

techmattersinc avatar May 02 '19 17:05 techmattersinc

Hi, I would like to confirm the idea of adding an unseen word. Suppose I have a new word "xyzw". To include this word, the easiest approach is to replace [unused1] with "xyzw" in vocab.txt. Then I need to run fine-tuning on my specialized data so that the word vector for "xyzw" can be learned. Is this the correct idea?

datduong avatar May 04 '19 01:05 datduong

@bradfox2 , @peregilk You can use a modified version of Tensor2Tensor/text_encoder_build_subword.py code to generate BERT compatible vocab. https://github.com/kwonmha/bert-vocab-builder

That is also available in the BERT repo. The question is more around the use of some already developed, easy to use vocab comparison scripts.

bradfox2 avatar May 04 '19 04:05 bradfox2

Pretrain from scratch or modify the first ~1000 lines of the vocab.txt file with the vocab you'd like to add.

I also need to add a few thousand new tokens that don't exist in the BERT vocab file. When I check the vocab file of the model (multi_cased_L-12_H-768_A-12), the first 100 tokens are "unused" tokens ([unused0]-[unused99]), and they are followed by the [UNK], [CLS], [SEP], [MASK] tokens. I don't think you would suggest modifying those tokens, or the numbers and letters that come right after them, even though they are within the first ~1000 lines. Can you help me see what I am missing here?

Shouldn't we just modify the tokens that aren't likely to exist in the corpus we use for fine-tuning?

irhallac avatar Jun 21 '19 10:06 irhallac

@irhallac It is the [unusedXXX] tokens that can be replaced with any word you like. I am running some experiments on how effective this really is, but from my understanding you should prioritise words that are frequent in the domain-specific corpus, missing from vocab.txt, and for which the current tokenization is unlikely to be any good. You should also take into account that BERT is very good at tokenizing long words.

Let's say you have an English football-specific corpus. You notice that the word "footballs" is not in vocab.txt. It is, however, pointless to add it. BERT tokenizes it as "football" + "##s" -> [2375] [2016] (look at the line numbers in vocab.txt) and has already learned a very good representation both for the individual tokens and for the combination. However, your text is largely about football stadiums, and you see that "Anfield" is not in vocab.txt. It will be tokenized as "an" + "##field", and there is reason to believe that the current learned word embedding is not very useful.

If you add "anfield" to one of the unused spots in vocab.txt and then continue pre-training from the last checkpoint, this embedding will just start from random (think "0"), and the model might learn the word faster since it will not be confused by other uses of the tokens [an] and [##field]. This is my understanding of how this works.

It is important to remember that the line numbers in vocab.txt matter. For instance, "!" is at line #1000, and it should still be there after you edit the file. For pre-training from an existing checkpoint you should not change the size of the vocab file. This means you have around 1000 extra words at your disposal. For pre-training from scratch, build a new vocab based only on your corpus.

Since BERT does an excellent job of tokenising and learning these combinations, do not expect dramatic improvements from adding words to the vocab. In my experience, adding very specific terms, like common long medical Latin words, has some effect. Adding words like "footballs" will likely just have negative effects, since the current vector is already pretty good.
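For illustration, this behaviour can be checked with the repo's own tokenization.py; the outputs in the comments follow the description above, and vocab_custom.txt is a hypothetical edited copy with "anfield" written into an [unusedN] line:

```python
# Illustration with this repo's tokenization.py (uncased English vocab assumed).
import tokenization

tok = tokenization.FullTokenizer(vocab_file="vocab.txt", do_lower_case=True)
print(tok.tokenize("footballs"))  # ['football', '##s'] -> already well covered
print(tok.tokenize("anfield"))    # ['an', '##field']   -> generic sub-tokens

# After writing "anfield" into one of the [unusedN] lines (vocab_custom.txt is
# a hypothetical edited copy, same length and line order as vocab.txt):
tok2 = tokenization.FullTokenizer(vocab_file="vocab_custom.txt", do_lower_case=True)
print(tok2.tokenize("anfield"))   # ['anfield'] -> one token, embedding learned from scratch
```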

peregilk avatar Jun 21 '19 10:06 peregilk

@peregilk thank you. In the model I downloaded there are only 100 [unusedXXX] tokens in the vocab.txt, not 1000. But you say 1000 can be changed?

irhallac avatar Jun 21 '19 11:06 irhallac

They are in two chunks: from line #2 [unused0] to line #100 [unused98]. Then there are four tokens that absolutely should not be changed: [UNK], [CLS], [SEP], [MASK].

Then they continue from line #105 to line #999. In total, roughly 1000 unused tokens.

peregilk avatar Jun 21 '19 11:06 peregilk

@peregilk By the way, I want to use the BERT model on the Turkish language. I downloaded it from https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip and its vocab looks like this:

.
[unused97]
[unused98]
[unused99]
[UNK]
[CLS]
[SEP]
[MASK]
<S>
<T>
!
"
#
$
%
&
'
(
)
*
+
,
-
.
/
0
1
2
.

irhallac avatar Jun 21 '19 11:06 irhallac

OK, I did not know that. Then it is only the uncased version that has 1000 unused spots.

peregilk avatar Jun 21 '19 11:06 peregilk

Alternatively, could you also remove the non-English words and the rare symbols? Would this significantly affect the model?

datduong avatar Jun 21 '19 23:06 datduong

They are in two chunks: from line #2 [unused0] to line #100 [unused98]. Then there are four tokens that absolutely should not be changed: [UNK], [CLS], [SEP], [MASK].

Then they continue from line #105 to line #999. In total, roughly 1000 unused tokens.

Does that mean I can't add more than 1000 words?

bhoomit avatar Jun 24 '19 21:06 bhoomit

@samreenkazi I ended up using spaCy to make a list of all the words in a portion of the corpus. There are easy built-in functions for listing, for instance, the 10,000 most common words in the text. I then checked this against the BERT vocab file, and ended up adding roughly 400 words in the empty spots in the vocab file.

I did a few tests, and on my very specific medical language it seemed to have a good effect. However, I noticed that it needs quite a lot of pre-training to outperform the standard vocabulary. I trained for a couple of days on a 2080 Ti until it was better (logical, since the weights for the new vocab are initialised from scratch).

I am not sure if this answers your question about the Urdu corpus. However, if you would like to have a look at the script I used for building the vocab file, just send me a message.

@peregilk can you please share the code to modify the vocabulary and then pre-train the model to adapt to the new vocab? Also, can you tell me which metrics you used to decide that the new weights were better?

jinamshah avatar Jul 18 '19 11:07 jinamshah

Pretrain from scratch or modify the first ~1000 lines of the vocab.txt file with the vocab you'd like to add.

@bradfox2 What are we supposed to do after these changes? How is the model retrained?

ivanacorovic avatar Jul 26 '19 20:07 ivanacorovic

@irhallac It is the [unusedXXX] tokens that can be replaced with any word you like. I am running some experiments on how effective this really is, but from my understanding you should prioritise words that are frequent in the domain-specific corpus, missing from vocab.txt, and for which the current tokenization is unlikely to be any good. You should also take into account that BERT is very good at tokenizing long words.

Let's say you have an English football-specific corpus. You notice that the word "footballs" is not in vocab.txt. It is, however, pointless to add it. BERT tokenizes it as "football" + "##s" -> [2375] [2016] (look at the line numbers in vocab.txt) and has already learned a very good representation both for the individual tokens and for the combination. However, your text is largely about football stadiums, and you see that "Anfield" is not in vocab.txt. It will be tokenized as "an" + "##field", and there is reason to believe that the current learned word embedding is not very useful.

If you add "anfield" to one of the unused spots in vocab.txt and then continue pre-training from the last checkpoint, this embedding will just start from random (think "0"), and the model might learn the word faster since it will not be confused by other uses of the tokens [an] and [##field]. This is my understanding of how this works.

It is important to remember that the line numbers in vocab.txt matter. For instance, "!" is at line #1000, and it should still be there after you edit the file. For pre-training from an existing checkpoint you should not change the size of the vocab file. This means you have around 1000 extra words at your disposal. For pre-training from scratch, build a new vocab based only on your corpus.

Since BERT does an excellent job of tokenising and learning these combinations, do not expect dramatic improvements from adding words to the vocab. In my experience, adding very specific terms, like common long medical Latin words, has some effect. Adding words like "footballs" will likely just have negative effects, since the current vector is already pretty good.

@peregilk I am also working in the medical domain. Can you please share the common long medical Latin words that you added to the vocab?

Dhanachandra avatar Aug 27 '19 12:08 Dhanachandra

@irhallac Let me post an update on my experiences with using vocab files during pretraining on a domain-specific corpus. As far as I know, the only reasonable way to test whether this works is to validate it by also fine-tuning the pretrained networks. You will have to do this multiple times before you get reliable results.

My initial experiments indicated that adding custom words to the vocab file had some effect. However, at least on my corpus, which can be described as "medical tweets", this effect just disappears after running the domain-specific pretraining for a while.

After spending quite some time on this, I have ended up dropping the custom vocab files totally. BERT seems to be able to learn these specialised words by tokenizing them.

peregilk avatar Aug 27 '19 13:08 peregilk

Pretrain from scratch or modify the first ~1000 lines of the vocab.txt file with the vocab you'd like to add.

@bradfox2 What are we supposed to do after these changes? How is the model retrained?

Fine-tune the model on your specific text corpus. The model weights were tuned during the initial pretraining with the tokenized vocabulary, so you need to keep the same token mapped to the same input 'node'. The first ~1000 tokens are meaningless and the model learns to essentially ignore them. Give the meaningless vocab some relevance with a custom dataset, continue fine-tuning, and the model will start to give the previously ignored tokens/vocab some weight (pun intended).

bradfox2 avatar Sep 07 '19 05:09 bradfox2

@samreenkazi I ended up using spaCy to make a list of all the words in a portion of the corpus. There are easy built-in functions for listing, for instance, the 10,000 most common words in the text. I then checked this against the BERT vocab file, and ended up adding roughly 400 words in the empty spots in the vocab file.

I did a few tests, and on my very specific medical language it seemed to have a good effect. However, I noticed that it needs quite a lot of pre-training to outperform the standard vocabulary. I trained for a couple of days on a 2080 Ti until it was better (logical, since the weights for the new vocab are initialised from scratch).

I am not sure if this answers your question about the Urdu corpus. However, if you would like to have a look at the script I used for building the vocab file, just send me a message.

Hey, I would like to have a look at the script. Can you help?

mahanswaray avatar Nov 06 '19 06:11 mahanswaray

I did a few more tests on this (as I mentioned in another post). I am no longer convinced by my own results. The challenge is that fine-tuning has a lot of variance. I think the first positive result was mainly a fluke. Even if it gives a marginal improvement, it also adds more complexity (how many words, which words, etc.).

Domain-specific pre-training is essential for getting these models to perform well on specialised domains, but the extra words in the dictionary are just a tiny detail that is probably not worth the effort.


peregilk avatar Nov 06 '19 09:11 peregilk

@peregilk Hi, I am also interested in the medical field and I have the same problem. Did you make any progress?

flaviofafe1414 avatar Jan 01 '20 15:01 flaviofafe1414

I have a similar question to @peregilk above: how do I add a domain-specific vocab.txt in a language other than English? The official repo says: "This repository does not include code for learning a new WordPiece vocabulary; there are a number of open source options available. However, keep in mind that these are not compatible with our tokenization.py library." So how do we learn a domain-specific vocab? With the available multilingual pretrained weights, BERT didn't perform well on a downstream classification task on an Urdu corpus. Or can we work around this for a domain-specific vocab by modifying the first ~1000 lines of vocab.txt, as suggested by @bradfox2?

Hello, Did you come up with any solution for this ? I have my own custom tokenizer and it has a lot of new words.

muhammadfahid51 avatar Jan 10 '20 05:01 muhammadfahid51

I did a few more tests on this (as I mentioned in another post). I am no longer convinced by my own results. The challenge is that fine-tuning has a lot of variance. I think the first positive result was mainly a fluke. Even if it gives a marginal improvement, it also adds more complexity (how many words, which words, etc.). Domain-specific pre-training is essential for getting these models to perform well on specialised domains, but the extra words in the dictionary are just a tiny detail that is probably not worth the effort.

@peregilk One thing that I am confused about: does BERT work at the character level or the word level? What I mean is, does BERT break a word token further into characters during training and learn embeddings accordingly, or does it consider only the word tokens (made by the tokenizer)? I am asking because vocab.txt contains all the basic characters of my language, Urdu; by basic characters I mean the a, b, c characters of my language. Could anyone please enlighten me on this? Let's say we have new data, but that data is also made of those basic characters, right? If it were only about word tokens, then the English vocabulary is more than 120k words and the model vocab doesn't have all of them in it.

muhammadfahid51 avatar Jan 10 '20 05:01 muhammadfahid51

@muhammadfahid51 If I understand things correctly, BERT works at the token level. In addition, it learns representations for multi-token combinations.

Let's say we have the word "goodness", and say it does not exist in the vocabulary, but the following tokens do: "good", "ness", "##ness".

Since this is one word, it will be tokenized as "good" + "##ness". BERT will learn a representation for the combination ("good", "##ness") as well as embeddings for both "good" and "##ness". It is not as good as if "goodness" existed directly, but it is reasonable.

However, adding extra words is a bit double-edged. If you have a domain-specific vocabulary and add "goodness" to one of the empty spots, you will have to train it from a random weight. Both "good" and "##ness" already have OK embeddings, so even if the model has never seen "goodness" in the training set, it already has a reasonable representation to start from.

If you are training the entire network from scratch, it makes more sense to build a vocabulary that is as efficient as possible, i.e. one that requires as few tokens as possible.

I hope this answers your question.
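One crude way to compare how "efficient" a vocabulary is on your own corpus, assuming this repo's tokenization.py and a plain-text sample.txt: fewer WordPiece tokens per whitespace-separated word means better coverage of the domain.

```python
# Crude vocabulary-efficiency check; vocab_custom.txt is a hypothetical edited vocab.
import tokenization

def tokens_per_word(vocab_file, sample_file, do_lower_case=True):
    tok = tokenization.FullTokenizer(vocab_file=vocab_file, do_lower_case=do_lower_case)
    n_words = n_pieces = 0
    with open(sample_file, encoding="utf-8") as f:
        for line in f:
            n_words += len(line.split())
            n_pieces += len(tok.tokenize(line))
    return n_pieces / max(n_words, 1)

print("original vocab:", tokens_per_word("vocab.txt", "sample.txt"))
print("custom vocab:  ", tokens_per_word("vocab_custom.txt", "sample.txt"))
```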

peregilk avatar Jan 10 '20 10:01 peregilk

@muhammadfahid51 If I understand things correctly, BERT works at the token level. In addition, it learns representations for multi-token combinations.

Let's say we have the word "goodness", and say it does not exist in the vocabulary, but the following tokens do: "good", "ness", "##ness".

Since this is one word, it will be tokenized as "good" + "##ness". BERT will learn a representation for the combination ("good", "##ness") as well as embeddings for both "good" and "##ness". It is not as good as if "goodness" existed directly, but it is reasonable.

However, adding extra words is a bit double-edged. If you have a domain-specific vocabulary and add "goodness" to one of the empty spots, you will have to train it from a random weight. Both "good" and "##ness" already have OK embeddings, so even if the model has never seen "goodness" in the training set, it already has a reasonable representation to start from.

If you are training the entire network from scratch, it makes more sense to build a vocabulary that is as efficient as possible, i.e. one that requires as few tokens as possible.

I hope this answers your question.

@peregilk What if I want to pre-train on a custom language, say Turkish? Can I replace some other language's characters in vocab.txt with my own tokens?

Also, how much data is required to pre-train BERT from scratch?

muhammadfahid51 avatar Jan 10 '20 10:01 muhammadfahid51

@muhammadfahid51 Don't interpret any of this as "correct" answers. I am just another researcher struggling with the same issues.

You can use SentencePiece to build a vocabulary from scratch. SentencePiece will search your corpus and find the most efficient tokens.

If you want to start from pre-trained weights you have to use the same vocabulary. You can manipulate that vocabulary, but really only manipulating the open spots makes sense. In most cases it will be easier (and cheaper) to start with the multilingual pre-trained BERT and train on additional data in your target language than to train from scratch on a separate language (provided your language is among the 100 languages already covered by multilingual BERT). That is my experience.

I would say a reasonable corpus is 1B words. Multilingual BERT is trained on a bit less than that for each of its languages. The size of the training corpus and the training time are really the big issues. Fine-tuning the vocabulary is (IMHO) not really something you should spend too much time on.
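For reference, a minimal SentencePiece sketch for building a vocabulary from scratch; file names and sizes are placeholders, and, as the README warns, the output is not directly compatible with this repo's tokenization.py and would need to be converted into a WordPiece-style vocab.txt.

```python
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",         # one sentence per line
    model_prefix="urdu_sp",     # writes urdu_sp.model and urdu_sp.vocab
    vocab_size=32000,
    model_type="bpe",
    character_coverage=0.9995,  # helpful for languages with many characters
)
```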

peregilk avatar Jan 10 '20 11:01 peregilk

@muhammadfahid51 Take a look at this page: https://github.com/google-research/bert/blob/master/multilingual.md#list-of-languages

peregilk avatar Jan 10 '20 11:01 peregilk

I'm in the same boat: I need to add a larger vocabulary (not directly present in the original BERT vocab), but I also want to use the init checkpoint from the original BERT. When you do that, you quickly run into this: ValueError: Shape of variable bert/embeddings/word_embeddings:0 ((new_vocab_size, 768)) doesn't match with shape of tensor bert/embeddings/word_embeddings ([30522, 768]) from checkpoint reader. On some preliminary investigation, it seems a few code changes are required in run_pretraining.py to handle the case where the bert_config vocab size differs from the init checkpoint. As long as the order of the initial vocab terms is kept the same in the new vocab, it should be possible to initialize the known token weights from the pretrained weights and randomly initialize the rest of the network.
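One possible (untested) workaround along those lines is to resize the word-embedding table in the checkpoint itself: copy the pretrained rows, randomly initialise the new ones, and carry every other variable over unchanged. A TF1-style sketch, where the checkpoint paths and the new vocab size are example values:

```python
import numpy as np
import tensorflow as tf  # assumes TensorFlow 1.x, as used by this repo

OLD_CKPT = "bert_model.ckpt"
NEW_CKPT = "bert_model_resized.ckpt"
NEW_VOCAB_SIZE = 32522  # e.g. 30522 original tokens + 2000 new ones

reader = tf.train.load_checkpoint(OLD_CKPT)
values = {name: reader.get_tensor(name)
          for name in reader.get_variable_to_shape_map()
          if "adam" not in name}  # drop optimizer slots; they would also mismatch

old_emb = values["bert/embeddings/word_embeddings"]
new_emb = np.random.normal(0.0, 0.02, (NEW_VOCAB_SIZE, old_emb.shape[1])).astype(old_emb.dtype)
new_emb[: old_emb.shape[0]] = old_emb  # keep the pretrained rows at the same ids
values["bert/embeddings/word_embeddings"] = new_emb

with tf.Graph().as_default(), tf.Session() as sess:
    tf_vars = [tf.Variable(v, name=name) for name, v in values.items()]
    sess.run(tf.global_variables_initializer())
    tf.train.Saver(tf_vars).save(sess, NEW_CKPT)

# Then set "vocab_size" in bert_config.json to NEW_VOCAB_SIZE and point
# run_pretraining.py's --init_checkpoint at NEW_CKPT.
```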

dhruvsakalley avatar Feb 14 '20 22:02 dhruvsakalley

@peregilk Good afternoon, and thank you so much for your comprehensive responses. I would like to ask a small question. You say: "BERT will learn an embedding for ("good"-"##ness") as well as embeddings for both "good" and "##ness"." What do you mean by an embedding for ("good"-"##ness")? Perhaps I am mistaken, but I thought that, like any NLP model, BERT has embeddings only for single tokens. Do you mean BERT learns the interrelation between these two tokens in the attention layers, or does it have special embeddings for such cases? Thanks in advance!

Aktsvigun avatar Jun 05 '20 10:06 Aktsvigun

I'm in the same boat: I need to add a larger vocabulary (not directly present in the original BERT vocab), but I also want to use the init checkpoint from the original BERT. When you do that, you quickly run into this: ValueError: Shape of variable bert/embeddings/word_embeddings:0 ((new_vocab_size, 768)) doesn't match with shape of tensor bert/embeddings/word_embeddings ([30522, 768]) from checkpoint reader. On some preliminary investigation, it seems a few code changes are required in run_pretraining.py to handle the case where the bert_config vocab size differs from the init checkpoint. As long as the order of the initial vocab terms is kept the same in the new vocab, it should be possible to initialize the known token weights from the pretrained weights and randomly initialize the rest of the network.

@dhruvsakalley I like your idea and wonder if you managed to implement it. Can you share your experience, please?

boggis30 avatar Jun 10 '20 14:06 boggis30

@peregilk Can you tell me how to train the model after adding our words to vocab.txt? Is there example code for training BERT with an additional vocabulary?

ali4friends71 avatar Jun 26 '20 17:06 ali4friends71

@dhruvsakalley I am trying to do exactly what you describe. May I know whether you implemented it? Were you able to increase the vocab file? If yes, can you please share the code? Thanks.

SravaniSegireddy avatar Jul 17 '20 07:07 SravaniSegireddy

@SravaniSegireddy I implemented it, but it is of no use, as the words are split in two and the meanings change.

ali4friends71 avatar Jul 18 '20 05:07 ali4friends71

@ali4friends71 Could you please share your code, if possible? Maybe I can get some ideas from it. Thanks.

SravaniSegireddy avatar Jul 20 '20 13:07 SravaniSegireddy

@irhallac Let me post an update on my experiences with using vocab files during pretraining on a domain-specific corpus. As far as I know, the only reasonable way to test whether this works is to validate it by also fine-tuning the pretrained networks. You will have to do this multiple times before you get reliable results.

My initial experiments indicated that adding custom words to the vocab file had some effect. However, at least on my corpus, which can be described as "medical tweets", this effect just disappears after running the domain-specific pretraining for a while.

After spending quite some time on this, I have ended up dropping the custom vocab files totally. BERT seems to be able to learn these specialised words by tokenizing them.

@peregilk Does that mean you can pretrain the model on domain-specific data without changing the vocab file? May I know what accuracy improvement you achieved after pretraining on domain-specific data?

SravaniSegireddy avatar Jul 20 '20 13:07 SravaniSegireddy

Absolutely. Doing additional domain-specific pretraining is very effective. How effective will depend on your task and corpus.

There are lots of examples of its efficacy. Here is just one: https://arxiv.org/pdf/2005.07503

Fundamental changes to the vocab will make it impossible to continue from the pretrained weights. Unless you are training a completely new language and have lots of resources, this is probably just a bad idea.

Using the open spots is the only alternative. My experience is that on my corpus it has not been a very big deal. BERT learns new composite words very easily.


peregilk avatar Jul 20 '20 17:07 peregilk

@SravaniSegireddy you could use the code from the Colab notebook. Check out the article for further instructions.

ali4friends71 avatar Jul 21 '20 04:07 ali4friends71

But do we really need to manually add domain-specific (out-of-vocabulary) words? Isn't the purpose of word pieces that they can, in theory, construct new words by combining their pieces? And if so, don't the semantics of the word pieces change when doing downstream tasks?

timpal0l avatar Nov 02 '20 11:11 timpal0l

If I use new vocabulary, can I initialize its embeddings with the average obtained from their subword parts? And how can I introduce that into the model?
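That initialisation is straightforward to sketch with the Hugging Face transformers port (assumed here; not this repo's TF code): add the tokens, resize the embedding matrix, then overwrite each new row with the mean of the embeddings of its original WordPiece parts. The example words are placeholders.

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

new_words = ["anfield", "nephrectomy"]  # example domain terms
# Record the subword decomposition before the words become single tokens.
subword_ids = [tokenizer(w, add_special_tokens=False)["input_ids"] for w in new_words]

tokenizer.add_tokens(new_words)
model.resize_token_embeddings(len(tokenizer))

emb = model.get_input_embeddings().weight
with torch.no_grad():
    for word, ids in zip(new_words, subword_ids):
        # New row = average of the original WordPiece embeddings.
        emb[tokenizer.convert_tokens_to_ids(word)] = emb[ids].mean(dim=0)
```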

joancf avatar Dec 17 '20 10:12 joancf

@irhallac It is the [unusedXXX] tokens that can be replaced with any word you like. I am running some experiments on how effective this really is, but from my understanding you should prioritise words that are frequent in the domain-specific corpus, missing from vocab.txt, and for which the current tokenization is unlikely to be any good. You should also take into account that BERT is very good at tokenizing long words.

Let's say you have an English football-specific corpus. You notice that the word "footballs" is not in vocab.txt. It is, however, pointless to add it. BERT tokenizes it as "football" + "##s" -> [2375] [2016] (look at the line numbers in vocab.txt) and has already learned a very good representation both for the individual tokens and for the combination. However, your text is largely about football stadiums, and you see that "Anfield" is not in vocab.txt. It will be tokenized as "an" + "##field", and there is reason to believe that the current learned word embedding is not very useful.

If you add "anfield" to one of the unused spots in vocab.txt and then continue pre-training from the last checkpoint, this embedding will just start from random (think "0"), and the model might learn the word faster since it will not be confused by other uses of the tokens [an] and [##field]. This is my understanding of how this works.

It is important to remember that the line numbers in vocab.txt matter. For instance, "!" is at line #1000, and it should still be there after you edit the file. For pre-training from an existing checkpoint you should not change the size of the vocab file. This means you have around 1000 extra words at your disposal. For pre-training from scratch, build a new vocab based only on your corpus.

Since BERT does an excellent job of tokenising and learning these combinations, do not expect dramatic improvements from adding words to the vocab. In my experience, adding very specific terms, like common long medical Latin words, has some effect. Adding words like "footballs" will likely just have negative effects, since the current vector is already pretty good.

@peregilk thanks, that helps. To learn domain-specific word embeddings, any clues on the volume of domain-specific corpus needed, assuming we continue pretraining from the released model as the checkpoint?

nagads avatar Feb 01 '21 05:02 nagads

@nagads I understand your question, and I have gotten it several times before. I usually answer: "I'll tell you, if you first tell me what a boat costs!"

It really depends on a lot of things: how good you would like the model to be, how different the domain is from what the original model was trained on, how much data it is possible to get and at what cost, etc. There are a few tricks (like dynamic masking) that you can use to make the most out of your data.

In general, transformer models require A LOT of text. Always. However, domain-specific pre-training is probably the setting where you are able to get reasonable results with moderate amounts of data.

peregilk avatar Feb 03 '21 10:02 peregilk

@nagads I understand your question, and I have gotten it several times before. I usually answer: "I'll tell you, if you first tell me what a boat costs!"

It really depends on a lot of things: how good you would like the model to be, how different the domain is from what the original model was trained on, how much data it is possible to get and at what cost, etc. There are a few tricks (like dynamic masking) that you can use to make the most out of your data.

In general, transformer models require A LOT of text. Always. However, domain-specific pre-training is probably the setting where you are able to get reasonable results with moderate amounts of data.

@peregilk thanks for the insightful response.

nagads avatar Feb 04 '21 06:02 nagads

Pretrain from scratch or modify the first ~1000 lines of the vocab.txt file with the vocab you'd like to add.

In addition to these methods, we can add our own additional vocabulary by creating an embedding tensor for the additional tokens and concatenating it with the original embedding tensor for the tokens in the original vocab file. Specific details in https://github.com/google-research/bert/issues/82#issuecomment-921613967
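A minimal sketch of that concatenation idea (TF1-style, adapted from the linked discussion rather than any official code; the sizes are example values):

```python
import tensorflow as tf  # assumes TensorFlow 1.x

vocab_size, extra_vocab_size, hidden = 119547, 500, 768  # example sizes

original_table = tf.get_variable(  # restored via the usual init_checkpoint map
    "bert/embeddings/word_embeddings", [vocab_size, hidden])
extra_table = tf.get_variable(     # new and randomly initialised
    "bert/embeddings/extra_word_embeddings", [extra_vocab_size, hidden],
    initializer=tf.truncated_normal_initializer(stddev=0.02))

full_table = tf.concat([original_table, extra_table], axis=0)
# embedding_lookup(full_table, input_ids) then accepts ids in
# [0, vocab_size + extra_vocab_size); only original_table is covered by the
# checkpoint, so the extra rows are trained from scratch.
```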

Yiwen-Yang-666 avatar Sep 18 '21 02:09 Yiwen-Yang-666