Per E Kummervold
@irhallac It is the [unusedXXX] tokens that can be replaced with any word you like. I am running some experiments on how effective this really is, but from my understanding you...
They come in two blocks. From line #2 [unused0] to line #100 [unused98]. Then there are 4 tokens that absolutely should not be changed: [UNK] [CLS] [SEP] [MASK]. Then they continue from line #105 to line...
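For anyone who wants to try this, here is a minimal sketch of how the [unused] slots could be swapped for domain words by rewriting vocab.txt. The file names and the word list are just placeholders, not something from this thread:

```python
# Sketch: replace [unusedX] tokens in a BERT vocab.txt with domain-specific words.
# Assumes a plain-text vocab file with one token per line; paths and the word
# list below are placeholders.
domain_words = ["immunotherapy", "cytokine", "biomarker"]  # hypothetical examples

with open("vocab.txt", encoding="utf-8") as f:
    vocab = [line.rstrip("\n") for line in f]

existing = set(vocab)
replacements = iter(w for w in domain_words if w not in existing)

for i, token in enumerate(vocab):
    # Only touch the [unusedX] placeholders; never the special tokens
    # ([UNK], [CLS], [SEP], [MASK]) mentioned above.
    if token.startswith("[unused"):
        try:
            vocab[i] = next(replacements)
        except StopIteration:
            break

with open("vocab_domain.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(vocab) + "\n")
```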
OK, I did not know that. Then it is only the uncased version that has 1000 unused slots.
@irhallac Let me post an update on my experiences with using vocab files during pretraining on a domain-specific corpus. As far as I know, the only reasonable way to test...
I did a few more tests on this (as I mentioned in another post). I am no longer convinced by my own results. The challenge is that fine-tuning has a...
@muhammadfahid51 If I understand things correctly, BERT works at the token level. In addition, it learns multi-token embeddings. Let's say we have the word "goodness", and let's say this does not exist...
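To make the "goodness" example concrete, here is a quick way to see how a WordPiece vocabulary splits a word. This sketch uses the Hugging Face transformers tokenizer as a stand-in, which is my own assumption and not part of the original BERT repo:

```python
# Sketch: show how WordPiece splits a word into subword pieces.
# The exact split depends on the vocabulary, so "goodness" may end up as a
# single token or as several "##" pieces.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("goodness"))        # e.g. ['goodness'] or ['good', '##ness']
print(tokenizer.tokenize("immunotherapy"))   # likely split into several ## pieces
```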
@muhammadfahid51 Don't interpret any of this as "correct" answers. I am just another researcher struggling with the same issues. You can use SentencePiece to build any vocabulary from scratch. SentencePiece...
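Here is a minimal sketch of training a vocabulary from scratch with SentencePiece. The file names, vocab size and model type are placeholders:

```python
# Sketch: train a SentencePiece model on your own corpus to build a vocabulary
# from scratch. "corpus.txt", the vocab size and the model type are placeholders.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",        # one sentence per line
    model_prefix="domain_sp",  # writes domain_sp.model and domain_sp.vocab
    vocab_size=32000,
    model_type="unigram",      # "bpe" is the other common choice
)

sp = spm.SentencePieceProcessor(model_file="domain_sp.model")
print(sp.encode("goodness", out_type=str))
```

As far as I know, the resulting .vocab file is not a drop-in vocab.txt for BERT; it would still need to be converted to the WordPiece format.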
@muhammadfahid51 Take a look at this page: https://github.com/google-research/bert/blob/master/multilingual.md#list-of-languages
Absolutely. Doing additional domain-specific pretraining is very effective. How effective it is will depend on your task and corpus. There are lots of examples of its efficacy. Here is just one: https://arxiv.org/pdf/2005.07503...
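For reference, a rough sketch of what additional domain-specific pretraining can look like if you use the Hugging Face transformers Trainer instead of the original run_pretraining.py. Everything here, including the corpus file name and the hyperparameters, is an illustration and not taken from the paper above:

```python
# Sketch: continue masked-language-model pretraining of an existing BERT
# checkpoint on a domain corpus. Paths and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model = BertForMaskedLM.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# One document or sentence per line in domain_corpus.txt (placeholder name).
dataset = load_dataset("text", data_files={"train": "domain_corpus.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"],
)

# The collator handles dynamic masking of 15% of the tokens.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-domain", num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()
```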
@nagads I understand your question, and I have gotten it several times before. I usually answer it with "I'll tell you, if you can first tell me what a boat costs!". It...