Per E Kummervold

58 comments of Per E Kummervold

@irhallac It is the [unusedXXX] tokens that can be replaced with any word you like. I am running some experiments on how effective this really is, but from my understanding you...
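
For anyone who wants to try it, here is a minimal sketch of that replacement, assuming a standard WordPiece vocab.txt; the file names and the domain terms below are just placeholders:

```python
# Minimal sketch: overwrite [unusedN] placeholders in a WordPiece vocab.txt
# with your own domain terms. File names and the term list are placeholders.
domain_terms = ["troponin", "myocarditis", "tachycardia"]  # hypothetical words

with open("vocab.txt", encoding="utf-8") as f:
    vocab = f.read().splitlines()

terms = iter(domain_terms)
for i, token in enumerate(vocab):
    if token.startswith("[unused"):
        try:
            vocab[i] = next(terms)  # replace the placeholder in place
        except StopIteration:
            break  # no more domain terms to insert

with open("vocab_custom.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(vocab) + "\n")
```

The point is that the token ids and the vocab size stay the same, so the checkpoint still loads; only the surface strings mapped to those ids change.
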

They come in two blocks: from line #2 ([unused0]) to line #100 ([unused98]). Then there are 4 tokens that absolutely should not be changed: [UNK], [CLS], [SEP], [MASK]. Then they continue from line #105 to line...
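
To double-check this layout before editing anything, a quick scan of the vocab file works; note that the line numbers above are 1-based, while the indices printed here are 0-based:

```python
# Minimal sketch: report where the [unused*] tokens and the special tokens
# sit in a WordPiece vocab.txt before overwriting anything.
SPECIAL = {"[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"}

with open("vocab.txt", encoding="utf-8") as f:
    vocab = f.read().splitlines()

unused = [i for i, tok in enumerate(vocab) if tok.startswith("[unused")]
special = {tok: i for i, tok in enumerate(vocab) if tok in SPECIAL}

print("special tokens (0-based index):", special)
print("number of [unused*] tokens:", len(unused))
if unused:
    print("first and last [unused*] index:", unused[0], unused[-1])
```
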

OK, I did not know that. Then it is only the uncased version that has 1000 unused spots.

@irhallac Let me post an update on my experience with using vocab files during pretraining on a domain-specific corpus. As far as I know, the only reasonable way to test...

I did a few more tests on this (as I mentioned in another post). I am no longer convinced by my own results. The challenge is that fine-tuning has a...

@muhammadfahid51 If I understand things correctly, BERT works at the token level. In addition, it learns multi-token embeddings. Let's say we have the word "goodness". Let's say this does not exist...
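
To make the token-level point concrete, here is a small sketch using the Hugging Face tokenizer (not the original BERT repo code); the exact split depends entirely on which vocab you load, and the example words are arbitrary:

```python
# Minimal sketch: how WordPiece splits words into sub-word pieces.
# Requires the `transformers` package; the model name is just an example.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.tokenize("goodness"))      # may already be a single token
print(tokenizer.tokenize("myocarditis"))   # a rarer word is split into pieces
# pieces after the first carry the "##" continuation prefix
```
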

@muhammadfahid51 Don't interpret any of this as "correct" answers. I am just another researcher struggling with the same issues. You can use SentencePiece to build any vocabulary from scratch. SentencePiece...
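
As a rough illustration of building a vocabulary from scratch with SentencePiece (the corpus file name, vocab size and model type below are placeholders, and note that the output is not directly a WordPiece vocab.txt for BERT):

```python
# Minimal sketch: train a SentencePiece model on your own corpus.
# "domain_corpus.txt" and the hyperparameters are placeholders.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="domain_corpus.txt",  # plain text, one sentence per line
    model_prefix="domain_sp",   # writes domain_sp.model and domain_sp.vocab
    vocab_size=30000,
    model_type="unigram",       # or "bpe"
)

sp = spm.SentencePieceProcessor(model_file="domain_sp.model")
print(sp.encode("some domain specific sentence", out_type=str))
```
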

@muhammadfahid51 Take a look at this page: https://github.com/google-research/bert/blob/master/multilingual.md#list-of-languages

Absolutely. Doing additional domain-specific pretraining is very effective. How effective it is will depend on your task and corpus. There are lots of examples of its efficacy. Here is just one example: https://arxiv.org/pdf/2005.07503...
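
For reference, a minimal sketch of what such additional pretraining can look like with the Hugging Face Trainer, which is not the original BERT pretraining code; the corpus file, model name and hyperparameters are placeholders:

```python
# Minimal sketch: continue masked-language-model pretraining of an existing
# BERT checkpoint on a domain corpus. All names and settings are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

dataset = load_dataset("text", data_files={"train": "domain_corpus.txt"})
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

train_data = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(output_dir="bert-domain",
                         num_train_epochs=1,
                         per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args,
                  train_dataset=train_data, data_collator=collator)
trainer.train()
```
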

@nagads I understand your question, and I have gotten it several times before. I usually answer it with "I'll tell you, if you can first tell me what a boat costs!". It...