Adding new terms into pre-trained model vocab | Issue in tokenizing OOV keywords

Open spate141 opened this issue 6 years ago • 0 comments

I've trained a tokenizer with a 50k vocab on over 500M sentences. I now need to encode many keywords that contain OOV tokens, and the tokenizer does a poor job on them. Would it be possible to add an option that lets users modify the vocab after the tokenizer is trained? I've seen the issue where the suggestion was to train the tokenizer on data that contains these OOV terms a sufficient number of times (1000?), so that the tokenizer picks them up during training and adds them to the vocab. The problem is that there is no reliable way to know in advance which terms need to be included in the training data. Any thoughts on how to handle such situations?

model.encode([
    '1997',
    '1998',
    '1996',
    '1999',
    '1994'
])

This generates the following tokens, identical for every input:

[
    [137, 1],
    [137, 1],
    [137, 1],
    [137, 1],
    [137, 1]
]
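
To make the request concrete, here is a minimal sketch of the behaviour being asked for. It is not an existing feature of the library. The class name `ExtendedVocabEncoder`, the helper `fake_base_encode`, and the choice to start new IDs at the base vocab size (assumed to be 50000) are all hypothetical, and the sketch only handles inputs that match an added term exactly.

```python
from typing import Callable, Dict, Iterable, List


class ExtendedVocabEncoder:
    """Hypothetical wrapper: reserves one extra ID per added keyword."""

    def __init__(
        self,
        base_encode: Callable[[List[str]], List[List[int]]],
        base_vocab_size: int,
        extra_terms: Iterable[str],
    ) -> None:
        self.base_encode = base_encode
        # New IDs start right after the trained vocab so they never collide
        # with IDs the trained tokenizer can produce.
        self.extra_ids: Dict[str, int] = {}
        for term in extra_terms:
            if term not in self.extra_ids:
                self.extra_ids[term] = base_vocab_size + len(self.extra_ids)

    def encode(self, texts: List[str]) -> List[List[int]]:
        # Inputs that exactly match an added term get that term's single ID;
        # everything else is passed to the trained tokenizer unchanged.
        pending = [t for t in texts if t not in self.extra_ids]
        base_out = iter(self.base_encode(pending)) if pending else iter([])
        return [
            [self.extra_ids[t]] if t in self.extra_ids else next(base_out)
            for t in texts
        ]


def fake_base_encode(texts: List[str]) -> List[List[int]]:
    # Stand-in for model.encode (assumption): mimics the collapsed
    # output shown above, where every input maps to [137, 1].
    return [[137, 1] for _ in texts]


encoder = ExtendedVocabEncoder(
    fake_base_encode,
    base_vocab_size=50_000,  # assumed size of the trained vocab
    extra_terms=['1997', '1998', '1996', '1999', '1994'],
)
ids = encoder.encode(['1997', '1998', 'hello'])
print(ids)  # [[50000], [50001], [137, 1]]
```

With something like this, each year would get its own ID instead of collapsing to the same pair. Any downstream model would still need its embedding table extended to cover the new IDs, and keywords appearing inside longer sentences would need extra matching logic that this sketch leaves out.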

Feb 24 '20 21:02 spate141