I can't train a tokenizer from pre-tokenized word counts in Python
If I have a large text corpus I want to use for training a tokenizer, I'd rather not (and sometimes can't) move all the files to one machine for training. Instead, I'd like to use a distributed framework like pyspark to generate word counts using my chosen pre-tokenizer, and then provide these word counts to a trainer.
To enable this, could we add a new train_from_counts method to the Python trainers that accepts a dictionary or list of (word, frequency) tuples in addition to the train and train_from_iterator methods?
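For reference, a rough sketch of the workflow being requested. The PySpark calls and `pre_tokenize_str` are real APIs, but `train_from_counts` is the hypothetical method this issue asks for and does not exist in the library:

```python
# Sketch only: train_from_counts is the hypothetical method requested here,
# it does NOT exist in the library. The PySpark and pre-tokenizer calls are real.
from pyspark.sql import SparkSession
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

def count_words(rows):
    # Build the pre-tokenizer on the worker to avoid shipping Rust objects around.
    pre_tok = pre_tokenizers.Whitespace()
    for row in rows:
        for word, _offsets in pre_tok.pre_tokenize_str(row.value):
            yield (word, 1)

spark = SparkSession.builder.getOrCreate()
counts = (
    spark.read.text("hdfs:///corpus/*.txt").rdd   # path is just an example
    .mapPartitions(count_words)
    .reduceByKey(lambda a, b: a + b)
    .collect()                                    # list of (word, frequency) tuples
)

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
trainer = trainers.BpeTrainer(vocab_size=30000, special_tokens=["[UNK]"])
# The requested (hypothetical) API, roughly:
# tokenizer.train_from_counts(counts, trainer=trainer)
```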
I personally don't see any issue, but this will work only with BPE, so it won't be a general method. Would you want to tackle this PR?
Pinging @n1t0 for more insights.
@Narsil I would love to also see this for Unigram, but instead of word counts, you just provide words (i.e. the seed vocabulary). So I think you could generalise it to some degree, i.e. models can take some seed or compute it themselves, but the seed is model dependent?
Yes, the seed is model dependent, and I think that opens up the encapsulation of this library a bit too much. In particular, maintaining it across all the bindings would be quite tedious (and it prevents some use of clever typing should we need it internally).
- Is it because preprocessing is too slow in some way that you want to provide your own seeds? Or do you want to cache them somehow?
- Ideally we need to identify the core need for this; there could be a different way than just exposing these functions that would solve the problem and keep the "implementation detail" of the "seed" hidden from users (so it can evolve as needed).
The reason I want to do it is the "technically wrong" way that the SentencePiece algorithm enumerates all the common substrings (discussed here).
Specifically, the suffix-array (SA) algorithm requires each string to be separated by a distinct EOS character, or it doesn't work properly: it doesn't return all the longest common substrings of the individual words while respecting word boundaries. For my corpus, this massively skews the statistics for tokens that occur at the start/end of sentences.
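A tiny, hedged illustration (not the actual SentencePiece/suffix-array code) of why the separators matter: concatenating strings without a unique boundary marker lets substrings that straddle a boundary look like valid candidates.

```python
# Illustration only, not the real suffix-array implementation.
words = ["lower", "newer"]

no_sep = "".join(words)        # "lowernewer"
with_sep = "\x00".join(words)  # unique boundary marker between strings

# "ern" spans the boundary between "lower" and "newer": it never occurs
# inside a single word, but without separators it shows up as a substring.
print("ern" in no_sep)    # True  -> spurious cross-boundary candidate
print("ern" in with_sep)  # False -> boundary respected
```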
Is there any way to train_from_word_counts as of now?
Still not possible at the moment. You can open a PR if you wish.
The main point of concern would be that this would entirely bypass the pre_tokenizer, normalizer, post_processor and the like, meaning the resulting vocabulary could be inconsistent with those components and lead to odd tokenization.
If the function comes with a clear warning about this (at least in the docs), it could be included.
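To make the concern concrete, here is a minimal hedged sketch (using WordLevel instead of BPE for brevity) of how a vocabulary built from externally computed counts can become unreachable when the counts skip the tokenizer's normalizer:

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers

# Pretend this vocabulary came from externally computed word counts that
# were NOT passed through the tokenizer's normalizer (capitalization kept).
vocab = {"[UNK]": 0, "Hello": 1, "world": 2}
tokenizer = Tokenizer(models.WordLevel(vocab=vocab, unk_token="[UNK]"))
tokenizer.normalizer = normalizers.Lowercase()
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# At encode time the input is lowercased, so the "Hello" entry is unreachable
# and the first token falls back to "[UNK]".
print(tokenizer.encode("Hello world").tokens)  # ['[UNK]', 'world']
```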