I can't train a tokenizer from pre-tokenized word counts in Python
If I have a large text corpus I want to use for training a tokenizer, I'd rather not (and sometimes can't) move all the files to one machine for training. Instead, I'd like to use a distributed framework like pyspark to generate word counts using my chosen pre-tokenizer, and then provide these word counts to a trainer.
To enable this, could we add a new train_from_counts method to the Python trainers that accepts a dictionary or list of (word, frequency) tuples in addition to the train and train_from_iterator methods?
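For reference, a rough sketch of the workflow being requested. The PySpark calls and `pre_tokenize_str` are real APIs, but `train_from_counts` is the hypothetical method this issue asks for and does not exist in the library:

```python
# Sketch only: train_from_counts is the hypothetical method requested here,
# it does NOT exist in the library. The PySpark and pre-tokenizer calls are real.
from pyspark.sql import SparkSession
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

def count_words(rows):
    # Build the pre-tokenizer on the worker to avoid shipping Rust objects around.
    pre_tok = pre_tokenizers.Whitespace()
    for row in rows:
        for word, _offsets in pre_tok.pre_tokenize_str(row.value):
            yield (word, 1)

spark = SparkSession.builder.getOrCreate()
counts = (
    spark.read.text("hdfs:///corpus/*.txt").rdd   # path is just an example
    .mapPartitions(count_words)
    .reduceByKey(lambda a, b: a + b)
    .collect()                                    # list of (word, frequency) tuples
)

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
trainer = trainers.BpeTrainer(vocab_size=30000, special_tokens=["[UNK]"])
# The requested (hypothetical) API, roughly:
# tokenizer.train_from_counts(counts, trainer=trainer)
```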
I personally don't see any issue, but this will work only with BPE, so it won't be a general method. Would you want to tackle this PR?
Pinging @n1t0 for more insights.
@Narsil I would love to also see this for Unigram, but instead of word counts, you just provide words (i.e. the seed vocabulary). So I think you could generalise it to some degree, i.e. models can take some seed or compute it themselves, but the seed is model dependent?
Yes, the seed is model dependent, and I think that opens up the encapsulation of this library a bit too much. In particular, maintaining it across all the bindings would be quite tedious (and it prevents some use of clever typing should we need it internally).
- Is it because preprocessing is too slow in some way that you want to provide your own seeds? Or do you want to cache them somehow?
- Ideally we need to identify the core need for this; there could be a different way than just exposing these functions that would solve the problem and keep the "implementation detail" of the "seed" hidden from users (so it can evolve as needed).
The reason I want to do it is the "technically wrong" way that the SentencePiece algorithm enumerates all the common substrings (discussed here).
Specifically, the suffix-array (SA) algorithm requires each string to be separated by a distinct EOS character, or it doesn't work properly: it doesn't return all the longest common substrings of the individual words while respecting word boundaries. For my corpus, this massively skews the statistics for tokens that occur at the start/end of sentences.
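A tiny, hedged illustration (not the actual SentencePiece/suffix-array code) of why the separators matter: concatenating strings without a unique boundary marker lets substrings that straddle a boundary look like valid candidates.

```python
# Illustration only, not the real suffix-array implementation.
words = ["lower", "newer"]

no_sep = "".join(words)        # "lowernewer"
with_sep = "\x00".join(words)  # unique boundary marker between strings

# "ern" spans the boundary between "lower" and "newer": it never occurs
# inside a single word, but without separators it shows up as a substring.
print("ern" in no_sep)    # True  -> spurious cross-boundary candidate
print("ern" in with_sep)  # False -> boundary respected
```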
Is there any way to train_from_word_counts as of now?
Still not possible at the moment. You can open a PR if you wish.
The main point of concern would be that this would entirely bypass the pre_tokenizer, normalizer, post_processor and the like, meaning the resulting vocabulary could be inconsistent with those components and lead to odd tokenization.
If the function comes with a clear warning about this (at least in the docs), it could be included.
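To make the concern concrete, here is a minimal hedged sketch (using WordLevel instead of BPE for brevity) of how a vocabulary built from externally computed counts can become unreachable when the counts skip the tokenizer's normalizer:

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers

# Pretend this vocabulary came from externally computed word counts that
# were NOT passed through the tokenizer's normalizer (capitalization kept).
vocab = {"[UNK]": 0, "Hello": 1, "world": 2}
tokenizer = Tokenizer(models.WordLevel(vocab=vocab, unk_token="[UNK]"))
tokenizer.normalizer = normalizers.Lowercase()
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# At encode time the input is lowercased, so the "Hello" entry is unreachable
# and the first token falls back to "[UNK]".
print(tokenizer.encode("Hello world").tokens)  # ['[UNK]', 'world']
```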