tokenizers

Improve BPE Training without PreTokenizer

Open n1t0 opened this issue 5 years ago • 2 comments

If we don't set a PreTokenizer, the BPE algorithm runs on non-segmented text. This usually works fine (just slower) when processing files with short lines, since the line breaks provide some pre-segmentation and limit the size of the words.

But in some cases (e.g. a JSON file formatted on a single line), we end up with extremely long words, and training the BPE takes so long that it isn't practical.

We should add a word length limit with a default value high enough to work in most cases, while letting the user know when training is likely to take forever.
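For reference, a minimal sketch of the usual workaround with the Python bindings (assuming a Whitespace pre-tokenizer fits the use case; "data.json" and the vocab size are placeholders):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# With a pre-tokenizer attached, BPE only ever sees whitespace-delimited
# words, so a single-line JSON file no longer produces one huge "word".
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(vocab_size=30_000, special_tokens=["[UNK]"])
tokenizer.train(files=["data.json"], trainer=trainer)  # "data.json" is a placeholder
```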

n1t0 · Nov 03 '20 14:11

As the person who triggered this issue, may I suggest that issuing a warning instead might be easier and more useful? I just had no idea because I was copy-pasting.

default value high enough to work in most cases

Sounds like a way to inflict torture on one poor soul some time in the future. What is "high enough"?

Instead, maybe check how long the first few lines are if a pre-tokenizer wasn't set, and issue a warning along the lines of:

Hey. You didn't use a pre-tokenizer but your lines are long. Training is going to be slow AF. If you know what you're doing, keep at it; otherwise kill this process and start again with a pre-tokenizer.
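A minimal sketch of such a check (a hypothetical helper, not part of the library; the function name, line count, and 10,000-character threshold are arbitrary placeholders):

```python
import warnings

# Hypothetical helper: peek at the first few lines of the training files
# and warn if no pre-tokenizer is set and the lines are very long.
def warn_if_lines_too_long(tokenizer, files, max_len=10_000, n_lines=5):
    if tokenizer.pre_tokenizer is not None:
        return
    for path in files:
        with open(path, encoding="utf-8") as f:
            for _, line in zip(range(n_lines), f):
                if len(line) > max_len:
                    warnings.warn(
                        "No pre-tokenizer is set and the input contains very "
                        "long lines; BPE training may be extremely slow. "
                        "Consider setting a pre-tokenizer."
                    )
                    return
```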

talolard · Nov 04 '20 18:11

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] · May 10 '24 01:05