Deduplicate text chunks with frequency counts: 5x speedup in training and encoding
In RegexTokenizer, the training text is first split into chunks, and further processing is performed on each chunk individually. This PR optimizes the process by retaining only the unique chunks together with their frequency counts. In practice this cuts the number of chunks to roughly 1/7th of the original, resulting in a training speedup of at least 5x.
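For illustration, here is a minimal, self-contained sketch of the idea (not the actual minbpe code): the split pattern below is a simplified stand-in for the real GPT-4 pattern, and names like `train_deduplicated` and `_merge` are hypothetical. Pair counts are accumulated once per unique chunk and weighted by that chunk's frequency.

```python
from collections import Counter
import regex as re  # the `regex` package supports \p{...} classes used by the split patterns

# hypothetical, simplified split pattern (the actual GPT-4 pattern is longer)
SPLIT_PATTERN = re.compile(r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")

def _merge(ids, pair, idx):
    # replace every occurrence of `pair` in `ids` with the new token `idx`
    out, j = [], 0
    while j < len(ids):
        if j < len(ids) - 1 and (ids[j], ids[j + 1]) == pair:
            out.append(idx)
            j += 2
        else:
            out.append(ids[j])
            j += 1
    return out

def train_deduplicated(text, vocab_size):
    # split into chunks, then keep only the unique chunks with their frequency counts
    chunk_counts = Counter(SPLIT_PATTERN.findall(text))
    ids_counts = [(list(ch.encode("utf-8")), n) for ch, n in chunk_counts.items()]

    merges = {}
    for i in range(vocab_size - 256):
        # count pairs once per unique chunk, weighted by how often the chunk occurs
        stats = Counter()
        for ids, n in ids_counts:
            for pair in zip(ids, ids[1:]):
                stats[pair] += n
        if not stats:
            break
        pair = max(stats, key=stats.get)
        idx = 256 + i
        merges[pair] = idx
        # apply the merge only to the unique chunks; frequencies are unaffected
        ids_counts = [(_merge(ids, pair, idx), n) for ids, n in ids_counts]
    return merges
```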
A similar optimization applies to encode_ordinary(): the tokenization of each chunk string is cached, which likewise gives about a 5x speedup.
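Again as an illustrative sketch (reusing `SPLIT_PATTERN` and `_merge` from above; the function names and the module-level cache are hypothetical, not minbpe's API), identical chunks are tokenized once and the result is reused:

```python
_chunk_cache = {}  # chunk string -> list of token ids

def encode_ordinary_cached(text, merges):
    # repeated chunks (very common in natural text) are tokenized only once
    out = []
    for chunk in SPLIT_PATTERN.findall(text):
        if chunk not in _chunk_cache:
            _chunk_cache[chunk] = _encode_chunk(list(chunk.encode("utf-8")), merges)
        out.extend(_chunk_cache[chunk])
    return out

def _encode_chunk(ids, merges):
    # greedily apply the earliest-learned merge, as in a standard BPE encode loop
    while len(ids) >= 2:
        pair = min(zip(ids, ids[1:]), key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break
        ids = _merge(ids, pair, merges[pair])
    return ids
```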
Good idea, and I was able to reproduce this. I'll think about how I might create an optimized version (still in Python land) that prioritizes speed a bit more over simplicity / readability, as a kind of middle ground.
I made a pure-Python optimized version of this repo called BatchBPE that is thousands of times faster, can process Parquet files, and scales to, for example, tokenizing all of FineWeb 10B on an M1. @karpathy, would you like me to make a pull request for a "notable fork"? It departs far enough from the educational purpose of this repo that I didn't open an ordinary pull request for the changes. The primary speed-ups are due to:
- Using a dictionary as suggested in the present PR
- Using a new approach to merging that looks ahead for "safe merges" and applies them in batches, so the round-trip tokenization is unchanged even at an average batch size of around 200 merges on very large datasets (a rough sketch of the idea follows below this list)
- Fixing a performance bug in minbpe.
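As a rough, hypothetical sketch of what "safe" batch merging could look like (this is my reading of the idea, not BatchBPE's actual implementation): select the most frequent pairs whose token ids are mutually disjoint, so that applying the whole batch in one pass yields the same tokenization as applying those merges one at a time.

```python
def pick_safe_batch(stats, max_batch=200):
    # greedily take the most frequent pairs whose token ids are disjoint from all
    # pairs already selected; disjoint pairs cannot overlap in the token stream,
    # so applying them together reproduces the result of sequential merging
    batch, used = [], set()
    for pair, _count in sorted(stats.items(), key=lambda kv: -kv[1]):
        if used.intersection(pair):
            continue  # this pair would interact with an earlier merge in the batch
        batch.append(pair)
        used.update(pair)
        if len(batch) == max_batch:
            break
    return batch
```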