minbpe
Loading data from disk partially
Training the tokenizer is memory intensive; it can need hundreds of GB of RAM. What about using memmap to load only the required portion of the data from disk? Since access is mostly sequential in nature, it would be significantly faster than random access. This would significantly reduce the memory required, at the cost of some training time.
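A minimal sketch of the memmap idea: stream the training file from disk in fixed-size sequential chunks instead of holding it all in RAM. The file name `corpus.txt`, the helper `iter_chunks`, and the chunk size are all hypothetical, not part of minbpe.

```python
import mmap
from collections import Counter

def iter_chunks(path, chunk_size=64 * 1024 * 1024):
    """Yield the file's bytes in sequential chunks via a memory map."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for start in range(0, len(mm), chunk_size):
                # Slicing the mmap pages in only the touched region,
                # so resident memory stays near chunk_size.
                yield mm[start:start + chunk_size]

# Example: count byte-pair frequencies without loading the whole file.
# (Pairs straddling chunk boundaries are dropped in this sketch.)
counts = Counter()
for chunk in iter_chunks("corpus.txt"):
    counts.update(zip(chunk, chunk[1:]))
```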
Yeah definitely, an optimized version of the code (that does not yet exist) would absolutely have to worry about this.
The approach would be to load a part of the txt file (depending on the RAM available), write the merged tokens to another file, and replace the earlier version, roughly as in the sketch below.
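A hedged sketch of that chunked merge-and-rewrite loop, assuming the tokens are stored on disk as a flat little-endian uint16 array (an assumption for illustration; it caps the vocab at 65536). The function name `merge_file` and the chunk size are hypothetical.

```python
import os
import struct

def merge_file(path, pair, new_id, chunk_tokens=1_000_000):
    """Rewrite `path`, replacing every occurrence of `pair` with `new_id`.

    Assumes the file is a well-formed stream of little-endian uint16 ids.
    """
    tmp = path + ".tmp"
    with open(path, "rb") as src, open(tmp, "wb") as dst:
        carry = []  # token held over in case a pair straddles two chunks
        while True:
            data = src.read(chunk_tokens * 2)
            if not data:
                break
            tokens = carry + list(struct.unpack(f"<{len(data)//2}H", data))
            out, i = [], 0
            while i < len(tokens) - 1:
                if (tokens[i], tokens[i + 1]) == pair:
                    out.append(new_id)
                    i += 2
                else:
                    out.append(tokens[i])
                    i += 1
            carry = tokens[i:]  # zero or one trailing token
            dst.write(struct.pack(f"<{len(out)}H", *out))
        dst.write(struct.pack(f"<{len(carry)}H", *carry))
    os.replace(tmp, path)  # atomically replace the earlier version
```

Carrying the last unconsumed token between chunks keeps merges correct across chunk boundaries, and `os.replace` makes the swap atomic on the same filesystem.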