
Loading data from disk partially

Open kathir-ks opened this issue 1 year ago • 2 comments

Training the tokenizer is memory intensive: it can need hundreds of GBs of RAM. What about using memmap to load only the required portion of the data from disk? Since the access is mostly sequential in nature, it would be significantly faster than random access. This would significantly reduce the memory required, at the cost of longer training time.
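The memmap idea above can be sketched as follows. This is a hypothetical helper, not code from minbpe: it maps a file of raw bytes and counts adjacent byte-pair frequencies (the first step of BPE training) chunk by chunk, so only a small window of the file is resident in RAM at a time. The names `count_pairs_memmap` and `chunk_size` are assumptions for illustration.

```python
import os
import numpy as np
from collections import Counter

def count_pairs_memmap(path, chunk_size=1 << 20):
    """Count adjacent byte-pair frequencies without loading the whole file.

    Hypothetical sketch of the memmap approach, not minbpe's actual code.
    """
    n = os.path.getsize(path)
    data = np.memmap(path, dtype=np.uint8, mode="r")
    counts = Counter()
    for start in range(0, n, chunk_size):
        # Overlap chunks by one byte so pairs spanning a boundary are counted.
        chunk = np.asarray(data[start:min(start + chunk_size + 1, n)])
        pairs = np.stack([chunk[:-1], chunk[1:]], axis=1)
        # Sequential reads through the mapping keep disk access fast.
        uniq, freq = np.unique(pairs, axis=0, return_counts=True)
        for (a, b), c in zip(uniq.tolist(), freq.tolist()):
            counts[(a, b)] += c
    return counts
```

Because each chunk overlaps its successor by exactly one byte, no pair is missed at a boundary and none is double-counted.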

kathir-ks avatar Feb 17 '24 17:02 kathir-ks

Yeah definitely, an optimized version of the code (that does not yet exist) would absolutely have to worry about this.

karpathy avatar Feb 17 '24 17:02 karpathy

The approach would be to load a part of the txt file (sized to the RAM available), write the merged pairs to another file, and then replace the earlier version with it.
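The merge-and-replace approach described above could look something like this. All names here (`merge_pair_on_disk`, the one-token-per-line file format) are hypothetical assumptions for illustration, not code that exists in minbpe: the token stream is read from disk, one occurrence of the given pair at a time is merged into a new token id, the result is written to a temporary file, and the temporary file atomically replaces the earlier version.

```python
import os

def merge_pair_on_disk(path, pair, new_id):
    """Apply one BPE merge to a token stream stored on disk.

    Hypothetical sketch of the commenter's idea, not minbpe code.
    Tokens are stored one integer per line for simplicity; a binary
    format would be more compact in practice.
    """
    tmp_path = path + ".tmp"
    with open(path) as src, open(tmp_path, "w") as dst:
        prev = None
        for line in src:
            tok = int(line)
            if prev is not None and (prev, tok) == pair:
                dst.write(f"{new_id}\n")  # merge the pair into one token
                prev = None               # both tokens are consumed
            else:
                if prev is not None:
                    dst.write(f"{prev}\n")
                prev = tok
        if prev is not None:
            dst.write(f"{prev}\n")  # flush the last pending token
    os.replace(tmp_path, path)  # replace the earlier version
```

Peak memory here is one line of text, regardless of file size; the trade-off is one full read and write of the file per merge, which is where the extra training time comes from.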

kathir-ks avatar Feb 17 '24 17:02 kathir-ks