YouTokenToMe

Tokenizing large corpus

quetz opened this issue 4 years ago · 2 comments

Right now the tokenizer loads the whole corpus into memory, which becomes an issue for large files.

Is it possible to read the corpus file line by line, or split it in some other way (while still training on it as a whole)?

quetz · Nov 01 '20 00:11

No, there is no easy way to do it.

If the training data is so large that it does not fit into memory, you can most likely subsample random sentences; this won't significantly affect the quality.

xbelonogov · Nov 01 '20 09:11
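
A minimal sketch of the subsampling approach suggested above, using reservoir sampling so the full corpus is never loaded into memory. The file paths, sample size, and vocabulary size are hypothetical; the training call follows the documented youtokentome API.

```python
import random
import youtokentome as yttm

def reservoir_sample_lines(path, n_samples, seed=42):
    """Pick n_samples random lines from a file in a single streaming pass."""
    rng = random.Random(seed)
    reservoir = []
    with open(path, "r", encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i < n_samples:
                reservoir.append(line)
            else:
                # Replace an existing element with decreasing probability (Algorithm R).
                j = rng.randint(0, i)
                if j < n_samples:
                    reservoir[j] = line
    return reservoir

# Subsample the large corpus and write the sample to a smaller file.
sample = reservoir_sample_lines("full_corpus.txt", n_samples=2_000_000)
with open("subsampled_corpus.txt", "w", encoding="utf-8") as f:
    f.writelines(sample)

# Train BPE on the subsample only; the resulting model can still encode the full corpus.
yttm.BPE.train(data="subsampled_corpus.txt", vocab_size=30000, model="corpus.bpe")
```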

Are you planning to add encoding directly from a file dataset? Right now bpe.encode on a list takes longer than bpe.train on a file; isn't that odd? Also, bpe.train uses less memory than bpe.encode with the full list loaded.

rrrepsac · Jun 29 '21 19:06
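
One workaround for the memory issue raised in the last comment is to stream the input file and encode it in batches instead of materializing the whole corpus as a Python list. A minimal sketch, assuming the documented encode API; the file names and batch size are hypothetical.

```python
import itertools
import youtokentome as yttm

bpe = yttm.BPE(model="corpus.bpe")
batch_size = 100_000

with open("full_corpus.txt", "r", encoding="utf-8") as src, \
     open("encoded_corpus.txt", "w", encoding="utf-8") as dst:
    while True:
        # Read the next chunk of lines without loading the whole file.
        batch = [line.rstrip("\n") for line in itertools.islice(src, batch_size)]
        if not batch:
            break
        # Encode one chunk at a time and write the token IDs out immediately.
        ids = bpe.encode(batch, output_type=yttm.OutputType.ID)
        for sentence_ids in ids:
            dst.write(" ".join(map(str, sentence_ids)) + "\n")
```

This keeps peak memory proportional to the batch size rather than the corpus size, though it does not by itself address the relative speed of encode versus train.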