
Out of memory error while training tokenizer

kouohhashi opened this issue 4 years ago • 3 comments

Hi,

When I tried to train a tokenizer, I got an out-of-memory error.

The dataset is the entire Japanese Wikipedia, about 5.1 GB. My server has 64 GB of memory.

[Screenshot of the out-of-memory error]

ByteLevelBPETokenizer and SentencePieceBPETokenizer caused out-of-memory errors, but CharBPETokenizer and BertWordPieceTokenizer were okay.

I want to train SentencePieceBPETokenizer if possible.

Are there any options to train SentencePieceBPETokenizer without memory errors? Or can I convert a SentencePiece model and vocab for use with a Hugging Face tokenizer?

My environment is Ubuntu 18.04 with Python 3.7.7.

Thanks in advance.

kouohhashi avatar Sep 18 '20 07:09 kouohhashi

Hi,

What version of tokenizers are you running?

The BPE algorithm can be quite memory intensive when tokens are long, which can be the case in Japanese because there are no spaces. We are also making some changes to lower the current memory footprint of some preprocessing.

  1. You could try adding a PreTokenizer to split your incoming sentences into relevant groups, to lower the size of the chunks that get fed to BPE. A simple way would be to preprocess your data and put each split on a different line.
  2. In the not-so-far future, you will be able to train with SentencePiece, which notably behaves better on languages that don't have spaces. It's still very bleeding edge.
  3. Our memory footprint will be lower in 0.9, so if you can afford to wait, that's a solution.
  4. If you can't afford to wait and the methods above don't work, please tell us; I'm willing to help get it working with the bleeding-edge stuff.

Narsil avatar Sep 18 '20 09:09 Narsil

My tokenizers version is 0.8.1.rc2.

And thank you so much for your response. I'll definitely try these suggestions.

Thanks

kouohhashi avatar Sep 19 '20 01:09 kouohhashi

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar May 14 '24 01:05 github-actions[bot]

Has your problem been resolved?

musexiaoluo avatar Aug 30 '24 07:08 musexiaoluo