
"Solution" to memory hogging in train_new_from_iterator with a hack

morphpiece opened this issue 8 months ago • 7 comments

Hi

So I was training a new tokenizer from the Llama tokenizer (meta-llama/Llama-2-7b-hf) on a medium-sized corpus (FineWeb-10BT sample: 15 million documents with an average length of 2,300 characters). After the first step, "Pre-processing sequences", the "Tokenize words" step would take over an hour and I ran out of RAM (780 GB). I distinctly remember that when I trained on a similarly sized (but different) corpus a few days back, this step took only around a minute.

After going through all the help I could find on the internet (here, here, and here) and changing the server (upgrading the RAM) multiple times, nothing worked. Finally I found that I had used a different old tokenizer, "meta-llama/Meta-Llama-3-8B", in my previous runs. I switched back to it and everything started working with the same processing time (~1 minute) and no memory hogging.
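
For anyone trying to reproduce this, here is a rough sketch of the kind of setup I mean. The dataset loading call, batch size, and vocab size below are illustrative assumptions rather than my exact script; the only point is the choice of base tokenizer passed to `train_new_from_iterator`.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Stream the corpus so the full 10BT sample never has to sit in memory at once
# (assumption: FineWeb sample-10BT config, "text" column).
dataset = load_dataset(
    "HuggingFaceFW/fineweb", name="sample-10BT", split="train", streaming=True
)

def batch_iterator(batch_size=1000):
    batch = []
    for example in dataset:
        batch.append(example["text"])
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

# Starting from meta-llama/Llama-2-7b-hf is what triggered the slow
# "Tokenize words" step and the memory blow-up for me; using the Llama 3
# tokenizer as the base worked fine.
old_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
new_tokenizer = old_tokenizer.train_new_from_iterator(
    batch_iterator(), vocab_size=32_000  # vocab_size is an illustrative choice
)
new_tokenizer.save_pretrained("my-new-tokenizer")
```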

I'm not exactly sure why this matters, but I'm putting it here for someone more experienced to look into; hopefully it helps someone.

morphpiece · Jun 04 '24 19:06