tokenizers
"Solution" to memory hogging in train_new_from_iterator with a hack
Hi
So I was training a new tokenizer from the Llama tokenizer (meta-llama/Llama-2-7b-hf) on a medium-sized corpus (FineWeb 10BT sample: 15 million documents with an average length of 2300 characters). After the first "Pre-processing sequences" step, the "Tokenize words" step would take 1+ hour and I ran out of RAM (780GB). I distinctly remember that when I trained on a similarly sized (but different) corpus a few days back, this step took only around 1 minute.
After going through all the help I could find on the internet here, here, and here, and changing the server (upgrading RAM) multiple times, nothing worked. Finally I found that I had used a different old tokenizer, "meta-llama/Meta-Llama-3-8B", in my previous runs. I switched to it and everything started working again, with the same processing time (~1 minute) and no memory hogging.
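
For reference, here is a minimal sketch of the kind of setup that worked for me. The exact dataset id/config (`HuggingFaceFW/fineweb`, `sample-10BT`), the `vocab_size`, and the batch size are illustrative assumptions, not my exact script; the key point is which base tokenizer is passed to `train_new_from_iterator`.

```python
from transformers import AutoTokenizer
from datasets import load_dataset

# Base tokenizer to retrain from. Starting from the Llama 3 tokenizer
# worked for me; starting from meta-llama/Llama-2-7b-hf was what hit the
# slow "Tokenize words" step and the memory blow-up.
old_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

# Stream the corpus so the full 15M-document sample never sits in RAM.
# Dataset id/config here are assumptions for illustration.
dataset = load_dataset(
    "HuggingFaceFW/fineweb", name="sample-10BT", split="train", streaming=True
)

def batch_iterator(batch_size=1000):
    batch = []
    for example in dataset:
        batch.append(example["text"])
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

# Train a new tokenizer with the same algorithm/settings as the base one.
new_tokenizer = old_tokenizer.train_new_from_iterator(
    batch_iterator(), vocab_size=32000
)
new_tokenizer.save_pretrained("my-new-tokenizer")
```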
I'm not exactly sure why the choice of base tokenizer matters here, but I'm putting it out there for someone more experienced to look into, and hopefully it helps someone.