tokenizers

memory allocation of 21474836480 bytes failed

Open · lostmsu opened this issue 4 months ago · 1 comment

I am unable to tokenize 1GB of text (the enwik9 dataset) on a 64GB machine, regardless of the TOKENIZERS_PARALLELISM setting. This happens with the Qwen3-4B-Base tokenizer.

I read the file into a str (Python) and call tokenizer.encode(text). After a couple of minutes of struggling, I get

memory allocation of 21474836480 bytes failed
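For context, a minimal reproduction along these lines (the exact loading call and file path are assumptions; the report only says the file is read into a str and passed to tokenizer.encode):

```python
from tokenizers import Tokenizer

# Assumed loading path; the report only mentions "the Qwen3-4B-Base tokenizer".
tokenizer = Tokenizer.from_pretrained("Qwen/Qwen3-4B-Base")

# enwik9 is ~1GB of text; the path is illustrative.
with open("enwik9", "r", encoding="utf-8", errors="ignore") as f:
    text = f.read()

# Fails after a few minutes with "memory allocation of 21474836480 bytes failed".
encoding = tokenizer.encode(text)
```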

Config

tokenizers==0.21.4
Python 3.12.9 | packaged by Anaconda, Inc. | ... | [MSC v.1929 64 bit (AMD64)] on win32
Windows 11 24H2, 64GB RAM

lostmsu · Aug 16 '25 17:08

Hey! Can you try setting the cache to 0 with tokenizer.resize_cache()? (https://huggingface.co/docs/tokenizers/main/en/api/models#tokenizers.models.BPE.cache_capacity)
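A minimal sketch of what that might look like. The hasattr guard is there because it is not confirmed that resize_cache is exposed on the Python model object in 0.21.4; the linked docs only document cache_capacity as a BPE constructor parameter:

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("Qwen/Qwen3-4B-Base")

# Guarded call: resize_cache may not be present in the installed Python bindings.
if hasattr(tokenizer.model, "resize_cache"):
    tokenizer.model.resize_cache(0)  # shrink the BPE merge cache to nothing

with open("enwik9", "r", encoding="utf-8", errors="ignore") as f:
    text = f.read()

encoding = tokenizer.encode(text)
```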

ArthurZucker · Sep 12 '25 15:09