memory allocation of 21474836480 bytes failed
I am unable to tokenize 1 GB of text (the enwik9 dataset) on a 64 GB machine, regardless of the TOKENIZERS_PARALLELISM setting. This happens with the Qwen3-4B-Base tokenizer.
I read the file into a Python str and call tokenizer.encode(text). After a couple of minutes of churning, I get:
memory allocation of 21474836480 bytes failed
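Roughly what my script does, as a minimal sketch (it assumes the tokenizer is pulled from the Qwen/Qwen3-4B-Base repo on the Hub; locally I load the same tokenizer):

```python
from tokenizers import Tokenizer

# Load the Qwen3-4B-Base tokenizer (Hub repo name assumed here).
tokenizer = Tokenizer.from_pretrained("Qwen/Qwen3-4B-Base")

# enwik9 is ~1 GB of Wikipedia text, read into a single Python str.
with open("enwik9", "r", encoding="utf-8", errors="ignore") as f:
    text = f.read()

# After a couple of minutes this aborts with:
#   memory allocation of 21474836480 bytes failed
encoding = tokenizer.encode(text)
```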
Config
tokenizers==0.21.4
Python 3.12.9 | packaged by Anaconda, Inc. | ... | [MSC v.1929 64 bit (AMD64)] on win32
Windows 11 24H2
64GB RAM
Hey! Can you try setting the cache capacity to 0 with tokenizer.resize_cache()? (https://huggingface.co/docs/tokenizers/main/en/api/models#tokenizers.models.BPE.cache_capacity)
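Untested sketch of what I mean (depending on your tokenizers version, the call may be exposed on the Tokenizer itself or on its BPE model, tokenizer.model):

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("Qwen/Qwen3-4B-Base")

with open("enwik9", "r", encoding="utf-8", errors="ignore") as f:
    text = f.read()

# Shrink the BPE merge cache to 0 before encoding; the cache can grow with the
# number of distinct words it sees, which adds up on a ~1 GB input.
bpe = tokenizer.model
if hasattr(bpe, "resize_cache"):  # exact location of the method may vary by version
    bpe.resize_cache(0)

encoding = tokenizer.encode(text)
```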