LlamaTokenizer with `use_fast=True` and `use_fast=False` causing memory leak when used with multiprocessing / `dataset.map(num_proc)`
When running `dataset.map` with `num_proc=16`, I am unable to tokenize a ~45GB dataset on a machine with >200GB of RAM. The dataset consists of ~30000 rows, each containing a string of 120-180k characters.
Memory usage increases linearly until it hits the 200GB maximum after just ~2000 such iterations / ~2000 rows.
Other things I have tried:
- Creating e.g. 16 tokenizers in global scope and accessing them via the `rank` parameter (see the sketch after this list).
- Calling `gc.collect()`.
- Not using `use_fast`, which makes the script more efficient: it now takes ~10k rows instead of ~2k to go OOM.
- Using `AutoTokenizer`.
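For reference, a minimal sketch of the per-rank variant mentioned in the first bullet; the names `TOKENIZERS_BY_RANK` and `tokenize_with_rank` are illustrative, not from the original script:

```python
from transformers import AutoTokenizer

N_PROCS = 16
MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# One tokenizer slot per worker, indexed by the `rank` that
# datasets.map passes in when with_rank=True.
TOKENIZERS_BY_RANK = [None] * N_PROCS


def tokenize_with_rank(example, rank: int = 0):
    if TOKENIZERS_BY_RANK[rank] is None:
        TOKENIZERS_BY_RANK[rank] = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)
    tokenizer = TOKENIZERS_BY_RANK[rank]
    example["input_ids"] = tokenizer(example["content"], max_length=None)["input_ids"]
    example["n_tokens"] = len(example["input_ids"])
    example["content"] = None
    return example
```

Since `datasets.map` workers are separate processes, each process only ever touches its own slot, so this ends up equivalent to a single module-level tokenizer per worker, which is consistent with it not changing the memory behavior.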
Reproduction script
import datasets
from transformers import LlamaTokenizerFast, AutoTokenizer
import gc

N_PROCS = 16

tokenizer_tinyllama = None


def tokenize(example, rank: int = 0):
    global tokenizer_tinyllama
    # gc.collect()
    if tokenizer_tinyllama is None:
        tokenizer_tinyllama = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0", use_fast=True)
    example["input_ids"] = tokenizer_tinyllama(example["content"], max_length=None)["input_ids"]
    example["n_tokens"] = len(example["input_ids"])
    example["content"] = None
    return example


def main():
    books3 = datasets.load_dataset("michael/set3_128k", streaming=False, keep_in_memory=False)  # jsonl file, around 45GB in jsonl
    # books3 = books3.shuffle()
    books3_updated = books3["train"].map(
        tokenize,
        num_proc=N_PROCS,
        with_rank=True,
    )
    books3_updated.push_to_hub(
        "michael/books3_128k_tokenized"
    )


if __name__ == "__main__":
    main()
Env
OS: Ubuntu 22.04
PIP freeze
aiohttp==3.9.4
aiosignal==1.3.1
async-timeout==4.0.3
attrs==21.2.0
Automat==20.2.0
Babel==2.8.0
bcrypt==3.2.0
blinker==1.4
certifi==2020.6.20
chardet==4.0.0
click==8.0.3
cloud-init==23.4.4
colorama==0.4.4
command-not-found==0.3
configobj==5.0.6
constantly==15.1.0
cryptography==3.4.8
datasets==2.18.0
dbus-python==1.2.18
decorator==4.4.2
devscripts===2.22.1ubuntu1
dill==0.3.8
distro==1.7.0
distro-info==1.1+ubuntu0.2
filelock==3.13.4
frozenlist==1.4.1
fsspec==2024.2.0
gpg==1.16.0
hf_transfer==0.1.6
httplib2==0.20.2
huggingface-hub==0.22.2
hyperlink==21.0.0
idna==3.3
importlib-metadata==4.6.4
incremental==21.3.0
jeepney==0.7.1
Jinja2==3.0.3
jsonpatch==1.32
jsonpointer==2.0
jsonschema==3.2.0
keyring==23.5.0
launchpadlib==1.10.16
lazr.restfulclient==0.14.4
lazr.uri==1.0.6
MarkupSafe==2.0.1
more-itertools==8.10.0
multidict==6.0.5
multiprocess==0.70.16
netifaces==0.11.0
numpy==1.26.4
oauthlib==3.2.0
packaging==24.0
pandas==2.2.2
pexpect==4.8.0
protobuf==5.26.1
ptyprocess==0.7.0
pyarrow==15.0.2
pyarrow-hotfix==0.6
pyasn1==0.4.8
pyasn1-modules==0.2.1
PyGObject==3.42.1
PyHamcrest==2.0.2
PyJWT==2.3.0
pyOpenSSL==21.0.0
pyparsing==2.4.7
pyrsistent==0.18.1
pyserial==3.5
python-apt==2.4.0+ubuntu3
python-dateutil==2.9.0.post0
python-debian==0.1.43+ubuntu1.1
python-linux-procfs==0.6.3
python-magic==0.4.24
pytz==2022.1
pyudev==0.22.0
pyxdg==0.27
PyYAML==5.4.1
regex==2023.12.25
requests==2.25.1
safetensors==0.4.3
screen-resolution-extra==0.0.0
SecretStorage==3.3.1
sentencepiece==0.2.0
service-identity==18.1.0
six==1.16.0
sos==4.5.6
ssh-import-id==5.11
systemd-python==234
tokenizers==0.15.2
tqdm==4.66.2
transformers==4.39.3
Twisted==22.1.0
typing_extensions==4.11.0
tzdata==2024.1
ubuntu-advantage-tools==8001
ufw==0.36.1
unattended-upgrades==0.1
unidiff==0.5.5
urllib3==1.26.5
wadllib==1.3.6
xdg==5
xkit==0.0.0
xxhash==3.4.1
yarl==1.9.4
zipp==1.0.0
zope.interface==5.4.0
Update: the following function does not seem to exhibit this behavior.
def tokenize(example, rank: int = 0):
    # global tokenizer_tinyllama
    gc.collect()
    # chat = [
    #     {"role": "user", "content": book},
    # ]
    # tokens = tokenizer_tinyllama.apply_chat_template(chat, tokenize=True)
    # if tokenizer_tinyllama is None:
    tokenizer_tinyllama = LlamaTokenizerFast.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0", use_fast=True)
    example["input_ids"] = tokenizer_tinyllama(example["content"], max_length=None)["input_ids"]
    example["n_tokens"] = len(example["input_ids"])
    example["content"] = None
    return example
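Since this workaround re-instantiates the tokenizer on every row, a batched variant may amortize that cost. This is an untested sketch assuming the same columns as above (`tokenize_batched` and the batch size are illustrative), not a confirmed fix for the underlying leak:

```python
def tokenize_batched(batch, rank: int = 0):
    # Build a fresh fast tokenizer per batch instead of per row, then let it go out of scope.
    tokenizer = LlamaTokenizerFast.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
    input_ids = tokenizer(batch["content"], max_length=None)["input_ids"]
    return {
        "input_ids": input_ids,
        "n_tokens": [len(ids) for ids in input_ids],
    }


# books3["train"].map(tokenize_batched, batched=True, batch_size=64,
#                     num_proc=N_PROCS, with_rank=True, remove_columns=["content"])
```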
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
No, not stale!
I also encountered a similar issue with 0.19.1.
Opened a new issue with a more general reproduction; I believe this is a more common problem.
Same issue here.
Thanks all for these. Is the issue more with `AutoTokenizer` than `LlamaTokenizerFast`?
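One way to check would be a standalone loop per loading path, watching peak RSS grow. A rough sketch, assuming a synthetic ~120k-character string in place of the dataset rows (`ru_maxrss` is reported in kilobytes on Linux); run each variant in its own process so the peaks are comparable:

```python
import resource
from transformers import AutoTokenizer, LlamaTokenizerFast

MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
TEXT = "lorem ipsum " * 10_000  # roughly 120k characters, similar to one dataset row


def peak_rss_mb() -> float:
    # Peak resident set size of this process, in MB (ru_maxrss is KB on Linux).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024


def run(tokenizer, n_rows: int = 2000):
    for i in range(n_rows):
        tokenizer(TEXT, max_length=None)
        if i % 500 == 0:
            print(f"row {i}: peak RSS {peak_rss_mb():.0f} MB")


if __name__ == "__main__":
    # Pick one per run to compare the two loading paths.
    run(AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True))
    # run(LlamaTokenizerFast.from_pretrained(MODEL_ID))
```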
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
Not stale
Have not had the time to tackle this yet, but keeping an eye on it!
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
Not stale