LlamaTokenizer with `use_fast=True` and `use_fast=False` causing memory leak when used with multiprocessing / `dataset.map(num_proc)`
When running `dataset.map` with `num_proc=16`, I am unable to tokenize a ~45GB dataset on a machine with >200GB of RAM. The dataset consists of ~30000 rows, each containing a string of 120-180k characters.
Memory usage increases linearly until it hits the 200GB maximum after just ~2000 such iterations / ~2000 rows.
Other things I have tried:
- Creating e.g. 16 tokenizers in global scope and accessing them via the `rank` parameter (see the sketch after this list).
- Calling `gc.collect()`.
- Not using `use_fast`, which makes the script more efficient: it now takes ~10k rows instead of ~2k to go OOM.
- Using `AutoTokenizer`.
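For reference, a minimal sketch of the per-rank variant mentioned in the first bullet; the names `TOKENIZERS_BY_RANK` and `tokenize_with_rank` are illustrative, not from the original script:

```python
from transformers import AutoTokenizer

N_PROCS = 16
MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# One tokenizer slot per worker, indexed by the `rank` that
# datasets.map passes in when with_rank=True.
TOKENIZERS_BY_RANK = [None] * N_PROCS


def tokenize_with_rank(example, rank: int = 0):
    if TOKENIZERS_BY_RANK[rank] is None:
        TOKENIZERS_BY_RANK[rank] = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)
    tokenizer = TOKENIZERS_BY_RANK[rank]
    example["input_ids"] = tokenizer(example["content"], max_length=None)["input_ids"]
    example["n_tokens"] = len(example["input_ids"])
    example["content"] = None
    return example
```

Since `datasets.map` workers are separate processes, each process only ever touches its own slot, so this ends up equivalent to a single module-level tokenizer per worker, which is consistent with it not changing the memory behavior.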
Reproduction script
import datasets
from transformers import LlamaTokenizerFast, AutoTokenizer
import gc

N_PROCS = 16

tokenizer_tinyllama = None


def tokenize(example, rank: int = 0):
    global tokenizer_tinyllama
    # gc.collect()
    if tokenizer_tinyllama is None:
        tokenizer_tinyllama = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0", use_fast=True)
    example["input_ids"] = tokenizer_tinyllama(example["content"], max_length=None)["input_ids"]
    example["n_tokens"] = len(example["input_ids"])
    example["content"] = None
    return example


def main():
    books3 = datasets.load_dataset("michael/set3_128k", streaming=False, keep_in_memory=False)  # jsonl file, around 45GB in jsonl
    # books3 = books3.shuffle()
    books3_updated = books3["train"].map(
        tokenize,
        num_proc=N_PROCS,
        with_rank=True,
    )
    books3_updated.push_to_hub(
        "michael/books3_128k_tokenized"
    )


if __name__ == "__main__":
    main()
Env
OS: Ubuntu 22.04
PIP freeze
aiohttp==3.9.4
aiosignal==1.3.1
async-timeout==4.0.3
attrs==21.2.0
Automat==20.2.0
Babel==2.8.0
bcrypt==3.2.0
blinker==1.4
certifi==2020.6.20
chardet==4.0.0
click==8.0.3
cloud-init==23.4.4
colorama==0.4.4
command-not-found==0.3
configobj==5.0.6
constantly==15.1.0
cryptography==3.4.8
datasets==2.18.0
dbus-python==1.2.18
decorator==4.4.2
devscripts===2.22.1ubuntu1
dill==0.3.8
distro==1.7.0
distro-info==1.1+ubuntu0.2
filelock==3.13.4
frozenlist==1.4.1
fsspec==2024.2.0
gpg==1.16.0
hf_transfer==0.1.6
httplib2==0.20.2
huggingface-hub==0.22.2
hyperlink==21.0.0
idna==3.3
importlib-metadata==4.6.4
incremental==21.3.0
jeepney==0.7.1
Jinja2==3.0.3
jsonpatch==1.32
jsonpointer==2.0
jsonschema==3.2.0
keyring==23.5.0
launchpadlib==1.10.16
lazr.restfulclient==0.14.4
lazr.uri==1.0.6
MarkupSafe==2.0.1
more-itertools==8.10.0
multidict==6.0.5
multiprocess==0.70.16
netifaces==0.11.0
numpy==1.26.4
oauthlib==3.2.0
packaging==24.0
pandas==2.2.2
pexpect==4.8.0
protobuf==5.26.1
ptyprocess==0.7.0
pyarrow==15.0.2
pyarrow-hotfix==0.6
pyasn1==0.4.8
pyasn1-modules==0.2.1
PyGObject==3.42.1
PyHamcrest==2.0.2
PyJWT==2.3.0
pyOpenSSL==21.0.0
pyparsing==2.4.7
pyrsistent==0.18.1
pyserial==3.5
python-apt==2.4.0+ubuntu3
python-dateutil==2.9.0.post0
python-debian==0.1.43+ubuntu1.1
python-linux-procfs==0.6.3
python-magic==0.4.24
pytz==2022.1
pyudev==0.22.0
pyxdg==0.27
PyYAML==5.4.1
regex==2023.12.25
requests==2.25.1
safetensors==0.4.3
screen-resolution-extra==0.0.0
SecretStorage==3.3.1
sentencepiece==0.2.0
service-identity==18.1.0
six==1.16.0
sos==4.5.6
ssh-import-id==5.11
systemd-python==234
tokenizers==0.15.2
tqdm==4.66.2
transformers==4.39.3
Twisted==22.1.0
typing_extensions==4.11.0
tzdata==2024.1
ubuntu-advantage-tools==8001
ufw==0.36.1
unattended-upgrades==0.1
unidiff==0.5.5
urllib3==1.26.5
wadllib==1.3.6
xdg==5
xkit==0.0.0
xxhash==3.4.1
yarl==1.9.4
zipp==1.0.0
zope.interface==5.4.0
Update: the following function does not seem to exhibit this behavior.
def tokenize(example, rank: int = 0):
    # global tokenizer_tinyllama
    gc.collect()
    # chat = [
    #     {"role": "user", "content": book},
    # ]
    # tokens = tokenizer_tinyllama.apply_chat_template(chat, tokenize=True)
    # if tokenizer_tinyllama is None:
    tokenizer_tinyllama = LlamaTokenizerFast.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0", use_fast=True)
    example["input_ids"] = tokenizer_tinyllama(example["content"], max_length=None)["input_ids"]
    example["n_tokens"] = len(example["input_ids"])
    example["content"] = None
    return example
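Since this workaround re-instantiates the tokenizer on every row, a batched variant may amortize that cost. This is an untested sketch assuming the same columns as above (`tokenize_batched` and the batch size are illustrative), not a confirmed fix for the underlying leak:

```python
def tokenize_batched(batch, rank: int = 0):
    # Build a fresh fast tokenizer per batch instead of per row, then let it go out of scope.
    tokenizer = LlamaTokenizerFast.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
    input_ids = tokenizer(batch["content"], max_length=None)["input_ids"]
    return {
        "input_ids": input_ids,
        "n_tokens": [len(ids) for ids in input_ids],
    }


# books3["train"].map(tokenize_batched, batched=True, batch_size=64,
#                     num_proc=N_PROCS, with_rank=True, remove_columns=["content"])
```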
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
No, not stale!
I also encountered a similar issue with 0.19.1.
Opened a new issue with a more general reproduction; I believe this is a more common problem.
Same issue here.
Thanks all for these. Is the issue more with `AutoTokenizer` than `LlamaTokenizerFast`?
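One way to check would be a standalone loop per loading path, watching peak RSS grow. A rough sketch, assuming a synthetic ~120k-character string in place of the dataset rows (`ru_maxrss` is reported in kilobytes on Linux); run each variant in its own process so the peaks are comparable:

```python
import resource
from transformers import AutoTokenizer, LlamaTokenizerFast

MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
TEXT = "lorem ipsum " * 10_000  # roughly 120k characters, similar to one dataset row


def peak_rss_mb() -> float:
    # Peak resident set size of this process, in MB (ru_maxrss is KB on Linux).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024


def run(tokenizer, n_rows: int = 2000):
    for i in range(n_rows):
        tokenizer(TEXT, max_length=None)
        if i % 500 == 0:
            print(f"row {i}: peak RSS {peak_rss_mb():.0f} MB")


if __name__ == "__main__":
    # Pick one per run to compare the two loading paths.
    run(AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True))
    # run(LlamaTokenizerFast.from_pretrained(MODEL_ID))
```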
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
Not stale
Have not had the time to tackle this yet, but keeping an eye on it!
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
Not stale