Tokenizer training gets stuck for a long time.
When training a ByteLevelBPE tokenizer on a 1.8 GB text file, it takes about 5 hours to get through 90% of the count-pairs step, but then it hung for 2 hours. The process keeps running, but not all CPUs are used. The code for training the tokenizer is:
```python
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer(lowercase=True)
tokenizer.train(
    text_path,
    min_frequency=2,
    show_progress=True,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
```

Did you hit your RAM limit and start to use swap? That could be an explanation. Did you try on a 1 GB dataset to see if that works?
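A minimal sketch of that check might look like this (paths are placeholders, and it assumes the corpus is a single plain-text file):

```python
# Rough sketch: carve off roughly the first 1 GB of the corpus and train on it,
# to check whether the hang is memory related. Paths are placeholders.
from tokenizers import ByteLevelBPETokenizer

text_path = "full_corpus.txt"          # original training file (placeholder)
sample_path = "corpus_1gb_sample.txt"  # 1 GB slice (placeholder)

with open(text_path, "rb") as src, open(sample_path, "wb") as dst:
    dst.write(src.read(1_000_000_000))  # ~1 GB; may cut the last line mid-way

tokenizer = ByteLevelBPETokenizer(lowercase=True)
tokenizer.train(
    sample_path,
    min_frequency=2,
    show_progress=True,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
```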
I think I'm having a similar issue. I'm using EC2 (96 vCPU; 768 GiB). After the "Count pairs" step it just hangs.
datasets 1.14.0, huggingface-hub 0.0.19, tokenizers 0.10.3, transformers 4.11.3
Can you also provide a memory snapshot when it blocks (top or htop, for instance)?
45 GB can require quite a lot more memory (it depends on the data), so if you have 64 GB it can definitely hit swap and start thrashing.
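If watching top/htop is inconvenient, a tiny sketch like this can log the same numbers while training runs (it uses psutil, which is an extra dependency and just one way to get the snapshot):

```python
# Print a RAM/swap snapshot; run this in a separate shell while training hangs.
import psutil

vm = psutil.virtual_memory()
sw = psutil.swap_memory()
print(f"RAM : {vm.used / 1e9:6.1f} GB used / {vm.total / 1e9:6.1f} GB total")
print(f"Swap: {sw.used / 1e9:6.1f} GB used / {sw.total / 1e9:6.1f} GB total")
```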
I don't think it's spilling over into swap (took this snapshot just after "Count pairs" completed):
Incidentally, processing times today are much longer than they were yesterday, though the data and EC2 instance have not changed. It takes a little while, but eventually the progress bar advances to the "Compute merges" stage. It seems to be working today.
@KatieGoz Probably unrelated, but did the read speed really go from 7 min to 2 h 48 min? That seems pretty off.
Also, the compute merges step is actually starting this time, but it's extremely slow, is that right?
Can you share a reproducible script? That could help figure it out.
Actually, looking at the code, I can see that after computing the pairs, all of them are inserted into a BinaryHeap without being shown in a progress bar, so it's possible that this is all that's happening: BinaryHeap insertion is O(log n), and there are n of them, so it's another O(n log n) step that doesn't have its own progress bar.
With a script we could confirm or rule this out.
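For intuition, here's a rough Python stand-in for that phase (the real code is Rust using a BinaryHeap; this is only meant to show the shape of the cost):

```python
# Pushing every (pair, count) item onto a heap is n inserts at O(log n) each,
# i.e. an O(n log n) phase that currently has no progress bar of its own.
import heapq
import random

# Stand-in for the result of the "Count pairs" step.
pair_counts = {
    (random.randrange(50_000), random.randrange(50_000)): random.randrange(1, 1_000)
    for _ in range(1_000_000)
}

heap = []
for pair, count in pair_counts.items():
    heapq.heappush(heap, (-count, pair))  # negate counts to get a max-heap

best_count, best_pair = heap[0]
print(f"most frequent pair: {best_pair} (count {-best_count})")
```

(In Python, heapq.heapify on a pre-built list would even be O(n); the point here is only that this phase is invisible in the progress output.)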
The pre-processing time did increase from ~7 min to nearly 3 hrs. It does seem odd; the file is being read directly from the specified directory, never from a cache, right?
The compute merges step did start this time, but it took a few minutes for that progress bar to appear. (Good to know that the BinaryHeap insertion step likely explains the delay.)
I can't share the dataset, so I can't exactly share a reproducible script. That said, my results today differed from yesterday's, so I'm not sure how reproducible the issue is. (The JSON dataset contains a single field with 353,384,837 sentences, if that's any help.)
Thanks for your help troubleshooting this!
There is no cache, no. I don't see why there would be such a difference if your code hasn't changed. Could it also be the underlying hardware?
If the issue is not reproducible, it's going to be hard to fix, tbh. (The binary heap step can most likely be modified to be both faster and more explicit.)
I'm using an ml.r5.24xlarge AWS instance, so the issue could have been a temporary problem with my instance.
I tried to reproduce your issue on English data (big.txt from the test files, repeated many times). However, the binary heap step was extremely fast compared to the other operations (by an order of magnitude).
Without an explicit benchmark we can run against to make sure modifications improve things across the board, I'm hesitant to make any changes to the BPE algorithm.
Same here. I'm using an AWS x1e.32xlarge (3,904 GB memory) and training a SentencePieceBPE on a 180 GB corpus; it just hangs for ~8 hours on Count pairs at the last 1% (the progress bar froze completely):
```
[08:42:09] Pre-processing files (194139 Mo)   100%
[01:11:12] Tokenize words                     700608638/700608638
[05:27:56] Count pairs                        693602514/700608638
```
htop result:
Hi @dszhengyu,
Thanks for the report.
- 180 GB is likely to trigger a bug where we overflow the u32 count (I can't be certain it will trigger; just check the resulting tokenization carefully, because if something overflows it might be silent and simply ruin your tokenizer). There's no easy fix for that: making the lib purely u64 would slow it down for many people, and supporting both is a big change.
- Thanks a lot for the top screen. It's using only 1 core instead of 64, which could be the reason for this "stuck" behavior.
- It doesn't seem to be swap related.
The first issue needs to be addressed in a separate PR if you want to get a good tokenizer (no overflow). Then we can take a look at the slow step you're seeing, and then we can parallelize the heap loop to see if that's the issue.
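To make the overflow point concrete, here's a toy Python illustration of u32 wrap-around (my own sketch, not the library's code):

```python
# A pair count stored in 32 bits silently wraps past 2**32 - 1, so an extremely
# frequent pair can suddenly look rare, which would corrupt the learned merges.
U32_MAX = 2**32 - 1

def bump_u32(count: int, increment: int = 1) -> int:
    """Add with u32 wrap-around semantics."""
    return (count + increment) & U32_MAX

count = U32_MAX
print(bump_u32(count))  # 0 -- the count has silently reset
```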
Is your data English, or at least a space-separated language (not Chinese, for instance)? If your pre-tokenization is not very efficient, there might be other issues related to BPE itself, which doesn't handle long sequences that well (without spaces, sequences can get quite long quite fast).
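As an illustration of why that matters (my own example, using the Whitespace pre-tokenizer just for demonstration): BPE works on the units produced by pre-tokenization, so unsegmented text collapses into very long units.

```python
from tokenizers.pre_tokenizers import Whitespace

pre = Whitespace()
print(pre.pre_tokenize_str("the quick brown fox"))
# [('the', (0, 3)), ('quick', (4, 9)), ('brown', (10, 15)), ('fox', (16, 19))]
print(pre.pre_tokenize_str("你好世界你好世界"))
# [('你好世界你好世界', (0, 8))] -- one long "word" that BPE must merge character by character
```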
Same issue here... it (BpeTrainer) is stuck on "Compute merges" while consuming all 64 threads for a long time...
Same problem: stuck on compute merges for 10+ hrs when all the other steps took ~3 hrs. Dataset size is 300 GB, English. CPU is at 1000%.
Is there any update on this issue? It seems like BPE tokenizers get stuck at Compute merges on long training corpora.
#1313 and https://github.com/huggingface/tokenizers/issues/1206#issuecomment-1496022146 should give some ideas.
I also have the issue of slow training speed with the tokenizer on smaller datasets. Upon investigation, it became clear that the tokenizer only utilizes 1 CPU core, and batching or not batching doesn’t affect its speed. What do you think is the solution to this problem?
Could you share a reproducer? Do you have os.environ["TOKENIZERS_PARALLELISM"] = "1" set (in Python), or the correct rayon setup on the Rust side? It also depends on which tokenizer and which version you are using.
Thank you for your guidance, it’s fixed!
Do you mind sharing your solution? How were you able to fix the issue? Thanks.
Checking on this issue, can someone share a workaround or insight into this problem?
Just run os.environ["TOKENIZERS_PARALLELISM"] = "1". It's weird, but the BPE trainer has issues with parallelism.
@tehranixyz @sinaahmadi, I apologize for the delay in responding. As @SoshyHayami mentioned, executing this command will resolve the issue, and this process will run in parallel.
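For anyone landing here later, a minimal sketch of the workaround discussed above (vocab settings and the corpus path are placeholders):

```python
import os
os.environ["TOKENIZERS_PARALLELISM"] = "1"  # set before the tokenizer spawns its thread pool

from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=30_000, special_tokens=["<unk>", "<s>", "</s>"])
tokenizer.train(["corpus.txt"], trainer)  # "corpus.txt" is a placeholder corpus file
```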