Tokenizer training gets stuck for a long time.
When training a ByteLevelBPE tokenizer on a 1.8 GB text file, it takes about 5 hours to get through 90% of the count-pairs step, but then it hung for 2 hours. The process keeps running, but not all CPUs are used. The code for training the tokenizer is:
```python
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer(lowercase=True)
tokenizer.train(
    text_path,
    min_frequency=2,
    show_progress=True,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
```

Did you hit your RAM limit and start to use swap? That could be an explanation. Did you try on a 1 GB dataset to see if that works?
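A minimal sketch of that check might look like this (paths are placeholders, and it assumes the corpus is a single plain-text file):

```python
# Rough sketch: carve off roughly the first 1 GB of the corpus and train on it,
# to check whether the hang is memory related. Paths are placeholders.
from tokenizers import ByteLevelBPETokenizer

text_path = "full_corpus.txt"          # original training file (placeholder)
sample_path = "corpus_1gb_sample.txt"  # 1 GB slice (placeholder)

with open(text_path, "rb") as src, open(sample_path, "wb") as dst:
    dst.write(src.read(1_000_000_000))  # ~1 GB; may cut the last line mid-way

tokenizer = ByteLevelBPETokenizer(lowercase=True)
tokenizer.train(
    sample_path,
    min_frequency=2,
    show_progress=True,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
```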
I think I'm having a similar issue. I'm using EC2 (96 vCPU; 768 GiB). After the "Count pairs" step it just hangs.
datasets 1.14.0, huggingface-hub 0.0.19, tokenizers 0.10.3, transformers 4.11.3
Can you also provide a memory snapshot when it blocks (top or htop, for instance)?
45 GB can require quite a lot more memory (it depends on the data), so if you have 64 GB it can definitely hit swap and start thrashing.
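If watching top/htop is inconvenient, a tiny sketch like this can log the same numbers while training runs (it uses psutil, which is an extra dependency and just one way to get the snapshot):

```python
# Print a RAM/swap snapshot; run this in a separate shell while training hangs.
import psutil

vm = psutil.virtual_memory()
sw = psutil.swap_memory()
print(f"RAM : {vm.used / 1e9:6.1f} GB used / {vm.total / 1e9:6.1f} GB total")
print(f"Swap: {sw.used / 1e9:6.1f} GB used / {sw.total / 1e9:6.1f} GB total")
```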
I don't think it's spilling over into swap (took this snapshot just after "Count pairs" completed):
Incidentally, processing times today are much longer than they were yesterday, though the data and EC2 instance have not changed. It takes a little while, but eventually the progress bar advances to the "Compute merges" stage. It seems to be working today.
@KatieGoz Probably unrelated, but did the read speed really go from 7 min to 2 h 48 min? That seems pretty off.
Also, the compute merges step is actually starting this time, but it's extremely slow, is that right?
Can you share a reproducible script? That could help figure it out.
Actually, looking at the code, I can see that after computing the pairs, all of them are inserted into a BinaryHeap without being shown in a progress bar, so it's possible that this is all that's happening: BinaryHeap insertion is O(log n), and there are n of them, so it's another O(n log n) step that doesn't have its own progress bar.
With a script we could confirm or rule this out.
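For intuition, here's a rough Python stand-in for that phase (the real code is Rust using a BinaryHeap; this is only meant to show the shape of the cost):

```python
# Pushing every (pair, count) item onto a heap is n inserts at O(log n) each,
# i.e. an O(n log n) phase that currently has no progress bar of its own.
import heapq
import random

# Stand-in for the result of the "Count pairs" step.
pair_counts = {
    (random.randrange(50_000), random.randrange(50_000)): random.randrange(1, 1_000)
    for _ in range(1_000_000)
}

heap = []
for pair, count in pair_counts.items():
    heapq.heappush(heap, (-count, pair))  # negate counts to get a max-heap

best_count, best_pair = heap[0]
print(f"most frequent pair: {best_pair} (count {-best_count})")
```

(In Python, heapq.heapify on a pre-built list would even be O(n); the point here is only that this phase is invisible in the progress output.)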
The pre-processing time did increase from ~7 min to nearly 3 hrs. It does seem odd; the file is being read directly from the specified directory, never from a cache, right?
The compute merges step did start this time, but it took a few minutes for that progress bar to appear. (Good to know that the BinaryHeap insertion step likely explains the delay.)
I can't share the dataset, so I can't exactly share a reproducible script. That said, my results today differed from yesterday's, so I'm not sure how reproducible the issue is. (The JSON dataset contains a single field with 353,384,837 sentences, if that's any help.)
Thanks for your help troubleshooting this!
There is no cache, no. I don't see why there would be such a difference if your code hasn't changed. Could it also be the underlying hardware?
If the issue is not reproducible, it's going to be hard to fix, tbh. (The binary heap step can most likely be modified to be both faster and more explicit.)
I'm using an ml.r5.24xlarge AWS instance, so the issue could have been a temporary problem with my instance.
I tried to reproduce your issue on English data (big.txt from the test files, repeated many times). However, the binary heap step was extremely fast compared to the other operations (by an order of magnitude).
Without an explicit benchmark we can run against to make sure modifications improve things across the board, I'm hesitant to make any changes to the BPE algorithm.
Same here. I'm using an AWS x1e.32xlarge (3,904 GB memory) and training a SentencePieceBPE on a 180 GB corpus; it just hangs for ~8 hours on Count pairs at the last 1% (the progress bar froze completely):
```
[08:42:09] Pre-processing files (194139 Mo)   100%
[01:11:12] Tokenize words                     700608638/700608638
[05:27:56] Count pairs                        693602514/700608638
```
htop result:
Hi @dszhengyu,
Thanks for the report.
- 180 GB is likely to trigger a bug where we overflow the u32 count (I can't be certain it will trigger; just check the resulting tokenization carefully, because if something overflows it might be silent and simply ruin your tokenizer). There's no easy fix for that: making the lib purely u64 would slow it down for many people, and supporting both is a big change.
- Thanks a lot for the top screen. It's using only 1 core instead of 64, which could be the reason for this "stuck" behavior.
- It doesn't seem to be swap related.
The first issue needs to be addressed in a separate PR if you want to get a good tokenizer (no overflow). Then we can take a look at the slow step you're seeing, and then we can parallelize the heap loop to see if that's the issue.
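To make the overflow point concrete, here's a toy Python illustration of u32 wrap-around (my own sketch, not the library's code):

```python
# A pair count stored in 32 bits silently wraps past 2**32 - 1, so an extremely
# frequent pair can suddenly look rare, which would corrupt the learned merges.
U32_MAX = 2**32 - 1

def bump_u32(count: int, increment: int = 1) -> int:
    """Add with u32 wrap-around semantics."""
    return (count + increment) & U32_MAX

count = U32_MAX
print(bump_u32(count))  # 0 -- the count has silently reset
```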
Is your data English, or at least a space-separated language (not Chinese, for instance)? If your pre-tokenization is not very efficient, there might be other issues related to BPE itself, which doesn't handle long sequences that well (without spaces, sequences can get quite long quite fast).
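As an illustration of why that matters (my own example, using the Whitespace pre-tokenizer just for demonstration): BPE works on the units produced by pre-tokenization, so unsegmented text collapses into very long units.

```python
from tokenizers.pre_tokenizers import Whitespace

pre = Whitespace()
print(pre.pre_tokenize_str("the quick brown fox"))
# [('the', (0, 3)), ('quick', (4, 9)), ('brown', (10, 15)), ('fox', (16, 19))]
print(pre.pre_tokenize_str("你好世界你好世界"))
# [('你好世界你好世界', (0, 8))] -- one long "word" that BPE must merge character by character
```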
Same issue here... it (BpeTrainer) is stuck on "Compute merges" while consuming all 64 threads for a long time...
Same problem: stuck on compute merges for 10+ hrs when all the other steps took ~3 hrs. Dataset size is 300 GB, English. CPU is at 1000%.
Is there any update on this issue? It seems like BPE tokenizers get stuck at Compute merges on long training corpora.
#1313 and https://github.com/huggingface/tokenizers/issues/1206#issuecomment-1496022146 should give some ideas.
I also have the issue of slow training speed with the tokenizer on smaller datasets. Upon investigation, it became clear that the tokenizer only utilizes 1 CPU core, and batching or not batching doesn’t affect its speed. What do you think is the solution to this problem?
Could you share a reproducer? Do you have os.environ["TOKENIZERS_PARALLELISM"] = "1" set (in Python), or the correct rayon setup on the Rust side? It also depends on which tokenizer and which version you are using.
Thank you for your guidance, it’s fixed!
Do you mind sharing your solution? How were you able to fix the issue? Thanks.
Checking on this issue, can someone share a workaround or insight into this problem?
Just run os.environ["TOKENIZERS_PARALLELISM"] = "1". It's weird, but the BPE trainer has issues with parallelism.
@tehranixyz @sinaahmadi, I apologize for the delay in responding. As @SoshyHayami mentioned, executing this command will resolve the issue, and this process will run in parallel.
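For anyone landing here later, a minimal sketch of the workaround discussed above (vocab settings and the corpus path are placeholders):

```python
import os
os.environ["TOKENIZERS_PARALLELISM"] = "1"  # set before the tokenizer spawns its thread pool

from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=30_000, special_tokens=["<unk>", "<s>", "</s>"])
tokenizer.train(["corpus.txt"], trainer)  # "corpus.txt" is a placeholder corpus file
```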