OOM and slow tokenization after upgrading LLaMA-Factory
Reminder
- [x] I have read the above rules and searched the existing issues.
System Info
old llamafactory version: a71e6850211b6e54c07b1900cdcd7f52f7832629
new llamafactory version: cf1087d409b2513338f9f32a33b6d95919b96d91
transformers==4.51.3
Reproduction
TORCHRUN \
src/train.py \
--stage sft \
--mask_history true \
--do_train \
--finetuning_type full \
--deepspeed examples/deepspeed/ds_z3_config.json \
--model_name_or_path "Qwen2.5/Qwen2.5-VL-7B-Instruct" \
--image_max_pixels 262144 \
--video_max_pixels 16384 \
--video_maxlen 32 \
--trust_remote_code \
--dataset .... \
--template qwen2_vl \
--cutoff_len 32768 \
--overwrite_cache \
--dataloader_num_workers 0 \
--preprocessing_num_workers 128 \
--output_dir .... \
--logging_steps 1 \
--save_steps 200 \
--tokenized_path "saves/datasets/0709" \
--overwrite_output_dir \
--save_only_model \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 8 \
--learning_rate 1e-5 \
--weight_decay 0.1 \
--max_grad_norm 1.0 \
--adam_beta1 0.9 \
--adam_beta2 0.95 \
--gradient_checkpointing \
--enable_liger_kernel \
--use_unsloth_gc \
--flash_attn fa2 \
--num_train_epochs 3.0 \
--lr_scheduler_type "cosine_with_min_lr" \
--lr_scheduler_kwargs '{"min_lr":1e-6}' \
--warmup_steps 10 \
--bf16 \
--ddp_timeout 180000000
Others
Training works fine with commit a71e6850211b6e54c07b1900cdcd7f52f7832629. However, after upgrading llamafactory (commit id: cf1087d409b2513338f9f32a33b6d95919b96d91), I observe two issues:
- Tokenization becomes very slow: the old version takes about 20 minutes, while the new version takes more than 2 hours.
- Training OOMs: training the 7B model on 32 A100s runs out of memory after about 60 iterations (no OOM at the beginning of training).
Any idea on the cause of these issues?
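One way to tell whether this is a genuine leak is to log CUDA allocator statistics every few steps and watch for monotonic growth. A minimal sketch using plain PyTorch calls (independent of LLaMA-Factory; the helper name and the 10-step interval are arbitrary):

import torch

def log_cuda_memory(step: int) -> None:
    # Print allocated/reserved/peak memory for each visible GPU.
    for i in range(torch.cuda.device_count()):
        alloc = torch.cuda.memory_allocated(i) / 2**30      # GiB in live tensors
        reserved = torch.cuda.memory_reserved(i) / 2**30    # GiB held by the caching allocator
        peak = torch.cuda.max_memory_allocated(i) / 2**30   # high-water mark
        print(f"step {step} cuda:{i} alloc={alloc:.2f} reserved={reserved:.2f} peak={peak:.2f} GiB")

# inside the training loop:
#     if step % 10 == 0:
#         log_cuda_memory(step)

If alloc grows steadily across iterations, something is retaining tensors; if only reserved grows, allocator fragmentation is the more likely culprit.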
What do you mean by 32 A100?
32 NVIDIA A100 GPUs.
What's the size of your dataset, and how much memory does each GPU have?
I hit the same issue. Tokenization takes much longer than in the older version: previously it was less than 3 minutes, and now it takes an hour.
Running tokenizer on dataset (num_proc=32):   6%| 24000/391529 [39:37<...]
Running tokenizer on dataset (num_proc=32):   8%| 30000/391529 [42:34<...]
Running tokenizer on dataset (num_proc=32):   8%| 31000/391529 [58:29<...]
Running tokenizer on dataset (num_proc=32):  11%| 42000/391529 [1:17:45<...]
This is how slow it is.
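To check whether image preprocessing itself regressed, the fast and slow image processors can be timed directly in transformers. A rough microbenchmark sketch (the model id, image size, and iteration count are arbitrary):

import time
import numpy as np
from PIL import Image
from transformers import AutoImageProcessor

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"  # example Hub id; substitute your model
image = Image.fromarray(np.random.randint(0, 256, (512, 512, 3), dtype=np.uint8))

for use_fast in (True, False):
    proc = AutoImageProcessor.from_pretrained(model_id, use_fast=use_fast)
    start = time.perf_counter()
    for _ in range(50):
        proc(images=image, return_tensors="pt")
    print(f"use_fast={use_fast}: {time.perf_counter() - start:.2f}s for 50 calls")

The absolute numbers depend on CPU, image resolution, and the transformers version, but a large gap here would point at the processor rather than at LLaMA-Factory's own preprocessing code.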
+1
Same problem; the "Running tokenizer on dataset" step is now very slow.
Might be the same issue; tracking this.
same problem @hiyouga
I've encountered the same issue. With all other conditions the same, the "Running tokenizer on dataset" step is much slower in version 0.9.4dev than in 0.9.2dev: we're talking about a difference of minutes versus hours, or even more than ten hours.
same problem +1
same problem +1
same problem +1
Could you mitigate this problem by deleting these lines? https://github.com/hiyouga/LLaMA-Factory/blob/2b66b4df43cfd8cdee5130e44270135113902569/src/llamafactory/cli.py#L157-L159
same problem
Could you mitigate this problem by deleting these lines?
src/llamafactory/cli.py, lines 157 to 159 in 2b66b4d:
from multiprocessing import freeze_support
freeze_support()
I tried it, but it's still too slow.
Could you mitigate this problem by deleting these lines?
src/llamafactory/cli.py, lines 157 to 159 in 2b66b4d:
from multiprocessing import freeze_support
freeze_support()
Doesn't work.
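That outcome is expected: per the Python documentation, freeze_support() only does something inside a frozen Windows executable and is a no-op when a script runs normally, so deleting it should not change tokenization speed on Linux. A self-contained check:

from multiprocessing import freeze_support

if __name__ == "__main__":
    freeze_support()  # returns immediately in a normal (non-frozen) process
    print("not frozen: execution continues normally")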
same problem
I've resolved this bug by setting use_fast_tokenizer: false.
After carefully comparing the dependency versions in the environments for v0.9.2 and v0.9.4, I suspect the issue is caused by a newer version of the transformers library.
My investigation of the logs supports this theory. In the previous working version (v0.9.2), the following warning message would appear by default, indicating the use of a "slow" processor:
Using a slow image processor as use_fast is unset and a slow processor was saved with this model. use_fast=True will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with use_fast=False.
This suggests that the new default "fast" processor behavior in a recent transformers update is the root cause. Forcing it to use the legacy "slow" processor by setting the flag to false fixes the issue.
For context, the model I am using is qwen2.5-vl-32b.
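If this theory is right, the same workaround can be exercised directly at the transformers level by forcing the legacy image processor when the multimodal processor is loaded. A sketch, assuming use_fast is forwarded to the image processor as the warning message quoted above implies (the Hub id is an example):

from transformers import AutoProcessor

# Force the legacy "slow" (PIL-based) image processor instead of the
# fast (torch/torchvision-based) one that newer transformers prefers.
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2.5-VL-32B-Instruct",
    use_fast=False,
)
print(type(processor.image_processor).__name__)  # should not be a *Fast class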
@ZhouZineng Thanks for your feedback! We'll investigate this issue.
I am using Qwen2.5-1.5B for SFT, and --no_use_fast_tokenizer makes it slower than --use_fast_tokenizer.
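That is not necessarily a contradiction. For a text-only model the flag can only affect the tokenizer, and the Rust-backed fast tokenizer genuinely is faster than the slow Python one, while the multimodal slowdown discussed above sits in the image processor, which a text-only model never loads. A sketch of the two separate use_fast switches (model ids are examples):

from transformers import AutoTokenizer, AutoImageProcessor

# Text: the Rust-backed fast tokenizer is usually much faster, so keep it
# for text-only models such as Qwen2.5-1.5B.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B", use_fast=True)

# Images: the suspected regression is in the fast image processor, which
# only exists for vision models such as Qwen2.5-VL.
image_processor = AutoImageProcessor.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", use_fast=False
)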
Any conclusion or progress? Thanks for your work!
Any conclusion or progress?
Converting format of dataset: 46876 examples [00:01, 13100.96 examples/s]
Running tokenizer on dataset: 13%|█████████▋ | 3000/23438 [17:59<2:03:36, 2.76 examples/s]