OOM and slow tokenization after upgrading LLaMA-Factory
Reminder
- [x] I have read the above rules and searched the existing issues.
System Info
old llamafactory version: a71e6850211b6e54c07b1900cdcd7f52f7832629
new llamafactory version: cf1087d409b2513338f9f32a33b6d95919b96d91
transformers==4.51.3
Reproduction
TORCHRUN \
src/train.py \
--stage sft \
--mask_history true \
--do_train \
--finetuning_type full \
--deepspeed examples/deepspeed/ds_z3_config.json \
--model_name_or_path "Qwen2.5/Qwen2.5-VL-7B-Instruct" \
--image_max_pixels 262144 \
--video_max_pixels 16384 \
--video_maxlen 32 \
--trust_remote_code \
--dataset .... \
--template qwen2_vl \
--cutoff_len 32768 \
--overwrite_cache \
--dataloader_num_workers 0 \
--preprocessing_num_workers 128 \
--output_dir .... \
--logging_steps 1 \
--save_steps 200 \
--tokenized_path "saves/datasets/0709" \
--overwrite_output_dir \
--save_only_model \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 8 \
--learning_rate 1e-5 \
--weight_decay 0.1 \
--max_grad_norm 1.0 \
--adam_beta1 0.9 \
--adam_beta2 0.95 \
--gradient_checkpointing \
--enable_liger_kernel \
--use_unsloth_gc \
--flash_attn fa2 \
--num_train_epochs 3.0 \
--lr_scheduler_type "cosine_with_min_lr" \
--lr_scheduler_kwargs '{"min_lr":1e-6}' \
--warmup_steps 10 \
--bf16 \
--ddp_timeout 180000000
Others
Training works fine with commit a71e6850211b6e54c07b1900cdcd7f52f7832629. However, after upgrading llamafactory (commit id: cf1087d409b2513338f9f32a33b6d95919b96d91), I observe two issues:
- Tokenization becomes very slow: the old version takes about 20 minutes, while the new version takes more than 2 hours.
- Training OOMs: training the 7B model on 32 A100s runs out of memory after about 60 iterations (no OOM at the beginning of training).
Any idea on the cause of these issues?
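One way to tell whether this is a genuine leak is to log CUDA allocator statistics every few steps and watch for monotonic growth. A minimal sketch using plain PyTorch calls (independent of LLaMA-Factory; the helper name and the 10-step interval are arbitrary):

import torch

def log_cuda_memory(step: int) -> None:
    # Print allocated/reserved/peak memory for each visible GPU.
    for i in range(torch.cuda.device_count()):
        alloc = torch.cuda.memory_allocated(i) / 2**30      # GiB in live tensors
        reserved = torch.cuda.memory_reserved(i) / 2**30    # GiB held by the caching allocator
        peak = torch.cuda.max_memory_allocated(i) / 2**30   # high-water mark
        print(f"step {step} cuda:{i} alloc={alloc:.2f} reserved={reserved:.2f} peak={peak:.2f} GiB")

# inside the training loop:
#     if step % 10 == 0:
#         log_cuda_memory(step)

If alloc grows steadily across iterations, something is retaining tensors; if only reserved grows, allocator fragmentation is the more likely culprit.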
What do you mean by 32 A100?
32 NVIDIA A100 GPUs.
What's the size of your dataset, and how much memory does each GPU have?
I hit the same issue. Tokenization takes much longer than in the older version: previously it was less than 3 minutes, and now it takes an hour.
Running tokenizer on dataset (num_proc=32):   6%| 24000/391529 [39:37<...]
Running tokenizer on dataset (num_proc=32):   8%| 30000/391529 [42:34<...]
Running tokenizer on dataset (num_proc=32):   8%| 31000/391529 [58:29<...]
Running tokenizer on dataset (num_proc=32):  11%| 42000/391529 [1:17:45<...]
This is how slow it is.
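To check whether image preprocessing itself regressed, the fast and slow image processors can be timed directly in transformers. A rough microbenchmark sketch (the model id, image size, and iteration count are arbitrary):

import time
import numpy as np
from PIL import Image
from transformers import AutoImageProcessor

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"  # example Hub id; substitute your model
image = Image.fromarray(np.random.randint(0, 256, (512, 512, 3), dtype=np.uint8))

for use_fast in (True, False):
    proc = AutoImageProcessor.from_pretrained(model_id, use_fast=use_fast)
    start = time.perf_counter()
    for _ in range(50):
        proc(images=image, return_tensors="pt")
    print(f"use_fast={use_fast}: {time.perf_counter() - start:.2f}s for 50 calls")

The absolute numbers depend on CPU, image resolution, and the transformers version, but a large gap here would point at the processor rather than at LLaMA-Factory's own preprocessing code.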
+1
Same problem; the "Running tokenizer on dataset" step is now very slow.
Might be the same issue; tracking this.
same problem @hiyouga
I've encountered the same issue. With all other conditions the same, the "Running tokenizer on dataset" step is much slower in version 0.9.4dev than in 0.9.2dev: we're talking about a difference of minutes versus hours, or even more than ten hours.
same problem +1
same problem +1
same problem +1
Could you mitigate this problem by deleting these lines? https://github.com/hiyouga/LLaMA-Factory/blob/2b66b4df43cfd8cdee5130e44270135113902569/src/llamafactory/cli.py#L157-L159
same problem
Could you mitigate this problem by deleting these lines?
src/llamafactory/cli.py, lines 157 to 159 in 2b66b4d:
from multiprocessing import freeze_support
freeze_support()
I tried it, but it's still too slow.
Could you mitigate this problem by deleting these lines?
src/llamafactory/cli.py, lines 157 to 159 in 2b66b4d:
from multiprocessing import freeze_support
freeze_support()
Doesn't work.
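That outcome is expected: per the Python documentation, freeze_support() only does something inside a frozen Windows executable and is a no-op when a script runs normally, so deleting it should not change tokenization speed on Linux. A self-contained check:

from multiprocessing import freeze_support

if __name__ == "__main__":
    freeze_support()  # returns immediately in a normal (non-frozen) process
    print("not frozen: execution continues normally")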
same problem
I've resolved this bug by setting use_fast_tokenizer: false.
After carefully comparing the dependency versions in the environments for v0.9.2 and v0.9.4, I suspect the issue is caused by a newer version of the transformers library.
My investigation of the logs supports this theory. In the previous working version (v0.9.2), the following warning message would appear by default, indicating the use of a "slow" processor:
Using a slow image processor as use_fast is unset and a slow processor was saved with this model. use_fast=True will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with use_fast=False.
This suggests that the new default "fast" processor behavior in a recent transformers update is the root cause. Forcing it to use the legacy "slow" processor by setting the flag to false fixes the issue.
For context, the model I am using is qwen2.5-vl-32b.
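If this theory is right, the same workaround can be exercised directly at the transformers level by forcing the legacy image processor when the multimodal processor is loaded. A sketch, assuming use_fast is forwarded to the image processor as the warning message quoted above implies (the Hub id is an example):

from transformers import AutoProcessor

# Force the legacy "slow" (PIL-based) image processor instead of the
# fast (torch/torchvision-based) one that newer transformers prefers.
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2.5-VL-32B-Instruct",
    use_fast=False,
)
print(type(processor.image_processor).__name__)  # should not be a *Fast class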
@ZhouZineng Thanks for your feedback! We'll investigate this issue.
I am using Qwen2.5-1.5B for SFT, and --no_use_fast_tokenizer makes it slower than --use_fast_tokenizer.
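That is not necessarily a contradiction. For a text-only model the flag can only affect the tokenizer, and the Rust-backed fast tokenizer genuinely is faster than the slow Python one, while the multimodal slowdown discussed above sits in the image processor, which a text-only model never loads. A sketch of the two separate use_fast switches (model ids are examples):

from transformers import AutoTokenizer, AutoImageProcessor

# Text: the Rust-backed fast tokenizer is usually much faster, so keep it
# for text-only models such as Qwen2.5-1.5B.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B", use_fast=True)

# Images: the suspected regression is in the fast image processor, which
# only exists for vision models such as Qwen2.5-VL.
image_processor = AutoImageProcessor.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", use_fast=False
)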
Any conclusion or progress? Thanks for your work!
Any conclusion or progress?
Converting format of dataset: 46876 examples [00:01, 13100.96 examples/s]
Running tokenizer on dataset: 13%|█████████▋ | 3000/23438 [17:59<2:03:36, 2.76 examples/s]