
Running tokenizer on dataset gradually slows down

Open · xuyue1112 opened this issue 1 year ago

Reminder

  • [X] I have read the README and searched the existing issues.

System Info

  • llamafactory version: 0.9.1.dev0
  • Platform: Linux-5.15.120.bsk.2-amd64-x86_64-with-glibc2.31
  • Python version: 3.11.2
  • PyTorch version: 2.4.0 (GPU)
  • Transformers version: 4.45.0.dev0
  • Datasets version: 2.21.0
  • Accelerate version: 0.34.2
  • PEFT version: 0.12.0
  • TRL version: 0.9.6
  • GPU type: NVIDIA A800-SXM4-40GB

Reproduction

dataset

dataset: xxx
eval_dataset: xxx
template: qwen2_vl
cutoff_len: 4096
max_samples: 5000000
overwrite_cache: true
preprocessing_num_workers: 16

Expected behavior

During training, the throughput of "Running tokenizer on dataset" gradually drops from a few hundred samples/s down to single digits. Any idea where the problem might be?
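
For context, here is a minimal sketch of the kind of `datasets.map` tokenization pass where the slowdown shows up (this is not LLaMA-Factory's actual preprocessing code; the model path and data file are placeholders), with overall throughput printed at the end:

```python
import time

from datasets import load_dataset
from transformers import AutoTokenizer

# Placeholder model/data; the real run uses the qwen2_vl template and a custom dataset.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
dataset = load_dataset("json", data_files="train.json", split="train")

def tokenize(batch):
    # Truncation mirrors cutoff_len: 4096 from the config above.
    return tokenizer(batch["text"], truncation=True, max_length=4096)

start = time.time()
tokenized = dataset.map(
    tokenize,
    batched=True,
    num_proc=16,                 # preprocessing_num_workers: 16
    load_from_cache_file=False,  # overwrite_cache: true
    remove_columns=dataset.column_names,
)
print(f"{len(tokenized) / (time.time() - start):.1f} samples/s overall")
```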

Others

xuyue1112 avatar Sep 15 '24 13:09 xuyue1112

Based on my own testing, #5458 should fix this issue.
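
For illustration (a general sketch of the idea, not the actual diff in #5458): a greedy sequence-packing pass that rescans all remaining lengths for every pick is O(n²), which is exactly the kind of cost that makes throughput sink as more samples accumulate. Keeping the lengths sorted and doing a binary search per pick brings it down to roughly O(n log n):

```python
import bisect

def greedy_knapsack(lengths: list[int], capacity: int) -> list[list[int]]:
    """Pack sequence lengths into groups of at most `capacity` tokens.

    Assumes every length is <= capacity (i.e. sequences were already
    truncated to cutoff_len), so each outer iteration makes progress.
    """
    remaining = sorted(lengths)
    knapsacks = []
    while remaining:
        space = capacity
        pack = []
        while remaining:
            # Binary search for the largest length that still fits,
            # instead of a linear scan over all remaining lengths.
            index = bisect.bisect_right(remaining, space) - 1
            if index < 0:
                break
            pack.append(remaining.pop(index))
            space -= pack[-1]
        knapsacks.append(pack)
    return knapsacks
```

Note that `list.pop(index)` still shifts elements internally, but the dominant per-pick cost (the search) becomes logarithmic.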

AlongWY avatar Sep 18 '24 14:09 AlongWY

> Based on my own testing, #5458 should fix this issue.

@AlongWY I ran into the same problem, but your fix seems to target the packing case. What should be changed when packing is not used?

Wiselnn570 avatar Oct 26 '24 11:10 Wiselnn570

Does it also drop to single digits without packing? In theory that shouldn't happen.

AlongWY avatar Oct 28 '24 09:10 AlongWY