LLaMA-Factory streaming training hangs at the first step

System Info

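I am training Qwen2.5-VL-7B on a video dataset of about 150k samples. Preprocessing took too long, so I switched to streaming data loading, and now training hangs at the first step.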

Mar 12 '25 03:03 zfr00

My dataset settings are:

### dataset
dataset: structured_training_data  # video: mllm_video_demo
buffer_size: 128
preprocessing_batch_size: 64
streaming: true
accelerator_config:
  dispatch_batches: false
max_steps: 4000
template: qwen2_vl

Mar 12 '25 03:03 zfr00

Try https://github.com/hiyouga/LLaMA-Factory/pull/7530 and see whether it fixes the problem. Add the following to your YAML:

dataset_shards: 16
dataloader_num_workers: 16

Replace 16 with your CPU count divided by your GPU count.
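
The rule of thumb above can be sketched in Python. The helper name `workers_per_gpu` is hypothetical and not part of LLaMA-Factory; it only performs the division, assuming one training process per GPU and that you supply the GPU count yourself.

```python
import os


def workers_per_gpu(cpu_count: int, gpu_count: int) -> int:
    """Integer-divide CPU cores across GPUs, never returning less than 1."""
    if cpu_count < 1 or gpu_count < 1:
        raise ValueError("cpu_count and gpu_count must both be >= 1")
    return max(1, cpu_count // gpu_count)


if __name__ == "__main__":
    cpus = os.cpu_count() or 1  # os.cpu_count() may return None
    gpus = 8  # assumption: replace with the number of GPUs you train on
    n = workers_per_gpu(cpus, gpus)
    # Print the two YAML lines suggested in the post above
    print(f"dataset_shards: {n}")
    print(f"dataloader_num_workers: {n}")
```

For example, a machine with 100 CPU cores and 8 GPUs would give a value of 12.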

Mar 29 '25 11:03 aliencaocao

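> Try #7530 and see whether it fixes the problem. Add the following to your YAML:
> dataset_shards: 16
> dataloader_num_workers: 16
> Replace 16 with your CPU count divided by your GPU count.

I tried adding the 'dataset_shards' and 'dataloader_num_workers' parameters, but I get an error saying that 'dataset_shards' cannot be parsed. My qwen2_5vl_lora_sft.yaml is configured as follows: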

### model
model_name_or_path: /root/autodl-tmp/weights/Qwen2.5-VL-7B-Instruct
trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 16
lora_alpha: 16
lora_dropout: 0.1
lora_target: all
rope_scaling: yarn
flash_attn: auto

### dataset
dataset: mydata
buffer_size: 128
preprocessing_batch_size: 128
template: qwen2_vl
cutoff_len: 16384
streaming: true
accelerator_config:
  dispatch_batches: false
dataset_shards: 10
dataloader_num_workers: 10
overwrite_cache: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 1
learning_rate: 5e-5
max_steps: 1000
lr_scheduler_type: cosine
max_grad_norm: 8.0
warmup_steps: 0
packing: false
bf16: true
optim: adamw_torch
include_num_input_tokens_seen: true

### output
output_dir: saves/Qwen2.5-VL-7B-Instruct/lora/train_2025-03-30-13-03-57
report_to: none
plot_loss: true
logging_steps: 50
save_steps: 500
ddp_timeout: 180000000

Mar 30 '25 07:03 cui0711

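I also need to process a fairly large multimodal dataset (about 7,000 samples, each containing 20 images). Without streaming I get CUDA OOM; with streaming enabled, training hangs at the data loading stage.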

Mar 30 '25 07:03 cui0711

Have you checked out my commit? With the GitHub CLI, run gh pr checkout 7530 in the project directory.

Mar 30 '25 07:03 aliencaocao

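> Have you checked out my commit? With the GitHub CLI, run gh pr checkout 7530 in the project directory.

Thanks, the previous problem is solved. But now I still get a GPU out-of-memory error even with streaming enabled. Where might the problem be? Is it because the dataset contains too many images? Each image is 800x600.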

Mar 30 '25 08:03 cui0711

Your batch size is too large; this has nothing to do with whether streaming is enabled. My PR fixes the hang at the first step.

Mar 30 '25 08:03 aliencaocao

Try reducing cutoff_len, to 4096.

Mar 30 '25 08:03 aliencaocao

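> Try reducing cutoff_len, to 4096.

Thanks! But when I reduce cutoff_len, I run into the problem of exceeding the maximum number of input tokens, so all I can do is increase cutoff_len. Once I increase it, I get CUDA OOM.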

Mar 30 '25 09:03 cui0711

Is there any way to solve this?

Mar 30 '25 09:03 cui0711

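I am using 2x A800 (80G). Every time, once the model has loaded and each card is using about 32G of GPU memory, an OOM error is raised, even though usage is still far below 80G.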

Mar 30 '25 09:03 cui0711

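A single one of your GPUs simply does not have enough memory for a context length that large. Reducing cutoff_len and accepting that some samples exceed the token limit is a common compromise. Otherwise you could try QLoRA, but it will be slower.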
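
A rough back-of-the-envelope estimate suggests why 4096 tokens is too short for the data described earlier in the thread (20 images of 800x600 per sample). The sketch below assumes a Qwen2-VL-style processor that spends about one token per 28x28 pixel block (14-pixel patches merged 2x2) and rounds each side to the nearest multiple of 28. It ignores any minimum or maximum pixel limits and all text tokens, so treat the numbers as an approximation. The function name is hypothetical.

```python
def estimate_image_tokens(width: int, height: int, pixels_per_token_side: int = 28) -> int:
    """Approximate visual tokens for one image.

    Assumption: one token per 28x28 pixel block after rounding each side
    to the nearest multiple of 28; no min/max pixel resizing is applied.
    """
    cols = max(1, round(width / pixels_per_token_side))
    rows = max(1, round(height / pixels_per_token_side))
    return cols * rows


per_image = estimate_image_tokens(800, 600)
per_sample = 20 * per_image  # 20 images per sample, text tokens not included
print(f"tokens per image: {per_image}, tokens per sample: {per_sample}")
```

Under these assumptions one image costs about 609 tokens and a 20-image sample about 12,180, which overflows a cutoff_len of 4096 but fits within 16384. Fewer or smaller images per sample would shrink that number.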

Mar 30 '25 10:03 aliencaocao

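With the same data, when I fine-tuned LLaVA-NeXT-7B using their official code, GPU memory was not a problem, but host RAM kept running out, especially when saving checkpoints. Then I came to try LLaMA-Factory, and now GPU memory runs out instead. A bit of a headache, hahaha.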

Mar 30 '25 10:03 cui0711

This is probably no longer the same problem as the OP's.

Mar 30 '25 10:03 aliencaocao

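> Try #7530 and see whether it fixes the problem. Add the following to your YAML:
> dataset_shards: 16
> dataloader_num_workers: 16
> Replace 16 with your CPU count divided by your GPU count.

After changing to this setting, it still hangs at the first step.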

Apr 02 '25 02:04 xiaolan98

When fetching one batch, it processes far more data than a single batch requires.
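
One plausible reason for this behavior is the shuffle buffer used in streaming mode: a buffered shuffle cannot emit anything until the buffer has been filled. The sketch below is a generic shuffle buffer written for illustration, not the code used by LLaMA-Factory or the datasets library. `CountingSource` and `shuffle_buffer` are hypothetical names.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


class CountingSource:
    """Iterable over range(n) that counts how many items were pulled from it."""

    def __init__(self, n: int) -> None:
        self.n = n
        self.pulled = 0

    def __iter__(self) -> Iterator[int]:
        for i in range(self.n):
            self.pulled += 1
            yield i


def shuffle_buffer(source: Iterable[T], buffer_size: int, seed: int = 0) -> Iterator[T]:
    """Yield items in approximately random order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in source:
        if len(buffer) < buffer_size:
            buffer.append(item)  # still filling: nothing is emitted yet
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]        # emit a random buffered item...
        buffer[idx] = item       # ...and put the new item in its slot
    rng.shuffle(buffer)          # source exhausted: flush what is left
    yield from buffer


source = CountingSource(1000)
stream = shuffle_buffer(source, buffer_size=128)
first = next(stream)
print(f"pulled {source.pulled} items before yielding the first one")
```

With buffer_size 128, this sketch pulls 129 upstream items before the first one comes out. If each upstream item is an expensively preprocessed multimodal sample, that cost is paid before the first training step can start.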

Apr 02 '25 02:04 xiaolan98

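> When fetching one batch, it processes far more data than a single batch requires.

What are your buffer_size and preprocessing_batch_size set to?

Try setting buffer_size to the global batch size and preprocessing_batch_size to 1.

Also set TOKENIZERS_PARALLELISM=0.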
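
A minimal sketch of that suggestion, assuming "global batch size" means per-device batch size times gradient accumulation steps times number of GPUs. The helper `suggest_streaming_settings` is hypothetical, and the example values (8, 1, 8) are placeholders rather than anyone's actual configuration.

```python
import os


def suggest_streaming_settings(per_device_batch_size: int,
                               gradient_accumulation_steps: int,
                               num_gpus: int) -> dict:
    """Return the two values suggested above as a plain dict."""
    # Assumption: this is the definition of "global batch size" intended in the post
    global_batch_size = per_device_batch_size * gradient_accumulation_steps * num_gpus
    return {
        "buffer_size": global_batch_size,
        "preprocessing_batch_size": 1,
    }


# Same effect as exporting TOKENIZERS_PARALLELISM=0 in the shell;
# it has to be set before the tokenizer is first used.
os.environ["TOKENIZERS_PARALLELISM"] = "0"

settings = suggest_streaming_settings(
    per_device_batch_size=8,        # placeholder value
    gradient_accumulation_steps=1,  # placeholder value
    num_gpus=8,                     # placeholder value
)
for key, value in settings.items():
    print(f"{key}: {value}")
```

The printed lines can be copied into the dataset section of the training YAML.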

Apr 02 '25 06:04 aliencaocao

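> Try #7530 and see whether it fixes the problem. Add the following to your YAML:
> dataset_shards: 16
> dataloader_num_workers: 16
> Replace 16 with your CPU count divided by your GPU count.

This really solved a huge problem for me. Yesterday I spent 20 hours tokenizing all the data, and then it failed at the final save. I was crushed.

One question: there is a message saying that when multiple datasets are used in streaming mode, they will not be mixed. Does that mean training goes through the datasets one after another, in order? What should I do if I want them mixed?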

May 13 '25 12:05 RunsenXu

@RunsenXu https://github.com/hiyouga/LLaMA-Factory/blob/ab2c05115b9800a07b7e7bfb7359c1337856472f/src/llamafactory/hparams/data_args.py#L66-L69 interleave
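
To illustrate what interleaving means compared with going through datasets one after another, here is a conceptual sketch in plain Python. It is not LLaMA-Factory's implementation and does not use its options; `interleave_round_robin` is a hypothetical name, and stopping at the first exhausted source is just one possible policy.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def interleave_round_robin(sources: List[Iterable[T]]) -> Iterator[T]:
    """Take one item from each source in turn; stop when any source runs out."""
    iterators = [iter(s) for s in sources]
    if not iterators:
        return
    while True:
        for it in iterators:
            try:
                yield next(it)
            except StopIteration:
                return  # first exhausted source ends the mixed stream


dataset_a = [1, 2, 3]
dataset_b = ["a", "b"]
mixed = list(interleave_round_robin([dataset_a, dataset_b]))
print(mixed)
```

Because each step only needs the next item from each source, this kind of mixing works on streams without loading any dataset fully into memory.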

May 13 '25 13:05 hiyouga

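@aliencaocao Hi, I have run into the OP's problem too. Why does the latest code no longer have the dataset_shards parameter? If I use an older version of the repository, it will not support InternVL3, which is the model I want to fine-tune. Is there a new solution?

My config YAML: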

### model
model_name_or_path: OpenGVLab/InternVL3-8B-hf
trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: full
freeze_multi_modal_projector: true
freeze_vision_tower: true
deepspeed: examples/deepspeed/ds_z3_config.json

### dataset
dataset: caption_training_shards_2025_8_11
template: intern_vl
cutoff_len: 20480
overwrite_cache: true
preprocessing_num_workers: 16
# dataset_shards: 8
dataloader_num_workers: 8

### output
output_dir: saves/internvl3-8b/test_train_8_12
logging_steps: 4
save_steps: 500
plot_loss: true
overwrite_output_dir: true
save_only_model: false
report_to: tensorboard  # choices: [none, wandb, tensorboard, swanlab, mlflow]

### train
per_device_train_batch_size: 8
gradient_accumulation_steps: 1
learning_rate: 1.0e-4
num_train_epochs: 1.0
lr_scheduler_type: cosine
warmup_ratio: 0.03
bf16: true
ddp_timeout: 180000000
flash_attn: auto
streaming: true
preprocessing_batch_size: 128
buffer_size: 128
max_steps: 40000
accelerator_config:
  dispatch_batches: false

My dataset is fairly large: 3 million samples, each containing about 16 images, so I chose streaming. My machine has 100 CPU cores, 900 GB of RAM, and 8x A800 GPUs.

Aug 12 '25 13:08 Hhankyangg

I am not the one who removed this parameter. You will need to ask whoever committed that change.

Aug 12 '25 15:08 aliencaocao

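> @aliencaocao Hi, I have run into the OP's problem too. Why does the latest code no longer have the dataset_shards parameter? If I use an older version of the repository, it will not support InternVL3, which is the model I want to fine-tune. Is there a new solution?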

Has this been solved? You can split the dataset manually and then list multiple datasets under dataset.
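
A minimal sketch of the manual split, assuming the records fit in memory while splitting. `split_into_shards` is a hypothetical helper; writing each shard to its own file and registering it as a separate dataset is left out.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_shards(records: Sequence[T], num_shards: int) -> List[List[T]]:
    """Distribute records across num_shards lists in round-robin order."""
    if num_shards < 1:
        raise ValueError("num_shards must be >= 1")
    # Shard i receives records i, i + num_shards, i + 2 * num_shards, ...
    return [list(records[i::num_shards]) for i in range(num_shards)]


shards = split_into_shards(list(range(10)), num_shards=3)
for index, shard in enumerate(shards):
    print(f"shard {index}: {len(shard)} records")
```

Round-robin assignment keeps the shard sizes within one record of each other.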

Aug 30 '25 19:08 YiandLi

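One more question: streaming mode still cannot handle a complete, extremely large dataset (assuming it does not fit in CPU memory), correct? It is not an iterable DB, is it? Is it just a lazy tokenization mode? Also, does LLaMA-Factory support iterable DB? (My dataset is very large, on the order of tens of millions of samples.)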

Aug 30 '25 19:08 YiandLi