ms-swift: slow dataset preprocessing when LoRA fine-tuning InternVL3-9B (about 7 hours)

Dataset example

{ "messages": [ { "role": "user", "content": "\nIs there a blue or green color cast in the photo?" }, { "role": "assistant", "content": "Yes" } ], "images": [ "/fine_tune/M_Database/1.jpg" ] }

My dataset contains 78170 samples in this format.

Environment

RTX 3090 × 4, Python 3.10.0, ms-swift 3.4.0

--lazy_tokenize

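At first I overlooked this parameter. The official documentation says it defaults to True for MLLM fine-tuning, which means data preprocessing runs on the fly while the model is being fine-tuned. In that mode my fine-tuning job would take 11 days, so I set it to False. Preprocessing is still very slow, though: even with dataset_num_proc=12 it takes about 7 hours to finish.

Fine-tuning command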

export HF_DATASETS_CACHE="/fine_tune/cachefile/"
swift sft \
    --model /fine_tune/InternVL3-9B/ \
    --train_type lora \
    --dataset '/fine_tune/InternVL3-9B/swift_data.json' \
    --enable_cache True \
    --lazy_tokenize False \
    --dataset_num_proc 12 \
    --torch_dtype bfloat16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --learning_rate 1e-4 \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --gradient_accumulation_steps 16 \
    --eval_steps 50 \
    --save_steps 50 \
    --save_total_limit 2 \
    --logging_steps 5 \
    --max_length 2048 \
    --output_dir output \
    --system 'You are a helpful assistant.' \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4

May 04 '25 09:05 jxma20
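
For scale, the figures quoted in the post above can be turned into a throughput estimate. This is only back-of-the-envelope arithmetic on the reported numbers (78170 samples, about 7 hours, 12 worker processes); it assumes the work was spread evenly across the workers, which the post does not confirm.

```shell
# Figures quoted in the post: 78170 samples, roughly 7 hours, 12 worker processes.
samples=78170
hours=7
procs=12

# Overall throughput in samples per second.
samples_per_sec=$(LC_ALL=C awk -v n="$samples" -v h="$hours" \
    'BEGIN { printf "%.2f", n / (h * 3600) }')

# Average wall-clock seconds one worker spends on one sample
# (assumes an even split of work across workers).
sec_per_sample=$(LC_ALL=C awk -v n="$samples" -v h="$hours" -v p="$procs" \
    'BEGIN { printf "%.2f", (h * 3600 * p) / n }')

echo "throughput: ${samples_per_sec} samples/s"
echo "per-worker cost: ${sec_per_sample} s/sample"
```

That works out to roughly 3 samples per second overall, or close to 4 seconds per sample per worker, which suggests the cost is dominated by per-image work rather than by text tokenization.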

Fine-tuning and preprocessing overlap in time.

To speed up fine-tuning, see: https://github.com/modelscope/ms-swift/blob/main/examples/train/packing/streaming.sh

May 04 '25 09:05 Jintao-Huang

Hello, thank you for the reply!

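From what I observed while the job was running, the program only loaded the model into GPU memory, and GPU utilization stayed close to 0 the whole time. So in my case preprocessing and fine-tuning were not running concurrently. I think this matches what lazy_tokenize=False should do: run the map preprocessing first, then start fine-tuning.
The processing time may be related to InternVL's preprocessing logic, which could be slower. However, I did similar preprocessing when I fine-tuned Qwen2.5VL, except that I wrote the preprocessing function myself and ran it with datasets.map. On the same dataset with 12 processes it took only a little over ten minutes, whereas this time it takes 7 hours. That is a large gap.
Even at 7 hours, I let the map run to completion. When it finished, it printed this line: Dataset filtered, origin length: 77389, filtered dataset length: 18755. I have seen a related issue in this project, but it received no reply. I would like to know why this happens and how to avoid it.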
In the end the program unfortunately crashed. After the map finished, it printed the input_ids and labels_ids of one sample and then appeared to start another map operation, which failed immediately. The error output was:

[input_ids: [...]] [labels_ids: [...]]
Map (num_proc=12): 0%| | 0/18755 [05:16<?, ? examples/s]
Traceback (most recent call last):
  File "/DataB/mjx/.conda/envs/InternVL/lib/python3.10/site-packages/swift/cli/sft.py", line 7, in <module>
    sft_main()
  File "/DataB/mjx/.conda/envs/InternVL/lib/python3.10/site-packages/swift/llm/train/sft.py", line 281, in sft_main
    return SwiftSft(args).main()
  File "/DataB/mjx/.conda/envs/InternVL/lib/python3.10/site-packages/swift/llm/base.py", line 47, in main
    result = self.run()
  File "/DataB/mjx/.conda/envs/InternVL/lib/python3.10/site-packages/swift/llm/train/sft.py", line 121, in run
    train_dataset, val_dataset = self._encode_dataset(train_dataset, val_dataset)
  File "/DataB/mjx/.conda/envs/InternVL/lib/python3.10/site-packages/swift/llm/train/sft.py", line 273, in _encode_dataset
    self.train_msg['train_dataset'] = self._stat_dataset(train_dataset)
  File "/DataB/mjx/.conda/envs/InternVL/lib/python3.10/site-packages/swift/llm/train/sft.py", line 232, in _stat_dataset
    dataset = GetLengthPreprocessor()(dataset, num_proc=args.dataset_num_proc)
  File "/DataB/mjx/.conda/envs/InternVL/lib/python3.10/site-packages/swift/llm/dataset/preprocessor/core.py", line 305, in __call__
    dataset_mapped = dataset.map(
  File "/DataB/mjx/.conda/envs/InternVL/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 557, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/DataB/mjx/.conda/envs/InternVL/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3171, in map
    for rank, done, content in iflatmap_unordered(
  File "/DataB/mjx/.conda/envs/InternVL/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 721, in iflatmap_unordered
    raise RuntimeError(
RuntimeError: One of the subprocesses has abruptly died during map operation.To debug the error, disable multiprocessing.

Looking forward to your reply!

May 04 '25 13:05 jxma20
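
The error message in the post above suggests disabling multiprocessing to surface the underlying failure. A minimal sketch of such a debugging run follows, reusing only flags from the original command with --dataset_num_proc set to 1. Two things here are assumptions rather than anything stated in the thread: the `#500` suffix, which relies on ms-swift's dataset sampling syntax to keep the run short and should be checked against the documentation for the installed version, and the hypothetical `debug_run` wrapper and `output_debug` directory, which exist only for illustration.

```shell
# Hypothetical helper: a reduced, single-process run intended only for debugging.
debug_run() {
    # '#500' (assumed ms-swift sampling syntax) limits the run to 500 samples.
    # --dataset_num_proc 1 keeps the map in one process, so a failure should
    # raise its own traceback instead of killing a subprocess silently.
    swift sft \
        --model /fine_tune/InternVL3-9B/ \
        --train_type lora \
        --dataset '/fine_tune/InternVL3-9B/swift_data.json#500' \
        --lazy_tokenize False \
        --dataset_num_proc 1 \
        --torch_dtype bfloat16 \
        --max_length 2048 \
        --output_dir output_debug
}
```

Calling `debug_run` after adjusting the paths should either reproduce the failure with a readable traceback or show that the crash only occurs with multiple processes, which would point toward memory pressure in the workers.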

same question

May 30 '25 02:05 codesssss

Do not use lazy_tokenize false with multimodal models.

To improve training speed, see: https://github.com/modelscope/ms-swift/blob/main/examples/train/padding_free/sft.sh

May 30 '25 02:05 Jintao-Huang

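But after switching to lazy_tokenize true, the map step over the whole dataset is still extremely slow. Also, can the map step only run in a single process? When I tried using multiple processes, the map step hung and then failed with: RuntimeError: One of the subprocesses has abruptly died during map operation.To debug the error, disable multiprocessing.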

May 30 '25 02:05 novisfff

--dataset_num_proc 8

Jun 23 '25 08:06 Jintao-Huang

You can try padding_free or streaming to solve this.

Jun 23 '25 08:06 Jintao-Huang
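
One practical detail when trying the streaming route: a streamed dataset has no known length, so the trainer typically needs an explicit step count instead of an epoch count. The sketch below estimates the steps for one pass over the data from the values in the original command. It assumes all four GPUs run data-parallel and that the count is passed through a `--max_steps` style option, neither of which is stated in the thread; with packing enabled the real number would differ.

```shell
# Values taken from the original post and command.
samples=78170
per_device_batch=4
grad_accum=16
gpus=4   # assumption: all four RTX 3090s are used data-parallel

# Samples consumed per optimizer step across all devices.
samples_per_step=$(( per_device_batch * grad_accum * gpus ))

# Ceiling division: optimizer steps needed to see every sample once.
max_steps=$(( (samples + samples_per_step - 1) / samples_per_step ))

echo "samples per optimizer step: ${samples_per_step}"
echo "steps for one epoch: ${max_steps}"
```

With these numbers each optimizer step covers 256 samples, so one epoch is about 306 steps. That also means the --eval_steps 50 and --save_steps 50 settings in the original command would trigger only about six times per epoch.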