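When fine-tuning MiniCPM-V with LLaMA-Factory on a large dataset, every run stalls for a long time at the tokenizer stage and then fails because of a communication timeout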

Relevant library versions:

accelerate                   1.6.0
datasets                     3.5.0
deepspeed                    0.16.5
flash-attn                   2.7.0.post2
huggingface-hub              0.30.2
llamafactory                 0.9.3.dev0
tokenizers                   0.21.1
torch                        2.4.0+cu121
torchaudio                   2.4.0+cu121
torchlibrosa                 0.1.0
torchvision                  0.19.0+cu121
transformers                 4.48.3

Dataset size: ~4 million samples. Hardware: 2 × 8 H100

Every run gets stuck at the tokenizer stage shown below. Throughput keeps dropping over time until the communication is interrupted:

Running tokenizer on dataset (num_proc=32):  18%|█▊        | 712000/3983746 [1:00:39<1:46:43, 510.90 examples/s]
Running tokenizer on dataset (num_proc=32):  18%|█▊        | 713000/3983746 [1:00:44<2:28:54, 366.09 examples/s]
Running tokenizer on dataset (num_proc=32):  18%|█▊        | 714000/3983746 [1:00:45<2:02:31, 444.74 examples/s]
Running tokenizer on dataset (num_proc=32):  18%|█▊        | 715000/3983746 [1:00:45<1:35:48, 568.67 examples/s]
Running tokenizer on dataset (num_proc=32):  18%|█▊        | 716000/3983746 [1:00:55<3:41:42, 245.64 examples/s]
Running tokenizer on dataset (num_proc=32):  18%|█▊        | 717000/3983746 [1:00:58<3:31:58, 256.85 examples/s]
Running tokenizer on dataset (num_proc=32):  18%|█▊        | 718000/3983746 [1:00:59<2:36:00, 348.87 examples/s]
Running tokenizer on dataset (num_proc=32):  18%|█▊        | 718000/3983746 [1:01:13<2:36:00, 348.87 examples/s]
Running tokenizer on dataset (num_proc=32):  18%|█▊        | 719000/3983746 [1:01:22<8:04:37, 112.28 examples/s]
Running tokenizer on dataset (num_proc=32):  18%|█▊        | 720000/3983746 [1:01:30<7:51:20, 115.41 examples/s]
Running tokenizer on dataset (num_proc=32):  18%|█▊        | 720000/3983746 [1:01:43<7:51:20, 115.41 examples/s]
Running tokenizer on dataset (num_proc=32):  18%|█▊        | 721000/3983746 [1:01:50<10:56:12, 82.87 examples/s]
Running tokenizer on dataset (num_proc=32):  18%|█▊        | 722000/3983746 [1:01:53<8:22:41, 108.14 examples/s]

Is there a way to tokenize lazily, at get_item time, instead of loading and tokenizing the whole dataset up front?
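A minimal sketch of the lazy approach asked about here, under several assumptions: the data is a JSONL file with one record per line, each record has a text field (the `text_key` name is hypothetical), and `toy_tokenize` is a stand-in for a real tokenizer. It is not LLaMA-Factory's data pipeline and it ignores the image inputs MiniCPM-V needs; it only illustrates indexing byte offsets up front and tokenizing one record per access.

```python
import json
from typing import Callable, Dict, List


class LazyTokenizedDataset:
    """Map-style dataset that tokenizes one JSONL record per __getitem__ call."""

    def __init__(self, path: str, tokenize: Callable[[str], List[int]], text_key: str = "text"):
        self.path = path
        self.tokenize = tokenize
        self.text_key = text_key  # hypothetical field name, adjust to the real schema
        # Index only the byte offset of each non-blank line; records stay on disk.
        self.offsets: List[int] = []
        with open(path, "rb") as f:
            offset = 0
            for line in f:
                if line.strip():
                    self.offsets.append(offset)
                offset += len(line)

    def __len__(self) -> int:
        return len(self.offsets)

    def __getitem__(self, idx: int) -> Dict[str, List[int]]:
        # Read and tokenize a single record on demand.
        with open(self.path, "rb") as f:
            f.seek(self.offsets[idx])
            record = json.loads(f.readline().decode("utf-8"))
        input_ids = self.tokenize(record[self.text_key])
        return {"input_ids": input_ids, "attention_mask": [1] * len(input_ids)}


def toy_tokenize(text: str) -> List[int]:
    # Hypothetical stand-in for a real tokenizer: one id per whitespace-separated word.
    return [len(word) for word in text.split()]
```

Because the class only defines `__len__` and `__getitem__`, it can be passed to a map-style data loader as is. The trade-off is that tokenization cost moves from a one-time preprocessing pass into every training step, so data-loader workers would have to keep up with the GPUs.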

I also tried LLaMA-Factory's streaming mode, but it fails with an error and I can't tell why:

Apr 29 '25 06:04 TimeOverflow

We'll look into the code and make a change for this.

Jun 23 '25 04:06 qyc-98

Hello, and thank you for your feedback.

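Because your dataset is large, the preprocessing stage runs into significant I/O and memory bottlenecks. The tokenizer processes gradually slow down, and the run is eventually interrupted by a communication timeout. You could try reducing the dataset size, or contact the LLaMA-Factory maintainers about changing the data-loading logic. As for the error in streaming mode, based on the information available it may be related to the dataset's format or organization, so we suggest checking whether the data format meets the requirements for streaming reads. If you can provide a more detailed error log and the dataset structure, we can narrow the problem down further.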
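One way to check the data format before retrying streaming mode is to scan the file for records that fail to parse or whose fields differ from the first record. The sketch below assumes the dataset is stored as JSONL (one JSON object per line). `find_bad_records` is a hypothetical helper, not part of LLaMA-Factory, and the actual cause of the streaming error is not known from this thread.

```python
import json
from typing import List, Tuple


def find_bad_records(path: str, max_report: int = 20) -> List[Tuple[int, str]]:
    """Return (line_number, reason) for lines that fail to parse
    or whose key set differs from the first valid record."""
    problems: List[Tuple[int, str]] = []
    expected_keys = None
    with open(path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # skip blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((line_no, f"invalid JSON: {exc.msg}"))
            else:
                if not isinstance(record, dict):
                    problems.append((line_no, "record is not a JSON object"))
                elif expected_keys is None:
                    expected_keys = set(record)  # first valid record defines the schema
                elif set(record) != expected_keys:
                    diff = sorted(set(record) ^ expected_keys)
                    problems.append((line_no, f"keys differ from first record: {diff}"))
            if len(problems) >= max_report:
                break  # stop early so a badly broken file does not flood the report
    return problems
```

The scan reads one line at a time, so it stays cheap in memory even on a file with millions of records. An empty result only rules out these two kinds of problem; it does not show that the file is compatible with streaming.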

Aug 14 '25 07:08 ZMXJJ