[Bug] Error when setting use_packed_ds=True
Checklist
- [ ] 1. I have searched related issues but cannot get the expected help.
- [ ] 2. The bug has not been fixed in the latest version.
- [ ] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
Describe the bug
An error is raised when use_packed_ds=True is set:
python3.10/site-packages/transformers/trainer.py", line 620, in __init__
    raise ValueError(
ValueError: The train_dataset does not implement __len__, max_steps has to be specified. The number of steps needs to be known in advance for the learning rate scheduler.
Reproduction
--use-env --nnodes=1 --node_rank=0 --master_addr=127.0.0.1 --nproc_per_node=1 --master_port=34229 internvl_chat_finetune.py --model_name_or_path /media/user/2.0TB/llmmodel/llam_factory/InternVL2_5-4B --conv_style "internvl2_5" --use_fast_tokenizer False --output_dir work_dirs/internvl_chat_v2_5/internvl2_5_4b_dynamic_res_2nd_finetune_lora --meta_path "./shell/data/internvl_1_2_finetune_custom.json" --overwrite_output_dir True --force_image_size 448 --max_dynamic_patch 6 --down_sample_ratio 0.5 --drop_path_rate 0.0 --freeze_llm True --freeze_mlp True --freeze_backbone True --use_llm_lora 16 --vision_select_layer -1 --dataloader_num_workers 4 --bf16 True --num_train_epochs 1 --per_device_train_batch_size 1 --gradient_accumulation_steps 8 --evaluation_strategy "no" --save_strategy "steps" --save_steps 200 --save_total_limit 1 --learning_rate 4e-5 --weight_decay 0.01 --warmup_ratio 0.03 --lr_scheduler_type "cosine" --logging_steps 1 --max_seq_length 8192 --do_train True --grad_checkpoint True --group_by_length True --dynamic_image_size True --use_thumbnail True --ps_version v2 --deepspeed "zero_stage1_config.json" --report_to "tensorboard"
Environment
Ubuntu 20
Error traceback
same error
same error here
The error was solved by adding "max_steps": 1000000 to the training arguments.
Please set max_steps, since our PackedDataset is implemented based on IterableDataset.
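To illustrate the point above: an iterable-style dataset defines `__iter__` but not `__len__`, so the trainer cannot derive the number of steps and needs `max_steps`. The following is a minimal pure-Python sketch, not InternVL or transformers code. `StreamingDataset`, `has_length`, and `estimate_max_steps` are hypothetical names used for illustration only. The estimate assumes one sample per sequence, so with packing enabled it is only an upper bound for one pass over the data.

```python
import math


class StreamingDataset:
    """Stand-in for an iterable-style dataset: it has __iter__ but no __len__."""

    def __init__(self, samples):
        self._samples = samples

    def __iter__(self):
        return iter(self._samples)


def has_length(dataset) -> bool:
    """Return True if len(dataset) works, False if the object has no length."""
    try:
        len(dataset)
        return True
    except TypeError:
        return False


def estimate_max_steps(num_samples, per_device_batch_size, grad_accum_steps,
                       world_size=1, epochs=1):
    """Rough optimizer-step count, assuming one sample per sequence (no packing)."""
    effective_batch = per_device_batch_size * grad_accum_steps * world_size
    return math.ceil(num_samples / effective_batch) * epochs


dataset = StreamingDataset(range(1600))
print(has_length(dataset))             # False: max_steps must be given explicitly
print(estimate_max_steps(1600, 1, 8))  # 200 with batch size 1 and 8 accumulation steps
```

With the settings from the reproduction command (per_device_train_batch_size 1, gradient_accumulation_steps 8, one process), a value derived this way could be passed as --max_steps instead of an arbitrarily large number.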
Hi, thank you for pointing this out.
After setting max_steps, a new error occurs:
0%| | 0/1000000 [00:00<?, ?it/s][2025-08-29 18:55:31,661] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-29 18:55:40,201] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-29 18:55:49,259] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-29 18:55:58,241] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[packed_collate_fn] feature[0] missing data_index; keys=['input_ids', 'labels', 'attention_mask', 'position_ids', 'pixel_values', 'image_flags']
[packed_collate_fn] feature[0] missing data_index; keys=['input_ids', 'labels', 'attention_mask', 'position_ids', 'pixel_values', 'image_flags']
[packed_collate_fn] feature[0] missing data_index; keys=['input_ids', 'labels', 'attention_mask', 'position_ids', 'pixel_values', 'image_flags']
[packed_collate_fn] feature[0] missing data_index; keys=['input_ids', 'labels', 'attention_mask', 'position_ids', 'pixel_values', 'image_flags']
Traceback (most recent call last):
  File "/home/yuxin/CL_CoTNav/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 1072, in <module>
    main()
  File "/home/yuxin/CL_CoTNav/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 1057, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/yuxin/miniconda3/envs/cl_cotnav/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/home/yuxin/miniconda3/envs/cl_cotnav/lib/python3.10/site-packages/transformers/trainer.py", line 1836, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File "/home/yuxin/miniconda3/envs/cl_cotnav/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
    data = self._next_data()
  File "/home/yuxin/miniconda3/envs/cl_cotnav/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1346, in _next_data
    return self._process_data(data)
  File "/home/yuxin/miniconda3/envs/cl_cotnav/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1372, in _process_data
    data.reraise()
  File "/home/yuxin/miniconda3/envs/cl_cotnav/lib/python3.10/site-packages/torch/_utils.py", line 705, in reraise
    raise exception
KeyError: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/yuxin/miniconda3/envs/cl_cotnav/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
  File "/home/yuxin/miniconda3/envs/cl_cotnav/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 42, in fetch
    return self.collate_fn(data)
  File "/home/yuxin/miniconda3/envs/cl_cotnav/lib/python3.10/site-packages/transformers/trainer_utils.py", line 772, in __call__
    return self.data_collator(features)
  File "/home/yuxin/CL_CoTNav/InternVL/internvl_chat/internvl/train/dataset_packed.py", line 596, in packed_collate_fn
    data_index = feat.pop('data_index')
KeyError: 'data_index'
Even after I switched the dataset to the COCO example, the result is the same: training works when data packing is disabled, and the error above occurs when it is enabled.
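For anyone debugging this, the traceback shows packed_collate_fn calling feat.pop('data_index') on features that the log reports as lacking that key. The sketch below is a diagnostic aid only, not a fix and not InternVL code. `find_missing_keys` is a hypothetical helper, and the required key list is assumed from the keys printed in the log plus `data_index`. It reports which features in a batch lack expected keys, so the check can be run on dataset output before it reaches the collator.

```python
# Keys seen in the log output above, plus the one the collator tries to pop.
# This list is an assumption for illustration, not the project's definition.
REQUIRED_KEYS = (
    "input_ids", "labels", "attention_mask", "position_ids",
    "pixel_values", "image_flags", "data_index",
)


def find_missing_keys(features, required=REQUIRED_KEYS):
    """Map each feature's position to the required keys it lacks (if any)."""
    report = {}
    for position, feature in enumerate(features):
        missing = [key for key in required if key not in feature]
        if missing:
            report[position] = missing
    return report


# A feature shaped like the one in the log: everything except data_index.
batch = [{
    "input_ids": [1, 2], "labels": [1, 2], "attention_mask": [1, 1],
    "position_ids": [0, 1], "pixel_values": [], "image_flags": [],
}]
print(find_missing_keys(batch))  # {0: ['data_index']}
```

An empty result means every feature carries all the listed keys. A non-empty one points at the dataset stage that produced the incomplete feature rather than at the collator.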
I'm encountering the same issue. Has it been resolved? Thanks.
Unfortunately not yet. Waiting for the development team.
This problem has been solved by using the new internvl_chat_gpt_oss with the following environment (2B SFT on Ubuntu 20.04):
- flash_attn==2.7.4.post1
- ldd --version | head -n1 --> ldd (Ubuntu GLIBC 2.31-0ubuntu9.9) 2.31
- torch==2.5.1+cu121
- torchaudio==2.5.1+cu121
- torchvision==0.20.1+cu121
- transformers==4.55.0
- deepspeed==0.14.4
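To compare a local environment against the versions reported above, a small check like the following may help. It is a sketch using only the standard library. `REPORTED` simply copies the versions from this comment, and `compare_versions` is a hypothetical helper name. Matching versions do not by themselves guarantee that the error goes away.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from the comment above.
REPORTED = {
    "flash_attn": "2.7.4.post1",
    "torch": "2.5.1+cu121",
    "torchaudio": "2.5.1+cu121",
    "torchvision": "0.20.1+cu121",
    "transformers": "4.55.0",
    "deepspeed": "0.14.4",
}


def compare_versions(expected):
    """Return {package: (expected, installed or None, versions_match)}."""
    result = {}
    for name, wanted in expected.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed in this environment
        result[name] = (wanted, installed, installed == wanted)
    return result


for name, (wanted, installed, ok) in compare_versions(REPORTED).items():
    status = "OK" if ok else "DIFFERS"
    print(f"{name}: expected {wanted}, found {installed} [{status}]")
```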