InternVL

[Bug] Error when setting use_packed_ds=True

Open cqray1990 opened this issue 11 months ago • 1 comments

Checklist

  • [ ] 1. I have searched related issues but cannot get the expected help.
  • [ ] 2. The bug has not been fixed in the latest version.
  • [ ] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

Setting use_packed_ds=True raises the following error:

python3.10/site-packages/transformers/trainer.py", line 620, in __init__
    raise ValueError(
ValueError: The train_dataset does not implement __len__, max_steps has to be specified. The number of steps needs to be known in advance for the learning rate scheduler.

Reproduction

--use-env --nnodes=1 --node_rank=0 --master_addr=127.0.0.1 --nproc_per_node=1 --master_port=34229 internvl_chat_finetune.py --model_name_or_path /media/user/2.0TB/llmmodel/llam_factory/InternVL2_5-4B --conv_style "internvl2_5" --use_fast_tokenizer False --output_dir work_dirs/internvl_chat_v2_5/internvl2_5_4b_dynamic_res_2nd_finetune_lora --meta_path "./shell/data/internvl_1_2_finetune_custom.json" --overwrite_output_dir True --force_image_size 448 --max_dynamic_patch 6 --down_sample_ratio 0.5 --drop_path_rate 0.0 --freeze_llm True --freeze_mlp True --freeze_backbone True --use_llm_lora 16 --vision_select_layer -1 --dataloader_num_workers 4 --bf16 True --num_train_epochs 1 --per_device_train_batch_size 1 --gradient_accumulation_steps 8 --evaluation_strategy "no" --save_strategy "steps" --save_steps 200 --save_total_limit 1 --learning_rate 4e-5 --weight_decay 0.01 --warmup_ratio 0.03 --lr_scheduler_type "cosine" --logging_steps 1 --max_seq_length 8192 --do_train True --grad_checkpoint True --group_by_length True --dynamic_image_size True --use_thumbnail True --ps_version v2 --deepspeed "zero_stage1_config.json" --report_to "tensorboard"

Environment

Ubuntu 20

Error traceback


cqray1990 avatar Jan 26 '25 02:01 cqray1990

Same error.

lainxx avatar Mar 26 '25 06:03 lainxx

Same error here.

Yuxin916 avatar Aug 09 '25 18:08 Yuxin916

The error was resolved by adding "max_steps": 1000000 to the training args.

Yuxin916 avatar Aug 15 '25 17:08 Yuxin916

Please set max_steps, since our PackedDataset is implemented based on IterableDataset.

Weiyun1025 avatar Aug 29 '25 18:08 Weiyun1025
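
For readers choosing a value for max_steps, here is a minimal sketch of the reasoning. An iterable-style dataset defines __iter__ but not __len__, so the number of steps cannot be derived from it and has to be supplied. The helper estimate_max_steps and the StreamOnly class are hypothetical names used only for illustration, and the sample count of 10000 is an assumed figure. The batch size, accumulation steps, process count and epoch count are taken from the reproduction command above. With packing enabled, several samples are likely merged into one sequence, so the real number of steps per epoch will probably be lower than this estimate.

```python
import math


def estimate_max_steps(num_samples, per_device_batch_size, grad_accum_steps,
                       world_size=1, num_epochs=1.0):
    """Estimate optimizer steps for a dataset that cannot report its length.

    Hypothetical helper: it assumes one sample per batch slot, which ignores
    the effect of sequence packing.
    """
    if min(num_samples, per_device_batch_size, grad_accum_steps, world_size) <= 0:
        raise ValueError("all size arguments must be positive")
    samples_per_step = per_device_batch_size * grad_accum_steps * world_size
    return math.ceil(num_samples * num_epochs / samples_per_step)


class StreamOnly:
    """Stand-in for an iterable-style dataset: __iter__ is defined, __len__ is not."""

    def __init__(self, n):
        self.n = n

    def __iter__(self):
        return iter(range(self.n))


def has_len(dataset):
    """Return True if len(dataset) works, False if the object has no __len__."""
    try:
        len(dataset)
        return True
    except TypeError:
        return False


# Assumed sample count; the other values mirror the reproduction command.
steps = estimate_max_steps(num_samples=10000, per_device_batch_size=1,
                           grad_accum_steps=8, world_size=1, num_epochs=1)
print("has_len(list):", has_len([1, 2, 3]))
print("has_len(StreamOnly):", has_len(StreamOnly(3)))
print("estimated max_steps:", steps)
```

A very large placeholder such as 1000000 also gets past the check, but the cosine schedule and warmup_ratio are computed from max_steps, so an estimate close to the intended training length is probably the safer choice.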

Please set max_steps, since our PackedDataset is implemented based on IterableDataset.

Hi, thank you for pointing this out.

After setting max_steps, a new error occurs:

  0%| | 0/1000000 [00:00<?, ?it/s]
[2025-08-29 18:55:31,661] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-29 18:55:40,201] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-29 18:55:49,259] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-29 18:55:58,241] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[packed_collate_fn] feature[0] missing data_index; keys=['input_ids', 'labels', 'attention_mask', 'position_ids', 'pixel_values', 'image_flags']
[packed_collate_fn] feature[0] missing data_index; keys=['input_ids', 'labels', 'attention_mask', 'position_ids', 'pixel_values', 'image_flags']
[packed_collate_fn] feature[0] missing data_index; keys=['input_ids', 'labels', 'attention_mask', 'position_ids', 'pixel_values', 'image_flags']
[packed_collate_fn] feature[0] missing data_index; keys=['input_ids', 'labels', 'attention_mask', 'position_ids', 'pixel_values', 'image_flags']
Traceback (most recent call last):
  File "/home/yuxin/CL_CoTNav/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 1072, in <module>
    main()
  File "/home/yuxin/CL_CoTNav/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 1057, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/yuxin/miniconda3/envs/cl_cotnav/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/home/yuxin/miniconda3/envs/cl_cotnav/lib/python3.10/site-packages/transformers/trainer.py", line 1836, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File "/home/yuxin/miniconda3/envs/cl_cotnav/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
    data = self._next_data()
  File "/home/yuxin/miniconda3/envs/cl_cotnav/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1346, in _next_data
    return self._process_data(data)
  File "/home/yuxin/miniconda3/envs/cl_cotnav/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1372, in _process_data
    data.reraise()
  File "/home/yuxin/miniconda3/envs/cl_cotnav/lib/python3.10/site-packages/torch/_utils.py", line 705, in reraise
    raise exception
KeyError: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/yuxin/miniconda3/envs/cl_cotnav/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
  File "/home/yuxin/miniconda3/envs/cl_cotnav/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 42, in fetch
    return self.collate_fn(data)
  File "/home/yuxin/miniconda3/envs/cl_cotnav/lib/python3.10/site-packages/transformers/trainer_utils.py", line 772, in __call__
    return self.data_collator(features)
  File "/home/yuxin/CL_CoTNav/InternVL/internvl_chat/internvl/train/dataset_packed.py", line 596, in packed_collate_fn
    data_index = feat.pop('data_index')
KeyError: 'data_index'

Even after I switched the dataset to the COCO example, training worked with data packing disabled. With packing enabled, the above error occurs.

Yuxin916 avatar Aug 29 '25 18:08 Yuxin916
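
The traceback shows packed_collate_fn calling feat.pop('data_index') on features that only carry input_ids, labels, attention_mask, position_ids, pixel_values and image_flags. A small diagnostic like the one below can help confirm which features reach the collate function without the expected key. This is a debugging sketch only, not a fix: report_missing_keys is a hypothetical helper, the sample features are made up, and the thread does not establish why data_index is missing.

```python
def report_missing_keys(features, required_keys=("data_index",)):
    """Return {feature position: [missing keys]} for features lacking required keys.

    Hypothetical diagnostic helper; it only inspects the dicts and never
    modifies them, so it can be called before the real collate function.
    """
    missing = {}
    for position, feat in enumerate(features):
        absent = [key for key in required_keys if key not in feat]
        if absent:
            missing[position] = absent
    return missing


# Made-up features: the first carries data_index, the second does not,
# mirroring the key set printed in the log above.
features = [
    {"input_ids": [1, 2], "labels": [1, 2], "data_index": 0},
    {"input_ids": [3], "labels": [3], "attention_mask": [1]},
]

problems = report_missing_keys(features)
for position, keys in problems.items():
    print(f"feature[{position}] is missing {keys}; present keys={sorted(features[position])}")
```

If every feature lacks the key, the samples are presumably coming from a dataset path that never attaches data_index, which would point at how the dataset is built rather than at the collate function itself.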

I'm encountering the same issue. Has it been resolved? Thanks.

Gaoyg avatar Sep 02 '25 11:09 Gaoyg

Unfortunately not yet. Still waiting for the development team.

Yuxin916 avatar Sep 02 '25 15:09 Yuxin916

This problem has been solved by using the new internvl_chat_gpt_oss with the following environment, on Ubuntu 20.04, for 2B SFT:

  • flash_attn==2.7.4.post1
  • ldd --version | head -n1 --> ldd (Ubuntu GLIBC 2.31-0ubuntu9.9) 2.31
  • torch==2.5.1+cu121
  • torchaudio==2.5.1+cu121
  • torchvision==0.20.1+cu121
  • transformers==4.55.0
  • deepspeed==0.14.4

Yuxin916 avatar Sep 22 '25 21:09 Yuxin916
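
To compare a local environment against the versions reported above, something like the following sketch may help. It uses importlib.metadata from the standard library. The helper names installed_versions and diff_versions are hypothetical, and the distribution names are assumed to match the names as written in the comment (for example, flash_attn may be registered under a slightly different name on some installs).

```python
from importlib import metadata

# Versions copied from the comment above.
REPORTED = {
    "flash_attn": "2.7.4.post1",
    "torch": "2.5.1+cu121",
    "torchaudio": "2.5.1+cu121",
    "torchvision": "0.20.1+cu121",
    "transformers": "4.55.0",
    "deepspeed": "0.14.4",
}


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def diff_versions(expected, actual):
    """Return {name: (expected, actual)} for every package that does not match."""
    return {
        name: (want, actual.get(name))
        for name, want in expected.items()
        if actual.get(name) != want
    }


mismatches = diff_versions(REPORTED, installed_versions(REPORTED))
for name, (want, have) in mismatches.items():
    print(f"{name}: reported {want}, installed {have}")
if not mismatches:
    print("Environment matches the reported versions.")
```

A mismatch does not by itself mean packing will fail; the list only records the one combination reported to work in this thread.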