A Question Regarding Resuming InternVL Training
Hi, I am currently training InternVL with xtuner. However, I have run into an issue with resuming training, and I would greatly appreciate your assistance.
Specifically, I am running distributed training on a SLURM cluster. Due to resource constraints, I can only allocate a few hours per job. Consequently, I need to resume training multiple times using checkpoint files from the .pth folder (e.g., mp_rank_00_model_states.pt). Unfortunately, each resume operation incurs a substantial delay during the “mmengine - WARNING - Advance dataloader 14000 steps to skip data that has already been trained” phase.
Could you please advise whether there is any procedure or configuration setting that avoids this lengthy skipping process without compromising training performance?
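For reference, this is the direction I was considering: skipping at the sampler level (pure index arithmetic, no data loaded) instead of advancing the dataloader batch by batch, which pays the full image-loading and preprocessing cost for every discarded sample. Below is a minimal sketch in plain PyTorch, untested against xtuner; `SkipAheadSampler` is a hypothetical name, not an existing xtuner/mmengine class, and a real version would also need to reproduce the seeded shuffle order so the skipped indices match what was actually trained:

```python
# Sketch: resume-by-index instead of resume-by-iteration.
# Dropping indices in the sampler is effectively free, whereas
# advancing the dataloader calls next() and loads every skipped batch.
from torch.utils.data import DataLoader, Dataset, Sampler


class SkipAheadSampler(Sampler):
    """Yield sequential indices, dropping the first `skip` samples.

    Hypothetical helper for illustration only. `skip` would be
    resumed_iters * batch_size. A production version must apply the
    same seeded permutation the original run used before slicing.
    """

    def __init__(self, dataset_len: int, skip: int = 0):
        self.dataset_len = dataset_len
        self.skip = skip

    def __iter__(self):
        return iter(range(self.skip, self.dataset_len))

    def __len__(self):
        return self.dataset_len - self.skip


class ToyDataset(Dataset):
    def __len__(self):
        return 100_000

    def __getitem__(self, idx):
        # A real dataset would decode images here; that work is
        # exactly what makes data-level skipping so slow on resume.
        return idx


resumed_iters, batch_size = 14_000, 4
loader = DataLoader(
    ToyDataset(),
    batch_size=batch_size,
    sampler=SkipAheadSampler(100_000, skip=resumed_iters * batch_size),
)
print(next(iter(loader)))  # first batch starts at index 56000, instantly
```

The catch is that, judging by the warning quoted above, mmengine's resume path still calls next() on the dataloader once per skipped step regardless of the sampler, so this trick only helps if that advance loop can be bypassed, which is exactly the part I am unsure how to configure.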
Same question. Do you have any solution? :)