About InternVL3.5 training
Really amazing work! The paper mentions that training was done with xtuner. Will the related training code be open-sourced?
Thank you for your interest. It will be open-sourced next month.
For now, you can fine-tune with the code we just released; we have verified that the training results are correct. You can refer to the GPT-OSS script. If you are training a Qwen3-based model, remember to change --conv_style "internvl3_5_gpt_oss" to --conv_style "internvl2_5".
For ordinary downstream-task fine-tuning, can I use the currently open-sourced InternVL3 scripts?
We have released our fine-tuning scripts; please refer to them.
Using the reference script and the official conda environment, I get the following error. How can I fix it?
[rank5]: File "/InternVL/internvl_chat_gpt_oss/internvl/utils/s3_config.py", line 90, in init
[rank5]: raise exception.ConfigFileNotFoundError(conf_path)
[rank5]: internvl.utils.s3_exception.ConfigFileNotFoundError: ConfigFileNotFoundError(/code/petreloss.conf)
[TCSLoader] config_path: petreloss.conf
We have fixed this issue; please pull the latest code and try again. petreloss.conf is used to configure object storage. The previous code had a bug in detecting whether object storage was enabled, and that has now been fixed.
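For illustration only, here is a minimal sketch of the kind of guard that fix implies. The function name is hypothetical and this is not the repo's actual code; it only assumes the USE_TCS_LOADER environment variable that the scripts later in this thread already set.

import os

def build_image_loader(conf_path="petreloss.conf"):
    # Hypothetical helper: only require the object-storage config when the
    # object-storage loader is actually enabled (e.g. via USE_TCS_LOADER=1).
    if os.environ.get("USE_TCS_LOADER", "0") != "1":
        return None  # plain local-file reading; petreloss.conf is not needed
    if not os.path.exists(conf_path):
        raise FileNotFoundError(f"object-storage config not found: {conf_path}")
    # construct the object-storage client from conf_path here (omitted in this sketch)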
The new version fixes the problem above, but during SFT on a downstream dataset the positional-embedding dimensions do not match:
File "/InternVL/internvl_chat_gpt_oss/internvl/patch/flash_sink_attn_monkey_patch.py", line 85, in _forward_gpt_oss_with_varlen
query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
File "/root/miniconda3/envs/internvl3.5/lib/python3.10/site-packages/transformers/models/gpt_oss/modeling_gpt_oss.py", line 222, in apply_rotary_pos_emb
q_embed = _apply_rotary_emb(q, cos, sin)
File "/root/miniconda3/envs/internvl3.5/lib/python3.10/site-packages/transformers/models/gpt_oss/modeling_gpt_oss.py", line 214, in apply_rotary_emb
first = first_half * cos - second_half * sin
RuntimeError: The size of tensor a (64) must match the size of tensor b (128) at non-singleton dimension 3
The training script is the same as the example provided:
torchrun \
  --node-rank=0 \
  --nnodes=1 \
  --nproc-per-node=${GPUS} \
  --master-addr=127.0.0.1 \
  --master-port=$MASTER_PORT \
  /InternVL/internvl_chat_gpt_oss/internvl/train/internvl_chat_finetune.py \
  --model_name_or_path ${model_path} \
  --conv_style "internvl2_5" \
  --use_fast_tokenizer False \
  --output_dir ${OUTPUT_DIR} \
  --meta_path ${meta_path} \
  --overwrite_output_dir True \
  --force_image_size 448 \
  --max_dynamic_patch 12 \
  --down_sample_ratio ${down_sample_ratio} \
  --drop_path_rate 0.0 \
  --min_num_frame 8 \
  --max_num_frame 32 \
  --freeze_llm False \
  --freeze_mlp False \
  --freeze_backbone True \
  --vision_select_layer -1 \
  --dataloader_num_workers 16 \
  --bf16 True \
  --max_steps ${max_steps} \
  --per_device_train_batch_size ${PER_DEVICE_BATCH_SIZE} \
  --gradient_accumulation_steps ${GRADIENT_ACC} \
  --save_strategy "steps" \
  --save_steps 9e99 \
  --save_total_limit 1 \
  --learning_rate ${lr} \
  --weight_decay 0.1 \
  --warmup_ratio 0.01 \
  --lr_scheduler_type "cosine" \
  --logging_steps 1 \
  --max_seq_length ${MAX_SEQ_LENGTH} \
  --split_annotations True \
  --do_train True \
  --grad_checkpoint True \
  --gradient_checkpointing True \
  --group_by_length False \
  --dynamic_image_size True \
  --use_thumbnail True \
  --ps_version 'v2' \
  --use_custom_flash_attn True \
  --run_name ${run_name} \
  --report_to "wandb" \
  --deepspeed "/InternVL/internvl_chat_gpt_oss/zero_stage3_config.json" \
  --use_packed_ds True \
  --num_images_expected 96 \
  --max_packed_tokens 32768 \
  --max_buffer_size 20 \
  --log_freq 1000 \
  --strict_mode False \
  --replacement True \
  --allow_overflow False \
  --remove_unused_columns False \
  --loss_reduction "square" \
  --seed 42 \
  2>&1 | tee "${OUTPUT_DIR}/training_log.txt"
Training the 4B model should not enter GPT-OSS's apply_rotary_pos_emb at all. You need to change --use_custom_flash_attn True to --use_custom_flash_attn False; this was also part of last night's PR. Please set it to False and try again.
File "/root/miniconda3/envs/internvl3.5/lib/python3.10/site-packages/transformers/models/gpt_oss/modeling_gpt_oss.py", line 222, in apply_rotary_pos_emb
@Weiyun1025 Hi, I'm running SFT fine-tuning with the script below on two A800 nodes, and it gets stuck in the language_model forward pass. Have you run into this before?

set -x
export MASTER_PORT=34235
export TF_CPP_MIN_LOG_LEVEL=3
export USE_TCS_LOADER=0
export LAUNCHER=pytorch

# Set the task name
CURRENT_PATH=$(pwd)
PROJECT_NAME=internvl3_5_30b_sft
TASK_NAME=$(basename "$0")
TASK_NAME="${TASK_NAME%.*}"
echo "TASK_NAME: $TASK_NAME"
echo "PROJECT_NAME: $PROJECT_NAME"

export OUTPUT_DIR=${CURRENT_PATH}/work_dirs/${PROJECT_NAME}/${TASK_NAME}
export TENSORBOARD_DIR=${OUTPUT_DIR}/tensorboard
export JOBLOG=${OUTPUT_DIR}/training.log

if [ ! -d "$OUTPUT_DIR" ]; then
  mkdir -p "$OUTPUT_DIR"
fi

NPROC_PER_NODE=${NPROC_PER_NODE:-8}
BATCH_SIZE=${BATCH_SIZE:-512}
PER_DEVICE_BATCH_SIZE=${PER_DEVICE_BATCH_SIZE:-1}
GRADIENT_ACC=$((BATCH_SIZE / PER_DEVICE_BATCH_SIZE / WORLD_SIZE / NPROC_PER_NODE))

export PYTHONPATH="${PYTHONPATH}:$(pwd)"
export TRITON_CACHE_DIR="/dev/shm/triton_wwy/"
export VLLM_CACHE_ROOT="/dev/shm/vllmca_wwy/"

export MASTER_PORT=34229
export TF_CPP_MIN_LOG_LEVEL=3
export LAUNCHER=pytorch
torchrun \
  --node-rank=$RANK \
  --nnodes=$WORLD_SIZE \
  --nproc-per-node=$NPROC_PER_NODE \
  --master-addr=$MASTER_ADDR \
  --master-port=$MASTER_PORT \
  internvl/train/internvl_chat_finetune.py \
  --model_name_or_path "/mnt/internvl/InternVL3_5-30B-A3B-Instruct" \
  --conv_style "internvl2_5" \
  --use_fast_tokenizer False \
  --output_dir ${OUTPUT_DIR} \
  --meta_path "${CURRENT_PATH}/shell/data/debug_sft.json" \
  --overwrite_output_dir True \
  --force_image_size 448 \
  --max_dynamic_patch 12 \
  --down_sample_ratio 0.5 \
  --drop_path_rate 0.1 \
  --min_num_frame 8 \
  --max_num_frame 32 \
  --freeze_llm False \
  --freeze_mlp False \
  --freeze_backbone False \
  --vision_select_layer -1 \
  --dataloader_num_workers 16 \
  --bf16 True \
  --max_steps 8000 \
  --per_device_train_batch_size ${PER_DEVICE_BATCH_SIZE} \
  --gradient_accumulation_steps ${GRADIENT_ACC} \
  --save_strategy "steps" \
  --save_steps 2 \
  --save_total_limit 100 \
  --learning_rate 8e-5 \
  --weight_decay 0.05 \
  --warmup_ratio 0.03 \
  --lr_scheduler_type "cosine" \
  --logging_steps 1 \
  --max_seq_length 32768 \
  --split_annotations True \
  --do_train True \
  --do_eval False \
  --grad_checkpoint True \
  --gradient_checkpointing True \
  --group_by_length False \
  --dynamic_image_size True \
  --use_thumbnail True \
  --ps_version 'v2' \
  --use_custom_flash_attn False \
  --report_to "tensorboard" \
  --deepspeed "zero_stage3_config.json" \
  --use_packed_ds True \
  --num_images_expected 96 \
  --max_packed_tokens 32768 \
  --max_buffer_size 20 \
  --log_freq 1000 \
  --strict_mode False \
  --replacement True \
  --allow_overflow False \
  --remove_unused_columns False \
  --loss_reduction "square" \
  --seed 42 \
  2>&1 | tee -a "${OUTPUT_DIR}/training_log.txt"
enter language_model
enter language_model
enter language_model
enter language_model
enter language_model
enter language_model
enter language_model
total_samples=1, num_samples=1, num_padding_tokens=0, num_padding_images=0, num_effective_tokens=11
enter language_model
How long did it stay stuck? This codebase has not been optimized for MoE, so training is genuinely slow (the 30B MoE runs at roughly the speed of the 38B dense model), and it may not actually be stuck.
Thanks, after changing --use_custom_flash_attn to False, training now runs normally. But it seems to take much longer: compared with InternVL2.5-4B, training is about 6x slower.
@Weiyun1025 It was stuck for about 30 minutes, then NCCL timed out:
[rank3]:[E902 12:24:58.761850774 ProcessGroupNCCL.cpp:632] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=77419, OpType=_ALLGATHER_BASE, NumelIn=98304, NumelOut=1572864, Timeout(ms)=1800000) ran for 1800024 milliseconds before timing out.
[rank8]:[E902 12:24:58.757408621 ProcessGroupNCCL.cpp:632] [Rank 8] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=77419, OpType=_ALLGATHER_BASE, NumelIn=98304, NumelOut=1572864, Timeout(ms)=1800000) ran for 1800024 milliseconds before timing out.
[rank3]:[E902 12:24:58.761992657 ProcessGroupNCCL.cpp:2268] [PG ID 0 PG GUID 0(default_pg) Rank 3] failure detected by watchdog at work sequence id: 77419 PG status: last enqueued work: 77421, last completed work: 77418
[rank3]:[E902 12:24:58.762005742 ProcessGroupNCCL.cpp:670] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank3]:[E902 12:24:58.762044717 ProcessGroupNCCL.cpp:2103] [PG ID 0 PG GUID 0(default_pg) Rank 3] First PG on this rank to signal dumping.
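One knob that may help while diagnosing: the watchdog window in the log above is the 30-minute default (Timeout(ms)=1800000), and Hugging Face's TrainingArguments exposes it as ddp_timeout (in seconds). Assuming the finetune script parses standard TrainingArguments fields, a larger value could be passed as --ddp_timeout on the command line; a minimal Python sketch of the setting itself:

from transformers import TrainingArguments

# Raise the distributed timeout from the 30-minute default to 2 hours so that a
# legitimately slow (e.g. MoE) iteration is not killed by the NCCL watchdog.
# This only buys time; it does not fix a genuine hang.
args = TrainingArguments(
    output_dir="./work_dirs/debug",  # placeholder path
    ddp_timeout=7200,
)
print(args.ddp_timeout)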
Regarding the 6x slowdown: the 4B model should not train slower. Are you comparing the per-iteration time of VL3.5 with packing enabled against VL2.5 without packing?
Yes.
You need to compare TGS (tokens per GPU per second). With packing, each iteration processes far more samples and tokens, so per-iteration time cannot be compared directly.
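As a rough illustration of that comparison (all numbers below are made up, not measurements from this thread):

def tgs(tokens_per_gpu_per_iter, iter_seconds):
    # Tokens per GPU per second for a single training iteration.
    return tokens_per_gpu_per_iter / iter_seconds

# Unpacked: a short sequence of ~4k tokens per GPU in 1.5 s per iteration.
print(tgs(4096, 1.5))    # ~2731 TGS
# Packed: up to --max_packed_tokens (32768) tokens per GPU; slower per iteration
# but more throughput per GPU-second.
print(tgs(32768, 9.0))   # ~3641 TGS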
I'm comparing the total time to train one epoch on the same training data. Previously, 2.5-4B took 4h33min; now it takes 24h, on eight A800s. The difference is that I now set one epoch via --max_steps, computed as max_steps = (num_samples // batch_size) * num_epochs. Is there something wrong with my calculation?
In that case you have probably run more than one epoch. The per-device batch size is 1, but each packed sequence contains more than one sample, so you cannot compute it that simply. You can use the log at this point in our code to estimate how many samples each iteration consumes, then work backwards from that value to the right max_steps.
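For example, a back-of-the-envelope version of that estimate (every number below is a placeholder; the average samples per packed sequence has to come from your own total_samples/num_samples log lines):

import math

num_samples = 200_000           # size of the fine-tuning set (placeholder)
num_epochs = 1
world_size = 16                 # 2 nodes x 8 GPUs, as in the script above
grad_acc = 32                   # --gradient_accumulation_steps (placeholder)
avg_samples_per_seq = 6.5       # estimated from the packing log (placeholder)

# With --per_device_train_batch_size 1, each rank consumes one packed sequence
# per micro-step, so one optimizer step consumes:
samples_per_step = avg_samples_per_seq * world_size * grad_acc
max_steps = math.ceil(num_samples * num_epochs / samples_per_step)
print(max_steps)                # ~61 optimizer steps per epoch under these assumptions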
Got it, thank you.
Hello,
Which of the logged values represents the number of samples processed in each iteration: total_samples, num_samples, num_padding_tokens, num_padding_images, or num_effective_tokens?
I have one more question: with split_annotations = True, how can I calculate the number of iterations per epoch? Could you please clarify this code?
# Recalculate the num iterations
if data_args.split_annotations:
    num_iter = total_length // (training_args.per_device_batch_size * training_args.gradient_accumulation_steps) * training_args.num_train_epochs
else:
    num_iter = total_length // data_args.global_batch_size * training_args.num_train_epochs
where global_batch_size = per_device_batch_size * world_size * gradient_accumulation_steps
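For concreteness, plugging placeholder values into the two quoted branches (these numbers are illustrative only, not from the repo):

total_length = 100_000               # annotations after splitting (placeholder)
per_device_batch_size = 1
gradient_accumulation_steps = 32
world_size = 16
num_train_epochs = 1
global_batch_size = per_device_batch_size * world_size * gradient_accumulation_steps  # 512

# split_annotations == True branch: world_size is not in the denominator
num_iter_split = total_length // (per_device_batch_size * gradient_accumulation_steps) * num_train_epochs
# split_annotations == False branch: divide by the full global batch size
num_iter_full = total_length // global_batch_size * num_train_epochs

print(num_iter_split, num_iter_full)  # 3125 vs. 195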
I also ran into this NCCL timeout. Did you manage to solve it?
I also hit this with two-node training: it gets stuck after the feed-forward pass and then NCCL times out. Is there a fix?
I also get an NCCL timeout error when fine-tuning the MoE model.
I also get an NCCL timeout error when fine-tuning InternVL3.0.
I hit a similar hang when training the 30B-A3B MoE model; GPU utilization on every GPU is at 100% right before it hangs. https://github.com/OpenGVLab/InternVL/issues/1193