ms-swift

Qwen3-VL-235B-A22B SFT OOM issue

Open · Li-Jicheng opened this issue 2 months ago · 6 comments

I'm using 8×8 A100 GPUs to fine-tune (SFT) the Qwen3-VL-235B-A22B-Instruct model, but I keep encountering out-of-memory (OOM) issues regardless of the settings. Could you please advise me on how to resolve this? I’ve set IMAGE_MAX_TOKEN_NUM to 256, and each query contains at most two images. Based on my estimate, the total number of prompt tokens—including image tokens—should not exceed 1024, given the 256-token limit per image. Thanks
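For reference, a quick back-of-the-envelope check of that token budget (assuming IMAGE_MAX_TOKEN_NUM caps each image at exactly 256 tokens, as set in the command below):

# Rough sanity check of the per-query prompt budget, not a measured value
IMAGES_PER_QUERY=2
IMAGE_MAX_TOKEN_NUM=256
IMAGE_TOKENS=$((IMAGES_PER_QUERY * IMAGE_MAX_TOKEN_NUM))
echo "image tokens per query: $IMAGE_TOKENS"                           # 512
echo "text budget within --max_length 1024: $((1024 - IMAGE_TOKENS))"  # 512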

# GQA limit: tensor_model_parallel_size=4
# sft command
PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
OMP_NUM_THREADS=14 \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NNODES=8 \
NPROC_PER_NODE=8 \
NODE_RANK=$NODE_RANK \
MASTER_ADDR=$MASTER_ADDR \
MASTER_PORT=$MASTER_PORT \
IMAGE_MAX_TOKEN_NUM=256 \
megatron sft \
    --load $MODEL_PATH \
    --dataset $DATA_PATH \
    --save $OUTPUT_DIR \
    --save_interval 300 \
    --max_epochs 10 \
    --tensor_model_parallel_size 4 \
    --pipeline_model_parallel_size 2 \
    --expert_model_parallel_size 8 \
    --expert_tensor_parallel_size 1 \
    --micro_batch_size 1 \
    --global_batch_size 16 \
    --sequence_parallel true \
    --moe_expert_capacity_factor 1 \
    --moe_permute_fusion true \
    --moe_grouped_gemm true \
    --moe_shared_expert_overlap true \
    --recompute_granularity full \
    --recompute_method uniform \
    --recompute_num_layers 1 \
    --finetune true \
    --cross_entropy_loss_fusion true \
    --lr 1e-5 \
    --lr_warmup_fraction 0.05 \
    --min_lr 1e-7 \
    --moe_aux_loss_coeff 1e-6 \
    --max_length 1024 \
    --attention_backend flash \
    --optimizer_cpu_offload true \
    --optimizer_offload_fraction 1 \
    --use_precision_aware_optimizer true \
    --packing true \
    --attn_impl flash_attn \
    --bf16 true \
    --num_workers 8 \
    --no_save_optim true \
    --no_save_rng true \
    --dataset_num_proc 8 \
    --freeze_vit true \
    --freeze_llm false \
    --freeze_aligner false \
    --split_dataset_ratio 0 \
    --log_interval 1

Li-Jicheng avatar Nov 12 '25 03:11 Li-Jicheng


Please try LoRA SFT.

slin000111 avatar Nov 12 '25 03:11 slin000111
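For reference, a minimal sketch of what the suggested LoRA run could look like, assuming the Megatron backend accepts the same tuner flags as swift sft (--train_type lora, --lora_rank, --lora_alpha, --target_modules); the flag names and values here are illustrative, so check the ms-swift docs for your version. Flags not shown are unchanged from the original command.

# Hypothetical LoRA variant; parallelism, freeze, and recompute flags as in the original command
megatron sft \
    --load $MODEL_PATH \
    --dataset $DATA_PATH \
    --save $OUTPUT_DIR \
    --train_type lora \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --tensor_model_parallel_size 4 \
    --pipeline_model_parallel_size 2 \
    --expert_model_parallel_size 8 \
    --micro_batch_size 1 \
    --global_batch_size 16 \
    --max_length 1024 \
    --bf16 true \
    --log_interval 1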


Thank you for your response. We intend to perform full-parameter SFT. We ran the same configuration on the Qwen3-235B-A22B model without issues; however, when using the Qwen3-VL-235B-A22B model we consistently encounter out-of-memory (OOM) errors. Do you have any recommendations to resolve this?

Li-Jicheng avatar Nov 12 '25 06:11 Li-Jicheng


When training the plain-text model there is already little GPU memory to spare, and the VL models consume even more. For the multimodal models, perhaps first run training on the 30B-A3B model and then evaluate whether full-parameter training of the 235B-A22B model is feasible.

slin000111 avatar Nov 12 '25 07:11 slin000111
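For reference, a rough sketch of that smoke test: the same launch with the 30B-A3B checkpoint swapped in on a single node. The model ID and parallel layout here are illustrative assumptions, not a verified recipe.

# Hypothetical single-node sanity run on the smaller MoE VL model before scaling up
MODEL_PATH=Qwen/Qwen3-VL-30B-A3B-Instruct   # illustrative model ID
NNODES=1 \
NPROC_PER_NODE=8 \
IMAGE_MAX_TOKEN_NUM=256 \
megatron sft \
    --load $MODEL_PATH \
    --dataset $DATA_PATH \
    --save $OUTPUT_DIR \
    --tensor_model_parallel_size 2 \
    --expert_model_parallel_size 4 \
    --micro_batch_size 1 \
    --global_batch_size 16 \
    --max_length 1024 \
    --freeze_vit true \
    --bf16 true \
    --log_interval 1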


Thank you for your response, but honestly my question hasn't been fully addressed. My goal is to run full-parameter fine-tuning (FFT) of the Qwen3-VL-235B model on 64 A100 (80 GB) GPUs. I'm wondering whether ms-swift supports this setup, since in theory memory shouldn't be a roadblock.

By the way, we've successfully run experiments on Qwen3-VL-30B without any issues, but that's beyond the scope of the current discussion.

Li-Jicheng avatar Nov 12 '25 07:11 Li-Jicheng

Try using tp4ep8pp8 or tp4ep8pp4.

Jintao-Huang avatar Nov 12 '25 07:11 Jintao-Huang
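In terms of the flags from the original command, the two suggested layouts map to roughly the following; this is just a naming sketch to splice into the command, not a sizing guarantee.

# tp4ep8pp8
    --tensor_model_parallel_size 4 \
    --expert_model_parallel_size 8 \
    --pipeline_model_parallel_size 8 \

# tp4ep8pp4
    --tensor_model_parallel_size 4 \
    --expert_model_parallel_size 8 \
    --pipeline_model_parallel_size 4 \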


Thanks for your reply. Based on your configuration, the total GPU requirement should be tp × ep × pp = 256 or 128 GPUs, correct? My setup is limited to 64 GPUs, which should be sufficient for full fine-tuning (FFT) of a 235B model.

I’ve also tested tp4ep4pp4, but the same OOM issue persists.

Li-Jicheng avatar Nov 12 '25 11:11 Li-Jicheng