Qwen3-VL-235B-A22B SFT OOM issue
I'm using 8×8 A100 GPUs (8 nodes × 8 GPUs) to fine-tune (SFT) the Qwen3-VL-235B-A22B-Instruct model, but I keep hitting out-of-memory (OOM) errors no matter which settings I try. Could you please advise me on how to resolve this? I've set IMAGE_MAX_TOKEN_NUM to 256, and each query contains at most two images, so by my estimate the total number of prompt tokens, including image tokens, should not exceed 1024. Thanks
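For reference, here is a minimal back-of-the-envelope check of that estimate as a sketch: the 256-token cap per image and the two-image limit come from the setup described above, while the split of the remaining budget between text and image tokens is only illustrative.

# Rough per-sample token budget under the assumptions stated above (sketch, not measured).
IMAGE_MAX_TOKEN_NUM = 256     # cap on visual tokens per image, as set in the launch env
MAX_IMAGES_PER_QUERY = 2      # at most two images per query
MAX_LENGTH = 1024             # matches --max_length in the command below

image_tokens = IMAGE_MAX_TOKEN_NUM * MAX_IMAGES_PER_QUERY   # 512
text_budget = MAX_LENGTH - image_tokens                     # 512 tokens left for text
print(f"image tokens: {image_tokens}, remaining text budget: {text_budget}")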
# GQA limit: tensor_model_parallel_size=4
# sft command
PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
OMP_NUM_THREADS=14 \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NNODES=8 \
NPROC_PER_NODE=8 \
NODE_RANK=$NODE_RANK \
MASTER_ADDR=$MASTER_ADDR \
MASTER_PORT=$MASTER_PORT \
IMAGE_MAX_TOKEN_NUM=256 \
megatron sft \
    --load $MODEL_PATH \
    --dataset $DATA_PATH \
    --save $OUTPUT_DIR \
    --save_interval 300 \
    --max_epochs 10 \
    --tensor_model_parallel_size 4 \
    --pipeline_model_parallel_size 2 \
    --expert_model_parallel_size 8 \
    --expert_tensor_parallel_size 1 \
    --micro_batch_size 1 \
    --global_batch_size 16 \
    --sequence_parallel true \
    --moe_expert_capacity_factor 1 \
    --moe_permute_fusion true \
    --moe_grouped_gemm true \
    --moe_shared_expert_overlap true \
    --recompute_granularity full \
    --recompute_method uniform \
    --recompute_num_layers 1 \
    --finetune true \
    --cross_entropy_loss_fusion true \
    --lr 1e-5 \
    --lr_warmup_fraction 0.05 \
    --min_lr 1e-7 \
    --moe_aux_loss_coeff 1e-6 \
    --max_length 1024 \
    --attention_backend flash \
    --optimizer_cpu_offload true \
    --optimizer_offload_fraction 1 \
    --use_precision_aware_optimizer true \
    --packing true \
    --attn_impl flash_attn \
    --bf16 true \
    --num_workers 8 \
    --no_save_optim true \
    --no_save_rng true \
    --dataset_num_proc 8 \
    --freeze_vit true \
    --freeze_llm false \
    --freeze_aligner false \
    --split_dataset_ratio 0 \
    --log_interval 1
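For context, here is a minimal sketch of the process grid this launch implies, assuming a standard Megatron-style decomposition in which the world size factors as tp × pp × dp and expert parallelism shards the MoE experts within the data-parallel/tensor-parallel ranks rather than adding GPUs.

# Process-grid arithmetic implied by the launch above (sketch, assumptions as stated).
nnodes, nproc_per_node = 8, 8
world_size = nnodes * nproc_per_node              # 64 GPUs in total

tp, pp = 4, 2                                     # --tensor/--pipeline_model_parallel_size
dp = world_size // (tp * pp)                      # 8-way data parallelism

micro_bs, global_bs = 1, 16                       # --micro_batch_size / --global_batch_size
grad_accum = global_bs // (micro_bs * dp)         # 2 micro-batches per optimizer step
print(f"world_size={world_size}, dp={dp}, grad_accum={grad_accum}")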
Please try LoRA SFT.
Thank you for your response. We intend to perform full-parameter SFT. We ran the same configuration on the Qwen3-235B-A22B model without issues; however, when using the Qwen3-VL-235B-A22B model we consistently encounter out-of-memory (OOM) errors. Do you have any recommendations to resolve this?
Even when training the plain-text model there may be little GPU memory headroom left, and the VL model consumes more memory on top of that. For the multimodal models, perhaps you could first run training on the 30B-A3B model and then evaluate whether full-parameter training of the 235B-A22B model is feasible.
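If it helps to quantify how much headroom the text-only run actually leaves, a small probe along the lines of the sketch below (plain torch.cuda counters; train_step is a hypothetical placeholder for the real training loop) can be dropped in to compare the plain-text and VL runs on the same hardware.

import torch

def log_gpu_memory(tag: str) -> None:
    # Report peak allocated/reserved memory against the device's total capacity.
    dev = torch.cuda.current_device()
    total = torch.cuda.get_device_properties(dev).total_memory
    allocated = torch.cuda.max_memory_allocated(dev)
    reserved = torch.cuda.max_memory_reserved(dev)
    gib = 1024 ** 3
    print(f"[{tag}] peak allocated {allocated / gib:.1f} GiB, "
          f"peak reserved {reserved / gib:.1f} GiB, total {total / gib:.1f} GiB")

# Example usage around one training step (train_step is a placeholder):
# torch.cuda.reset_peak_memory_stats()
# train_step(batch)
# log_gpu_memory("after first VL step")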
Thank you for your response, but honestly my question hasn't been fully addressed. My goal is to full-parameter fine-tune (FFT) the Qwen3-VL-235B model on 64 A100 (80 GB) GPUs, and I'm wondering whether ms-swift supports this setup, since in theory memory shouldn't be a roadblock.
By the way, we've successfully run experiments on Qwen3-VL-30B without any issues, but that's beyond the scope of the current discussion.
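As a rough sanity check of the "memory shouldn't be a roadblock" intuition, here is a back-of-the-envelope sketch. The dense/expert parameter split, the bf16 gradient assumption, and the neglect of activations, buffers, and CUDA overhead are all assumptions, so treat the numbers as an optimistic lower bound rather than a prediction.

# Rough per-GPU weight + gradient memory under the tp4/pp2/ep8 layout from the
# command above (sketch; the 10B/225B dense/expert split is a guess, not the
# model's exact breakdown, and fp32 main grads would roughly double the grad term).
total_params = 235e9
expert_params = 225e9              # assumed bulk of the MoE expert weights
dense_params = total_params - expert_params

tp, pp, ep = 4, 2, 8               # parallel sizes from the original command
bytes_per_param = 2 + 2            # bf16 weights + bf16 grads; optimizer states on CPU

gib = 1024 ** 3
dense_per_gpu = dense_params / (tp * pp) * bytes_per_param / gib
expert_per_gpu = expert_params / (ep * pp) * bytes_per_param / gib   # expert_tensor_parallel_size = 1
print(f"~{dense_per_gpu:.0f} GiB dense + {expert_per_gpu:.0f} GiB expert per GPU, "
      f"before activations and overhead, on an 80 GiB A100")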
Try using tp4ep8pp8 or tp4ep8pp4.
Thanks for your reply. Based on your configuration, the total GPU requirement should be tp × ep × pp = 256 or 128 GPUs, correct? My setup is limited to 64 GPUs, which should be sufficient for full fine-tuning (FFT) of a 235B model.
I’ve also tested tp4ep4pp4, but the same OOM issue persists.
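For what it's worth, here is a minimal sketch of the GPU-count arithmetic for those layouts, under the assumption that ms-swift follows Megatron-Core's grouping, where the world size factors as tp × pp × dp and the expert-parallel groups are carved out of the dp × tp ranks rather than multiplying the GPU requirement; the framework's own validation should be treated as authoritative.

# GPU layout arithmetic for the suggested parallel configs (sketch under the
# assumption stated above; verify against the actual ms-swift/Megatron checks).
world_size = 64                    # 8 nodes x 8 A100s

for tp, ep, pp in [(4, 8, 8), (4, 8, 4), (4, 4, 4)]:
    if world_size % (tp * pp):
        print(f"tp{tp}ep{ep}pp{pp}: tp*pp does not divide {world_size} GPUs")
        continue
    dp = world_size // (tp * pp)
    # With expert_tensor_parallel_size = 1, experts are sharded across an
    # expert-parallel group drawn from the dp * tp ranks, so ep must divide dp * tp.
    fits = (dp * tp) % ep == 0
    print(f"tp{tp}ep{ep}pp{pp}: dp={dp}, ep={ep} fits within dp*tp={dp * tp}: {fits}")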