
Loss=0 and grad_norm=nan when fine-tuning llava-v1.5-7b using dpo.sh

Open · Na-nata opened this issue 1 year ago · 2 comments

Why do I get 'loss': 0.0 and 'grad_norm': tensor(nan, device='cuda:0', dtype=torch.float64) when fine-tuning llava-v1.5-7b with the DPO code from the LLaVA-NeXT repository? I have verified that my training dataset is fine. (Screenshot of the training log omitted.) My training script is below, after a short sketch of the objective for context.
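For reference, here is a minimal sketch of the standard DPO objective (Rafailov et al., 2023) that the --beta flag below corresponds to. This is not the repository's train_dpo.py implementation, and it does not show how dpo_alpha and gamma are used there; treat it only as an aid for interpreting the logged loss.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Each argument: summed per-sequence log-probs, shape (batch,).
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    logits = beta * (pi_logratios - ref_logratios)
    # -logsigmoid(x) is exactly 0.0 as x -> +inf, so a logged loss of 0.0
    # usually means the logits (or the log-probs feeding them) went
    # non-finite, not that the model reached a perfect preference margin.
    return -F.logsigmoid(logits).mean()

The training script: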

export OMP_NUM_THREADS=8
export NCCL_IB_DISABLE=0
export NCCL_IB_GID_INDEX=3
export NCCL_IB_HCA=${ARNOLD_RDMA_DEVICE}
export NCCL_SOCKET_IFNAME=lo
export NCCL_DEBUG=INFO

VISION_MODEL_VERSION="openai/clip-vit-large-patch14-336"
VISION_MODEL_VERSION_CLEAN="${VISION_MODEL_VERSION////_}"
MID_RUN_NAME="llava-1.5-7b-dpo-v1"

############### Pretrain ################

# Stage 2
PROMPT_VERSION="v1"

# torchrun --nproc_per_node="${ARNOLD_WORKER_GPU}" --nnodes="${ARNOLD_WORKER_NUM}" --node_rank="${ARNOLD_ID}" --master_addr="${METIS_WORKER_0_HOST}" --master_port="${port_in_cmd}"
ACCELERATE_CPU_AFFINITY=1 torchrun --nproc_per_node=4 --nnodes=1 --node_rank="${RANK}" --master_addr=30.246.96.60 --master_port=23456 \
    llava/train/train_dpo.py \
    --deepspeed scripts/zero3.json \
    --model_name_or_path "/model_weight/liuhaotian--llava-v1.5-7b.main.4481d270cc22fd5c4d1bb5df129622006ccd9234" \
    --version $PROMPT_VERSION \
    --dpo_alpha 1.0 --beta 0.1 --gamma 0 \
    --data_path=processed_data \
    --mm_tunable_parts="mm_vision_tower,mm_mlp_adapter,mm_language_model" \
    --vision_tower ${VISION_MODEL_VERSION} \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --mm_spatial_pool_stride 2 \
    --mm_resampler_type "spatial_pool" \
    --mm_spatial_pool_out_channels 1024 \
    --group_by_modality_length True \
    --image_aspect_ratio pad \
    --bf16 True \
    --run_name $MID_RUN_NAME \
    --output_dir "llava1_5_dpo/${MID_RUN_NAME}" \
    --num_train_epochs 1 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 3000 \
    --save_total_limit 1 \
    --learning_rate 5e-7 \
    --weight_decay 0. \
    --warmup_ratio 0.1 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 32768 \
    --gradient_checkpointing True \
    --dataloader_num_workers 16 \
    --lazy_preprocess True \
    --report_to "none" \
    --torch_compile True \
    --torch_compile_backend "inductor" \
    --dataloader_drop_last True
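One way to narrow this down (a hedged debugging sketch, not part of the repository) is to find the first module whose forward output goes non-finite, rather than only observing grad_norm=nan at the optimizer step. install_nonfinite_checks is a hypothetical helper; model is assumed to be the policy model constructed in train_dpo.py:

import torch

def install_nonfinite_checks(model):
    # Raise as soon as any module's forward output contains NaN/Inf,
    # so the failure is attributed to a specific layer.
    def make_hook(name):
        def hook(module, inputs, output):
            outs = output if isinstance(output, (tuple, list)) else (output,)
            for t in outs:
                if torch.is_tensor(t) and t.is_floating_point() \
                        and not torch.isfinite(t).all():
                    raise RuntimeError(f"non-finite forward output in {name}")
        return hook
    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

While debugging, it may also help to disable --torch_compile (compiled graphs can interfere with hooks) and to run with torch.autograd.set_detect_anomaly(True), which reports the first backward op that produces NaN. Separately, --model_max_length 32768 is well beyond the 4096-token context of the Vicuna-v1.5 base behind llava-v1.5-7b, so that setting is worth double-checking as well.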

Na-nata · Dec 05 '24

Have you solved this?

weiaicunzai · Jan 06 '25

@Na-nata and @weiaicunzai, were you able to resolve this issue?

qm-intel · Mar 24 '25