Severe performance degradation during Mistral fine-tuning
I’m working with the GRITLM repository. And I'm training Mistral 7B and evaluating performance on the MTEB benchmark with NVIDIA RTX A6000. I first tested the pretrained mistralai/Mistral-7B-v0.1 model directly, and the results aligned well with the reported values in your paper.
However, when I fine-tuned Mistral with the training.run script using the --mode unified setting and evaluated intermediate checkpoints (step 270 and step 950 out of 1253 total), I observed a drastic performance drop across all tasks (see attached image). The average scores on the classification, clustering, and reranking tasks are far below expectations.
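For reference, this is roughly how I evaluate each checkpoint. The task list and paths are only illustrative, and since the run uses LoRA I first merge the adapter into the base model (see the sketch after the training command below):

```python
# Rough sketch of my checkpoint evaluation (illustrative paths, one example task).
# Assumes the checkpoint has already been merged into a standalone model directory.
import mteb
from gritlm import GritLM

model = GritLM(
    "finetuned/m7_nodes1_250604/merged-950",  # hypothetical merged checkpoint path
    mode="embedding",
    pooling_method="mean",  # matches --pooling_method mean from training
    torch_dtype="auto",
)

evaluation = mteb.MTEB(tasks=["Banking77Classification"])  # example classification task
evaluation.run(model, output_folder="results/merged-950")
```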
Here is my training command:
```bash
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
CUDA_VISIBLE_DEVICES=4 \
WANDB_MODE=online \
torchrun --nproc_per_node=1 -m training.run \
  --output_dir finetuned/m7_nodes1_250604 \
  --model_name_or_path mistralai/Mistral-7B-v0.1 \
  --train_data training/data_instruct \
  --learning_rate 2e-5 \
  --lr_scheduler_type linear \
  --warmup_ratio 0.03 \
  --max_steps 1253 \
  --per_device_train_batch_size 16 \
  --gradient_accumulation_steps 64 \
  --per_device_generative_bs 4 \
  --dataloader_num_workers 8 \
  --dataloader_drop_last \
  --normalized \
  --temperature 0.2 \
  --train_group_size 2 \
  --negatives_cross_device \
  --query_max_len 256 \
  --passage_max_len 2048 \
  --mode unified \
  --logging_steps 1 \
  --bf16 \
  --pooling_method mean \
  --use_unique_indices \
  --loss_gen_type mixed \
  --attn bbcc \
  --attn_implementation sdpa \
  --gradient_checkpointing \
  --save_steps 10 \
  --lora
```
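Because the run uses --lora, the intermediate checkpoints only contain PEFT adapter weights, so before evaluating I merge them into the base model along these lines (checkpoint and output paths are illustrative):

```python
# Merge a LoRA adapter checkpoint into the base Mistral weights for evaluation.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", torch_dtype=torch.bfloat16
)
adapter_dir = "finetuned/m7_nodes1_250604/checkpoint-950"  # hypothetical checkpoint path
merged = PeftModel.from_pretrained(base, adapter_dir).merge_and_unload()

out_dir = "finetuned/m7_nodes1_250604/merged-950"
merged.save_pretrained(out_dir)
AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1").save_pretrained(out_dir)
```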
I’m wondering:
- Is this level of degradation expected at early checkpoints?
- Could the --loss_gen_type mixed or any other flag cause instability this early?
- Are there any best practices for checkpoint selection / early stopping, or warmup settings I may have misconfigured?
Any insights or suggestions would be greatly appreciated!
Thanks in advance.