Finetune BGE-M3
How can I finetune the dense and sparse embeddings only? I tried to use this script:
%%bash
torchrun --nproc_per_node 1 \
-m FlagEmbedding.finetune.embedder.encoder_only.m3 \
--model_name_or_path /home/alex/ejada/developers/martina/my_cache/models--BAAI--bge-m3 \
--cache_dir ./cache/model \
--train_data ./ft_data/training.json \
--train_group_size 4 \
--query_max_len 256 \
--passage_max_len 256 \
--pad_to_multiple_of 4 \
--query_instruction_for_retrieval 'Represent this sentence for searching relevant passages: ' \
--query_instruction_format '{}{}' \
--knowledge_distillation False \
--output_dir ./test_encoder \
--learning_rate 1e-5 \
--num_train_epochs 1 \
--per_device_train_batch_size 1 \
--dataloader_drop_last True \
--warmup_ratio 0.1 \
--logging_steps 1 \
--save_steps 1000 \
--negatives_cross_device \
--temperature 0.02 \
--sentence_pooling_method cls \
--normalize_embeddings True \
--kd_loss_type m3_kd_loss \
--unified_finetuning True \
--use_self_distill True \
--fix_encoder True \
--colbert_dim 0 \
--self_distill_start_step 0
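For context, the file passed via --train_data is, as far as I understand, expected to be in the standard FlagEmbedding JSONL finetuning format: one JSON object per line with a query, positive passages, and negative passages (with --train_group_size 4, each training example uses one positive and three negatives). Below is a minimal sketch of writing such a file; the example texts are placeholders, not real training data.

import json

# Hypothetical examples illustrating the assumed JSONL schema: "query", "pos", "neg".
examples = [
    {
        "query": "what is dense retrieval?",
        "pos": [
            "Dense retrieval encodes queries and passages into vectors and ranks by similarity."
        ],
        "neg": [
            "Sparse retrieval relies on lexical matching such as BM25.",
            "ColBERT uses late interaction over token-level embeddings.",
            "Rerankers score query-passage pairs with a cross-encoder.",
        ],
    },
]

# Write one JSON object per line (JSONL), matching the path used in the script above.
with open("./ft_data/training.json", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")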
The ColBERT vectors and sparse embeddings are finetuned together. If you want to drop the ColBERT vectors, you need to remove the corresponding code in the finetune module.
Thanks for your answer. It works. I updated the loss function as follows.
Before:
- return dense_scores + 0.3 * sparse_scores + colbert_scores
- loss = (loss + ensemble_loss + 0.1 * sparse_loss + colbert_loss) / 4
- loss += (dense_self_distill_loss + 0.1 * sparse_self_distill_loss + colbert_self_distill_loss) / 3
After:
- return dense_scores + 0.3 * sparse_scores
- loss = (loss + ensemble_loss + 0.1 * sparse_loss) / 3
- loss += (dense_self_distill_loss + 0.1 * sparse_self_distill_loss) / 2
Is that valid, or is there a better equation? Should I reduce the weight on sparse_scores?