Clarification on train_group_size and GPU Utilization for Negative Samples in Latest Version
I am currently attempting to fine-tune bge-m3 and have been referring to the following documentation: https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune/embedder#2-bge-m3
- The default value for train_group_size was previously set to 2 but has now been increased to 8 in the current version of the code. What is the reasoning behind this change, and what potential benefits or drawbacks should I be aware of?
- Additionally, could you confirm whether the current implementation utilizes all available GPUs and considers all passages within the batch as negative samples by default? I want to ensure that my understanding is correct.
Thank you!
- During fine-tuning, the number of hard negatives for each query is train_group_size - 1, so a larger train_group_size is better.
- During fine-tuning, all in-batch passages across all GPUs are used as negatives.
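To make the arithmetic concrete, here is a rough back-of-the-envelope sketch (illustrative only, not FlagEmbedding's actual code; the helper name is made up) of how many negatives each query sees when negatives_cross_device is enabled:

```python
# Illustrative only (not FlagEmbedding code): a rough count of negatives per
# query, assuming an InfoNCE-style loss over all passages gathered from every
# GPU when negatives_cross_device=True.
def negatives_per_query(train_group_size, per_device_train_batch_size, num_gpus):
    hard_negatives = train_group_size - 1                # from the query's own "neg" column
    passages_per_device = per_device_train_batch_size * train_group_size
    total_passages = passages_per_device * num_gpus      # gathered across all GPUs
    total_negatives = total_passages - 1                 # everything except the query's own positive
    return {
        "hard_negatives": hard_negatives,
        "in_batch_negatives": total_negatives - hard_negatives,
        "total_negatives": total_negatives,
    }

# e.g. train_group_size=8, per_device_train_batch_size=16, 8 GPUs:
print(negatives_per_query(8, 16, 8))
# {'hard_negatives': 7, 'in_batch_negatives': 1016, 'total_negatives': 1023}
```

Under these assumptions, raising train_group_size mainly increases the share of hard negatives within the (much larger) pool of in-batch negatives.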
Completely understood! Thank you very much.
Hi @545999961, consider the following example:
Query A: Has hard negative passages defined in the "neg" column (e.g., a specific passage known to be challenging).
Query B: Its positive passage is used as an in-batch negative for Query A.
Does the script combine both the hard negative from Query A’s "neg" column and the in-batch negative from Query B to form the complete set of negative samples for Query A? Or does it prioritize one over the other (e.g., using only the positive passage from Query B or using only the hard negatives from Query A)?
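For context, here is a minimal sketch (assuming an InfoNCE-style loss over all gathered passages; this is not the repository's actual implementation) of how both kinds of negatives can land in a single score matrix:

```python
# Minimal sketch (not the actual FlagEmbedding code) of an InfoNCE-style loss
# in which every query is scored against every passage in the gathered batch,
# so a query's own hard negatives and other queries' positives both act as
# negatives.
import torch
import torch.nn.functional as F

def contrastive_loss(q_emb, p_emb, group_size, temperature=0.02):
    # q_emb: (num_queries, dim); p_emb: (num_queries * group_size, dim), with
    # each query's passages laid out as [pos, neg_1, ..., neg_{group_size-1}].
    scores = q_emb @ p_emb.T / temperature                        # (num_queries, num_passages)
    # Query i's positive sits at column i * group_size.
    target = torch.arange(q_emb.size(0), device=q_emb.device) * group_size
    # Cross-entropy pushes down every other column: the query's own hard
    # negatives AND all passages belonging to other queries.
    return F.cross_entropy(scores, target)
```

In this layout, both the hard negative from Query A's "neg" column and Query B's positive appear as negative logits for Query A, rather than one being used to the exclusion of the other.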
```bash
%%bash
torchrun --nproc_per_node 8 \
-m FlagEmbedding.finetune.embedder.decoder_only.base \
--model_name_or_path BAAI/bge-base-en-v1.5 \
--cache_dir ./cache/model \
--train_data ./msmarco/data/passage/training.json \
--cache_path ./cache/data \
--train_group_size 32 \
--query_max_len 256 \
--passage_max_len 512 \
--pad_to_multiple_of 8 \
--query_instruction_for_retrieval 'Represent this sentence for searching relevant passages: ' \
--query_instruction_format '<instruct>{}\n<query>{}' \
--knowledge_distillation False \
--output_dir ./test_bge-base-en-v1.5 \
--overwrite_output_dir \
--learning_rate 1e-5 \
--fp16 \
--num_train_epochs 1 \
--per_device_train_batch_size 16 \
--dataloader_drop_last True \
--warmup_ratio 0.1 \
--gradient_checkpointing \
--deepspeed config/ds_stage0.json \
--logging_steps 1 \
--save_steps 1000 \
--negatives_cross_device True \
--temperature 0.02 \
--sentence_pooling_method last_token \
--normalize_embeddings True \
--kd_loss_type kl_div \
--use_lora True \
--lora_rank 32 \
--lora_alpha 64 \
--target_modules q_proj k_proj v_proj o_proj gate_proj down_proj up_proj \
--save_merged_lora_model True
```
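As a side note, the file passed to --train_data is expected to contain one JSON object per line with the query, its positives, and its hard negatives. A hedged sketch of one such record (field values and the output path are placeholders; see the FlagEmbedding finetune examples for the authoritative schema):

```python
# Illustrative sketch of a single training record for --train_data;
# contents and the output path are placeholders, not real data.
import json

record = {
    "query": "what is the capital of france",
    "pos": ["Paris is the capital and largest city of France."],
    "neg": [
        "Lyon is a major city in east-central France.",
        "Berlin is the capital of Germany.",
    ],
}

with open("training.jsonl", "a") as f:  # hypothetical path
    f.write(json.dumps(record) + "\n")
```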