
Clarification on train_group_size and GPU Utilization for Negative Samples in Latest Version

zhongxifang opened this issue • 3 comments

I am currently attempting to fine-tune bge-m3 and have been referring to the following documentation: https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune/embedder#2-bge-m3

  1. The default value for train_group_size was previously set to 2 but has now been increased to 8 in the current version of the code. What is the reasoning behind this change, and what potential benefits or drawbacks should I be aware of?

  2. Additionally, could you confirm whether the current implementation utilizes all available GPUs and considers all passages within the batch as negative samples by default? I want to ensure that my understanding is correct.

Thank you!

zhongxifang • Nov 18, 2024

  1. During fine-tuning, the number of hard negatives for each query is train_group_size - 1, so a larger train_group_size is better.
  2. During fine-tuning, all in-batch passages across all GPUs are used as negatives (a simplified sketch of this is below).
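
To make this concrete, here is a simplified sketch of the loss (a toy illustration with made-up function and variable names, not the exact FlagEmbedding implementation):

import torch
import torch.nn.functional as F

def contrastive_loss(q_emb, p_emb, group_size, temperature=0.02):
    # q_emb: (B, d) query embeddings for this device.
    # p_emb: (total_passages, d) passage embeddings; with negatives_cross_device
    #        these are gathered from all GPUs. Passages for query i sit in rows
    #        [offset + i * group_size, offset + (i + 1) * group_size), and the
    #        first row of each group is the positive.
    scores = q_emb @ p_emb.T / temperature            # (B, total_passages)
    offset = 0  # would be rank * B * group_size across GPUs; 0 on a single process
    target = offset + torch.arange(q_emb.size(0)) * group_size
    # Cross-entropy over every passage: the (group_size - 1) hard negatives of the
    # query AND all passages belonging to other queries act as negatives.
    return F.cross_entropy(scores, target)

# Toy usage on one process: 4 queries, group_size 8, 16-dim embeddings.
q = F.normalize(torch.randn(4, 16), dim=-1)
p = F.normalize(torch.randn(4 * 8, 16), dim=-1)
print(contrastive_loss(q, p, group_size=8))

So increasing train_group_size only adds more hard negatives per query; the in-batch negatives come for free from the rest of the batch.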

545999961 • Nov 22, 2024

Completely understood! Thank you very much.

zhongxifang • Nov 22, 2024

Hi @545999961, consider the following example:

Query A: Has hard negative passages defined in the "neg" column (e.g., a specific passage known to be challenging).

Query B: Its positive passage is used as an in-batch negative for Query A.

Does the script combine both the hard negative from Query A’s "neg" column and the in-batch negative from Query B to form the complete set of negative samples for Query A? Or does it prioritize one over the other (e.g., using only the positive passage from Query B or using only the hard negatives from Query A)?
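
Concretely, my reading is the following (a toy illustration with made-up data, assuming the documented "query" / "pos" / "neg" JSONL layout; please correct me if this is wrong):

# Hypothetical two-query batch; passage texts are invented for illustration.
query_a = {
    "query": "what is deep learning",
    "pos": ["Deep learning is a subfield of machine learning ..."],
    "neg": ["A hard negative passage mined for query A ..."],  # the "neg" column
}
query_b = {
    "query": "capital of france",
    "pos": ["Paris is the capital of France ..."],
    "neg": ["A hard negative passage mined for query B ..."],
}

# With train_group_size = 2, each query contributes 1 positive + 1 hard negative.
group_a = query_a["pos"][:1] + query_a["neg"][:1]
group_b = query_b["pos"][:1] + query_b["neg"][:1]

# My understanding: every passage in the batch except A's own positive is a
# negative for A, so the mined hard negative and B's passages are combined
# rather than one source being prioritized over the other.
all_passages = group_a + group_b
negatives_for_a = [p for p in all_passages if p != query_a["pos"][0]]
print(negatives_for_a)  # 3 negatives: A's hard negative, B's positive, B's hard negative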

%%bash
torchrun --nproc_per_node 8 \
    -m FlagEmbedding.finetune.embedder.decoder_only.base \
    --model_name_or_path BAAI/bge-base-en-v1.5 \
    --cache_dir ./cache/model \
    --train_data ./msmarco/data/passage/training.json \
    --cache_path ./cache/data \
    --train_group_size 32 \
    --query_max_len 256 \
    --passage_max_len 512 \
    --pad_to_multiple_of 8 \
    --query_instruction_for_retrieval 'Represent this sentence for searching relevant passages: ' \
    --query_instruction_format '<instruct>{}\n<query>{}' \
    --knowledge_distillation False \
    --output_dir ./test_bge-base-en-v1.5 \
    --overwrite_output_dir \
    --learning_rate 1e-5 \
    --fp16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 16 \
    --dataloader_drop_last True \
    --warmup_ratio 0.1 \
    --gradient_checkpointing \
    --deepspeed config/ds_stage0.json \
    --logging_steps 1 \
    --save_steps 1000 \
    --negatives_cross_device True \
    --temperature 0.02 \
    --sentence_pooling_method last_token \
    --normalize_embeddings True \
    --kd_loss_type kl_div \
    --use_lora True \
    --lora_rank 32 \
    --lora_alpha 64 \
    --target_modules q_proj k_proj v_proj o_proj gate_proj down_proj up_proj \
    --save_merged_lora_model True
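
For reference, my back-of-the-envelope count of what each query gets scored against under this command, assuming every passage gathered across GPUs except the query's own positive acts as a negative:

# Rough count for the command above (hypothetical accounting, not measured output).
n_gpus = 8                      # --nproc_per_node 8
per_device_batch = 16           # --per_device_train_batch_size 16
group_size = 32                 # --train_group_size 32

passages_per_device = per_device_batch * group_size
total_passages = n_gpus * passages_per_device           # gathered via negatives_cross_device
hard_negs_per_query = group_size - 1                    # from the query's own "neg" list
in_batch_negs_per_query = total_passages - group_size   # passages of all other queries
print(total_passages, hard_negs_per_query, in_batch_negs_per_query)
# 4096 total passages -> 31 hard negatives + 4064 in-batch negatives per query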

icedpanda • Mar 25, 2025