
How Llama-2 & BGE are finetuned in LM-Cocktail

Open · cahuja1992 opened this issue 2 years ago · 1 comment

There is a table in the paper listing the train-test data splits for every domain, but the finetuning details are not included. Could you please provide those details as well? Specifically:

  • What are the hyperparameters used for finetuning?
  • How can the finetuned model be reproduced, if we need to do so?
  • Was any hyperparameter tuning performed when finetuning the base model on the target domain?

cahuja1992 · Jan 04 '24 08:01

Thanks for your interest in our work!

  1. We show the important hyperparameters in Section 3 (Experimental Setup). Here is the detailed command we used to finetune Llama-2 with the FastChat tool:
    --num_train_epochs 3
    --per_device_train_batch_size 2
    --per_device_eval_batch_size 2
    --gradient_accumulation_steps 8
    --evaluation_strategy "no"
    --save_strategy "steps"
    --save_steps 1200
    --save_total_limit 10
    --learning_rate 2e-5
    --weight_decay 0.
    --warmup_ratio 0.03
    --lr_scheduler_type 'cosine'
    --logging_steps 10
    --deepspeed ./ds_config.json
    --tf32 True
    --model_max_length 1024
    --gradient_checkpointing True
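
For reference, these flags are mostly standard Hugging Face TrainingArguments passed through to FastChat's training script. A minimal sketch of how they might be assembled into a full launch command follows; the script path, base model, data path, output directory, and GPU count are assumptions for illustration, not details confirmed in the paper:

```bash
# Hypothetical launch; model, data, and output paths are placeholders.
torchrun --nproc_per_node=8 fastchat/train/train_mem.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --data_path ./domain_train.json \
    --output_dir ./llama2-domain-finetuned \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1200 \
    --save_total_limit 10 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 10 \
    --deepspeed ./ds_config.json \
    --tf32 True \
    --model_max_length 1024 \
    --gradient_checkpointing True
```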

For BGE, the command used in FlagEmbedding is:
    --normlized True
    --temperature 0.02
    --do_train
    --train_data $DATA_PATH
    --query_max_len 48
    --passage_max_len 200
    --fp16
    --per_device_train_batch_size 32
    --sentence_pooling_method cls
    --save_steps 2000
    --train_group_size 8
    --learning_rate 2e-5
    --num_train_epochs ${EPOCH[i]}
    --negatives_cross_device
    --dataloader_num_workers 8
    --logging_steps 20
    --warmup_ratio 0.1
    --weight_decay 0.01
    --overwrite_output_dir True

These experiments were conducted on 8*A100 (40G) GPUs.
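
For completeness, here is a minimal sketch of how these flags could be assembled into a full launch, assuming the torchrun pattern from the FlagEmbedding finetune example; the base model, output directory, and GPU count are assumptions, while $DATA_PATH and ${EPOCH[i]} are kept as in the command above (note that --normlized is the spelling used by the script's argument parser):

```bash
# Hypothetical launch; base model, output dir, and GPU count are placeholders.
torchrun --nproc_per_node 8 \
    -m FlagEmbedding.baai_general_embedding.finetune.run \
    --output_dir ./bge-domain-finetuned \
    --model_name_or_path BAAI/bge-base-en-v1.5 \
    --normlized True \
    --temperature 0.02 \
    --do_train \
    --train_data $DATA_PATH \
    --query_max_len 48 \
    --passage_max_len 200 \
    --fp16 \
    --per_device_train_batch_size 32 \
    --sentence_pooling_method cls \
    --save_steps 2000 \
    --train_group_size 8 \
    --learning_rate 2e-5 \
    --num_train_epochs ${EPOCH[i]} \
    --negatives_cross_device \
    --dataloader_num_workers 8 \
    --logging_steps 20 \
    --warmup_ratio 0.1 \
    --weight_decay 0.01 \
    --overwrite_output_dir True
```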

  2. We provide the fine-tuned models on Hugging Face, so you can reproduce the experimental results with them directly. If you want to reproduce the finetuning itself, you can download the data from intfloat/llm-retriever-tasks. We can also share the preprocessed training data and will let you know when it is ready.
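
For example, the raw task data can be pulled from the Hugging Face Hub before preprocessing; a minimal sketch using huggingface-cli, assuming the repo is hosted as a dataset and using a placeholder local directory:

```bash
# Hypothetical download of the task data; --local-dir is a placeholder.
pip install -U "huggingface_hub[cli]"
huggingface-cli download intfloat/llm-retriever-tasks \
    --repo-type dataset \
    --local-dir ./llm-retriever-tasks
```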

  3. We simply selected a suitable set of parameters; no hyperparameter tuning was performed.

staoxiao · Jan 04 '24 09:01