How are Llama-2 & BGE finetuned in LM-Cocktail?
The paper includes a table with the train-test data splits for every domain, but the finetuning-related details are not there. Could you please provide those details as well? For example:
- What are the hyperparameters used for finetuning?
- If we need to reproduce the finetuned model, how can that be done?
- Is any hyperparameter tuning done for finetuning the base model on the target domain?
Thanks for your interest in our work!
- We list the important hyperparameters in Section 3: Experimental setup.
Here is the detailed command we used to finetune LLaMA with the FastChat tool:
```shell
--num_train_epochs 3
--per_device_train_batch_size 2
--per_device_eval_batch_size 2
--gradient_accumulation_steps 8
--evaluation_strategy "no"
--save_strategy "steps"
--save_steps 1200
--save_total_limit 10
--learning_rate 2e-5
--weight_decay 0.
--warmup_ratio 0.03
--lr_scheduler_type 'cosine'
--logging_steps 10
--deepspeed ./ds_config.json
--tf32 True
--model_max_length 1024
--gradient_checkpointing True
```
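For context, a full invocation could look like the sketch below. The GPU count matches the hardware mentioned later in this thread, but the script path, model path, data path, and output directory are placeholders (assumptions, not from the original post):

```shell
# Sketch only: paths are placeholders; substitute your own checkpoints and data.
# FastChat's training entry point accepts the hyperparameter flags listed above
# in addition to the model/data/output arguments shown here.
torchrun --nproc_per_node=8 fastchat/train/train.py \
    --model_name_or_path /path/to/llama-2-7b \
    --data_path /path/to/train_data.json \
    --output_dir ./output \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --learning_rate 2e-5 \
    --deepspeed ./ds_config.json
    # ...remaining flags as listed above
```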
For BGE, the command used with FlagEmbedding is:
```shell
--normlized True
--temperature 0.02
--do_train
--train_data $DATA_PATH
--query_max_len 48
--passage_max_len 200
--fp16
--per_device_train_batch_size 32
--sentence_pooling_method cls
--save_steps 2000
--train_group_size 8
--learning_rate 2e-5
--num_train_epochs ${EPOCH[i]}
--negatives_cross_device
--dataloader_num_workers 8
--logging_steps 20
--warmup_ratio 0.1
--weight_decay 0.01
--overwrite_output_dir True
```

(Note that `--normlized` is the flag's actual spelling in FlagEmbedding.)
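Similarly, a full FlagEmbedding launch might look like the following sketch. The module path matches FlagEmbedding's finetuning entry point at the time of writing, but the model name, data path, and output directory here are placeholder assumptions:

```shell
# Sketch only: model, data path, and output dir are placeholders.
# The hyperparameter flags listed above are appended to this invocation.
torchrun --nproc_per_node 8 \
    -m FlagEmbedding.baai_general_embedding.finetune.run \
    --model_name_or_path BAAI/bge-base-en \
    --train_data $DATA_PATH \
    --output_dir ./bge-finetuned \
    --normlized True \
    --temperature 0.02 \
    --per_device_train_batch_size 32
    # ...remaining flags as listed above
```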
These experiments were conducted on 8×A100 (40GB) GPUs.
- We provide the fine-tuned models on Hugging Face, so you can reproduce the experimental results with them. If you want to reproduce the finetuning itself, you can download the data from intfloat/llm-retriever-tasks. We can also share the preprocessed training data; we will let you know when it is ready.
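If it helps, the dataset can be fetched with the Hugging Face Hub CLI; the local directory name below is just an illustrative choice:

```shell
# Download the training data repository from the Hugging Face Hub.
# Requires the huggingface_hub CLI (pip install -U huggingface_hub).
huggingface-cli download intfloat/llm-retriever-tasks \
    --repo-type dataset \
    --local-dir ./llm-retriever-tasks
```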
- No. We simply selected a reasonable set of parameters; no hyperparameter tuning was performed.