Phi-3CookBook
LoRA fine-tuning Phi-3.5 MoE
Hi,
I recently fine-tuned the Phi-3.5-MoE-instruct and Phi-3.5-mini-instruct models using PEFT LoRA. The MoE model seems to perform much worse than 3.5 Mini. Are there any specific things to keep in mind when LoRA fine-tuning a mixture-of-experts model? Also, during fine-tuning of the MoE model the validation loss shows as "No Log".
Can you share your training results and hyperparameters? Did you fine-tune using Unsloth?
Sorry, there is no Unsloth support for the phi-3.5-moe-instruct model.
The training loss keeps decreasing, but the validation loss always shows as "No Log".
Below are the hyperparameters:
"base_model": "microsoft/Phi-3.5-MoE-instruct", "max_seq_length": 4096,
"lora_config":
{
"rank": 32,
"alpha": 32,
"task_type": "CAUSAL_LM",
"target_modules": ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj",], #"all-linear",
"rslora": True,
"bias": "lora_only"
},
"quantization_config":
{
"load_in_4bit": False,
"bnb_4bit_quant_type": "nf4",
"bnb_4bit_use_double_quant": True
},
"training_arguments":
{
"logging_dir": f"logs/{get_iter_name(__file__)}/",
"output_dir":f"models/{get_iter_name(__file__)}/",
"evaluation_strategy": "steps",
"save_strategy": "steps",
"logging_strategy": "steps",
"learning_rate": 2e-4,
"weight_decay": 0.1,
"logging_steps": 5,
"eval_steps": 10,
"save_steps": 10,
"eval_delay": 100,
"warmup_steps": 0,
"save_total_limit": 5,
"optim": "adamw_torch_fused",
"per_device_train_batch_size": 2*2*2*2,
"per_device_eval_batch_size": 2*2*2*2,
"gradient_accumulation_steps": 4*2*2,
"eval_accumulation_steps": 4*2*2,
"gradient_checkpointing": True,
"adam_beta1": 0.9,
"adam_beta2": 0.95,
"adam_epsilon": 1e-8,
"max_grad_norm": 1.0,
"lr_scheduler_type": 'cosine',
"num_train_epochs": 1,
"continue_from_checkpoint": True,
"fp16": False,
"fp16_full_eval": False,
"bf16": True,
"bf16_full_eval": True,
},
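One possible cause of the "No Log" validation loss, judging only from the config above: with "eval_delay": 100 and "eval_steps": 10, the Hugging Face Trainer performs no evaluation before global step 100, so a one-epoch run with fewer than 100 optimizer steps never evaluates and the progress table shows "No log". A rough sketch of the arithmetic (the dataset size and GPU count are hypothetical placeholders):

num_train_examples = 10_000          # hypothetical placeholder
per_device_train_batch_size = 16     # 2*2*2*2 from the config above
gradient_accumulation_steps = 16     # 4*2*2 from the config above
num_gpus = 1                         # hypothetical placeholder
num_train_epochs = 1

# One optimizer step consumes batch_size * accumulation * gpus examples.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
total_steps = (num_train_examples // effective_batch) * num_train_epochs
print(f"total optimizer steps: {total_steps}")  # 39 with these numbers

eval_delay = 100  # from the config above
if total_steps < eval_delay:
    print("evaluation never runs, so the Trainer reports 'No log' for validation loss")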
The phi-3.5-mini-instruct model has o_proj and qkv_proj, so why are you adding ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"] as target modules? See the layers in the image below and check what you are actually targeting. It also does not have "gate_proj" and "up_proj"; it has gate_up_proj.
I also fine-tuned the phi-3.5-mini-instruct model, and it does report validation loss.
I am not using phi-3.5-mini; I am using the phi-3.5-moe-instruct model.
https://github.com/microsoft/Phi-3CookBook/blob/main/code/04.Finetuning/Phi-3-finetune-qlora-python.ipynb
The example script in this cookbook appears to be misleading, as it shows
target_modules = ['k_proj', 'q_proj', 'v_proj', 'o_proj', 'gate_proj', 'down_proj', 'up_proj'],
which is incorrect based on the model's architecture.
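For reference, a LoraConfig that targets the fused projection names Phi-3.5-mini actually uses might look like the sketch below (rank and alpha copied from the config earlier in this thread; this is an illustrative assumption, not an official fix from the cookbook):

from peft import LoraConfig

# Phi-3.5-mini fuses the attention and MLP projections, so target the
# fused names (qkv_proj, gate_up_proj) rather than q_proj/k_proj/v_proj
# and gate_proj/up_proj, which do not exist in this architecture.
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    task_type="CAUSAL_LM",
    target_modules=["qkv_proj", "o_proj", "gate_up_proj", "down_proj"],
)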
@sofyc I think "target_modules": ['k_proj', 'q_proj', 'v_proj', 'o_proj'] is okay.
However, when I fine-tune like this, only o_proj is adjusted by LoRA; this is because there is a single qkv_proj layer.
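To confirm which names are valid targets before training, one option is to list the leaf names of the model's nn.Linear modules, since those are the strings PEFT matches target_modules against. A minimal sketch, assuming enough CPU memory to load the checkpoint:

import torch
from transformers import AutoModelForCausalLM

# Load the model once to inspect its layer names (bfloat16 halves the
# memory footprint versus float32).
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3.5-mini-instruct",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
)

# Collect the distinct leaf-module names of all nn.Linear layers; PEFT
# matches target_modules against these suffixes.
linear_names = {
    name.split(".")[-1]
    for name, module in model.named_modules()
    if isinstance(module, torch.nn.Linear)
}
print(sorted(linear_names))
# For Phi-3.5-mini this should include the fused 'qkv_proj' and
# 'gate_up_proj' plus 'o_proj' and 'down_proj', not separate
# q/k/v or gate/up projections.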

