Phi-3CookBook
LoRA fine-tuning Phi-3.5 MoE
Hi,
I recently fine-tuned the Phi-3.5-MoE-instruct and Phi-3.5-mini-instruct models using PEFT LoRA. The MoE model seems to perform much worse than 3.5 Mini. Are there any specific things to keep in mind when LoRA fine-tuning a mixture-of-experts model? Also, during fine-tuning of the MoE model the validation loss shows as "No Log".
Can you share your training results and hyperparameters? Did you fine-tune using Unsloth?
Sorry, there is no Unsloth support for the phi-3.5-moe-instruct model.
The training loss keeps decreasing, but the validation loss always shows as "No Log".
Below are the hyperparameters:
"base_model": "microsoft/Phi-3.5-MoE-instruct", "max_seq_length": 4096,
"lora_config":
{
"rank": 32,
"alpha": 32,
"task_type": "CAUSAL_LM",
"target_modules": ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj",], #"all-linear",
"rslora": True,
"bias": "lora_only"
},
"quantization_config":
{
"load_in_4bit": False,
"bnb_4bit_quant_type": "nf4",
"bnb_4bit_use_double_quant": True
},
"training_arguments":
{
"logging_dir": f"logs/{get_iter_name(__file__)}/",
"output_dir":f"models/{get_iter_name(__file__)}/",
"evaluation_strategy": "steps",
"save_strategy": "steps",
"logging_strategy": "steps",
"learning_rate": 2e-4,
"weight_decay": 0.1,
"logging_steps": 5,
"eval_steps": 10,
"save_steps": 10,
"eval_delay": 100,
"warmup_steps": 0,
"save_total_limit": 5,
"optim": "adamw_torch_fused",
"per_device_train_batch_size": 2*2*2*2,
"per_device_eval_batch_size": 2*2*2*2,
"gradient_accumulation_steps": 4*2*2,
"eval_accumulation_steps": 4*2*2,
"gradient_checkpointing": True,
"adam_beta1": 0.9,
"adam_beta2": 0.95,
"adam_epsilon": 1e-8,
"max_grad_norm": 1.0,
"lr_scheduler_type": 'cosine',
"num_train_epochs": 1,
"continue_from_checkpoint": True,
"fp16": False,
"fp16_full_eval": False,
"bf16": True,
"bf16_full_eval": True,
},
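One possible cause of the "No Log" validation loss, judging only from the config above: with "eval_delay": 100 and "eval_steps": 10, the Hugging Face Trainer performs no evaluation before global step 100, so a one-epoch run with fewer than 100 optimizer steps never evaluates and the progress table shows "No log". A rough sketch of the arithmetic (the dataset size and GPU count are hypothetical placeholders):

num_train_examples = 10_000          # hypothetical placeholder
per_device_train_batch_size = 16     # 2*2*2*2 from the config above
gradient_accumulation_steps = 16     # 4*2*2 from the config above
num_gpus = 1                         # hypothetical placeholder
num_train_epochs = 1

# One optimizer step consumes batch_size * accumulation * gpus examples.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
total_steps = (num_train_examples // effective_batch) * num_train_epochs
print(f"total optimizer steps: {total_steps}")  # 39 with these numbers

eval_delay = 100  # from the config above
if total_steps < eval_delay:
    print("evaluation never runs, so the Trainer reports 'No log' for validation loss")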
The phi-3.5-mini-instruct model has o_proj and qkv_proj, so why are you adding ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"] as target modules? See the layers in the image below and check what you are actually targeting. It also does not have "gate_proj" and "up_proj"; it has gate_up_proj.
I also fine-tuned the phi-3.5-mini-instruct model, and it does report validation loss.
I am not using phi-3.5-mini; I am using the phi-3.5-moe-instruct model.
https://github.com/microsoft/Phi-3CookBook/blob/main/code/04.Finetuning/Phi-3-finetune-qlora-python.ipynb
The example script in this cookbook appears to be misleading, as it shows
target_modules = ['k_proj', 'q_proj', 'v_proj', 'o_proj', 'gate_proj', 'down_proj', 'up_proj'],
which is incorrect based on the model's architecture.
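For reference, a LoraConfig that targets the fused projection names Phi-3.5-mini actually uses might look like the sketch below (rank and alpha copied from the config earlier in this thread; this is an illustrative assumption, not an official fix from the cookbook):

from peft import LoraConfig

# Phi-3.5-mini fuses the attention and MLP projections, so target the
# fused names (qkv_proj, gate_up_proj) rather than q_proj/k_proj/v_proj
# and gate_proj/up_proj, which do not exist in this architecture.
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    task_type="CAUSAL_LM",
    target_modules=["qkv_proj", "o_proj", "gate_up_proj", "down_proj"],
)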
@sofyc I think "target_modules": ['k_proj', 'q_proj', 'v_proj', 'o_proj'] is okay.
However, when I fine-tune like this, only o_proj is adjusted by LoRA; this is because there is a single qkv_proj layer.
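To confirm which names are valid targets before training, one option is to list the leaf names of the model's nn.Linear modules, since those are the strings PEFT matches target_modules against. A minimal sketch, assuming enough CPU memory to load the checkpoint:

import torch
from transformers import AutoModelForCausalLM

# Load the model once to inspect its layer names (bfloat16 halves the
# memory footprint versus float32).
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3.5-mini-instruct",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
)

# Collect the distinct leaf-module names of all nn.Linear layers; PEFT
# matches target_modules against these suffixes.
linear_names = {
    name.split(".")[-1]
    for name, module in model.named_modules()
    if isinstance(module, torch.nn.Linear)
}
print(sorted(linear_names))
# For Phi-3.5-mini this should include the fused 'qkv_proj' and
# 'gate_up_proj' plus 'o_proj' and 'down_proj', not separate
# q/k/v or gate/up projections.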

