Awni Hannun
> In my experiment of fine-tuning on a complex dataset (guanaco)

Could you point me to this dataset? I can try training on it to see if I can repro...

Sorry, I am still not understanding 100%:

> Even though I want to overfit it, the loss doesn't go down to 0.6x

This is with WikiSQL right? Just curious, where...
Regarding capacity issues, some simple things to try are:

1. Add more LoRA layers (like you have done), but also make more of the linear layers work with LoRA (like...
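For point 1, here is a rough sketch of what wrapping more of the linear layers could look like. The `LoRALinear` below is a minimal stand-in for the helper in the mlx-examples LoRA script, the attribute names (`self_attn.q_proj`, `mlp.up_proj`, ...) are illustrative and depend on the model implementation, and `model` is assumed to be an already-loaded (and frozen) transformer:

```python
import math
import mlx.core as mx
import mlx.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper: y = linear(x) + scale * (x @ A) @ B."""

    @staticmethod
    def from_linear(linear: nn.Linear, rank: int = 8):
        out_dims, in_dims = linear.weight.shape
        lora = LoRALinear(in_dims, out_dims, rank)
        lora.linear = linear  # keep the original (frozen) weights
        return lora

    def __init__(self, in_dims: int, out_dims: int, rank: int = 8, scale: float = 2.0):
        super().__init__()
        self.linear = nn.Linear(in_dims, out_dims)
        self.scale = scale
        bound = 1.0 / math.sqrt(in_dims)
        self.lora_a = mx.random.uniform(low=-bound, high=bound, shape=(in_dims, rank))
        self.lora_b = mx.zeros((rank, out_dims))

    def __call__(self, x):
        return self.linear(x) + self.scale * (x @ self.lora_a) @ self.lora_b

# Adapt more projections in each of the last `lora_layers` blocks
# (attribute names are hypothetical and model-dependent).
lora_layers = 16
for layer in model.model.layers[-lora_layers:]:
    layer.self_attn.q_proj = LoRALinear.from_linear(layer.self_attn.q_proj)
    layer.self_attn.v_proj = LoRALinear.from_linear(layer.self_attn.v_proj)
    layer.mlp.up_proj = LoRALinear.from_linear(layer.mlp.up_proj)
    layer.mlp.down_proj = LoRALinear.from_linear(layer.mlp.down_proj)
```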
I think it's a good idea to open issues in mlx for the missing grads. I'm still not certain that is the problem here though. Sorry I have been intending...
> Missing grad for mx.argpartition; in that case, are we still able to fine-tune the experts' MLP layers?

Yes, there can still be gradient to those layers. Did you try...
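Here is a tiny sketch (not the actual Phixtral code) of why: `mx.argpartition` only produces integer indices, so there is nothing to differentiate through there, but the loss still depends on the gate scores and the selected experts' outputs, so their weights get gradients:

```python
import mlx.core as mx
import mlx.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, dims: int = 8, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.gate = nn.Linear(dims, num_experts)
        self.experts = [nn.Linear(dims, dims) for _ in range(num_experts)]
        self.top_k = top_k

    def __call__(self, x):
        scores = self.gate(x)
        # Integer indices: no gradient flows through argpartition itself
        inds = mx.argpartition(-scores, kth=self.top_k - 1)[: self.top_k]
        weights = mx.softmax(scores[inds])
        # Gradient still reaches the gate (via scores[inds]) and the
        # selected experts (via their outputs)
        return sum(w * self.experts[i](x) for w, i in zip(weights, inds.tolist()))

moe = TinyMoE()
x = mx.random.normal((8,))

def loss_fn(model, x):
    return (model(x) ** 2).sum()

loss, grads = nn.value_and_grad(moe, loss_fn)(moe, x)
# `grads` has non-zero entries for the gate and for the two selected
# experts' weights, even though the top-k selection op has no gradient.
```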
Awesome, let me know how it goes!! I was just planning to try some MOE fine-tunes to see how it all works myself.
> python download_dataset.py WizardLM/WizardLM_evol_instruct_70k

@mzbac is that dataset a good one to try to see how well the MOE fine-tuning works?

@mzbac I tried LoRA fine-tuning Phixtral on the wizard dataset and it works pretty well as far as I can tell:

```
Iter 1: Val loss 1.716, Val took...
```
> the Phixtral doesn't have a noise linear layer in the implementation

Right, I did not train with that. It might be collapsing to one expert... I will see if...
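For reference, a noisy gate in the style of the original sparse-MoE paper looks roughly like this (a hypothetical sketch, not Phixtral's actual code): a second linear layer scales Gaussian noise added to the gate logits during training, which makes the top-k selection less likely to collapse onto the same experts:

```python
import mlx.core as mx
import mlx.nn as nn

class NoisyGate(nn.Module):
    """Gate logits with learned, input-dependent noise applied during training."""

    def __init__(self, dims: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(dims, num_experts, bias=False)
        self.noise = nn.Linear(dims, num_experts, bias=False)

    def __call__(self, x, training: bool = True):
        logits = self.gate(x)
        if training:
            # Per-expert noise scale, kept positive with softplus
            noise_scale = nn.softplus(self.noise(x))
            logits = logits + noise_scale * mx.random.normal(logits.shape)
        return logits
```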
So the model (as you've pointed out before) starts out only using two experts for every token, and it stays that way during LoRA fine-tuning. I checked the grad of...
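In case it's useful, this is roughly how I'd count expert usage from the gate logits (a hypothetical helper; how you collect `gate_logits` depends on the model implementation):

```python
import mlx.core as mx

def expert_usage(gate_logits: mx.array, top_k: int = 2) -> mx.array:
    """Count how many tokens route to each expert given (tokens, experts) logits."""
    num_experts = gate_logits.shape[-1]
    inds = mx.argpartition(-gate_logits, kth=top_k - 1, axis=-1)[..., :top_k]
    return mx.stack([(inds == e).sum() for e in range(num_experts)])

# A healthy router spreads tokens across experts; a collapsed one
# concentrates nearly all of the counts on one or two of them.
print(expert_usage(mx.random.normal((128, 4))))
```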