Awni Hannun
> In my experiment of fine-tuning on a complex dataset (guanaco)

Could you point me to this dataset? I can try training on it to see if I can repro...

Sorry, I am still not understanding 100%:

> Even though I want to overfit it, the loss doesn't go down to 0.6x

This is with WikiSQL right? Just curious, where...
Regarding capacity issues, some simple things to try are:

1. Add more LoRA layers (like you have done), but also make more of the linear layers work with LoRA (like...
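For point 1, here is a rough sketch of what wrapping more of the linear layers could look like. The `LoRALinear` below is a minimal stand-in for the helper in the mlx-examples LoRA script, the attribute names (`self_attn.q_proj`, `mlp.up_proj`, ...) are illustrative and depend on the model implementation, and `model` is assumed to be an already-loaded (and frozen) transformer:

```python
import math
import mlx.core as mx
import mlx.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper: y = linear(x) + scale * (x @ A) @ B."""

    @staticmethod
    def from_linear(linear: nn.Linear, rank: int = 8):
        out_dims, in_dims = linear.weight.shape
        lora = LoRALinear(in_dims, out_dims, rank)
        lora.linear = linear  # keep the original (frozen) weights
        return lora

    def __init__(self, in_dims: int, out_dims: int, rank: int = 8, scale: float = 2.0):
        super().__init__()
        self.linear = nn.Linear(in_dims, out_dims)
        self.scale = scale
        bound = 1.0 / math.sqrt(in_dims)
        self.lora_a = mx.random.uniform(low=-bound, high=bound, shape=(in_dims, rank))
        self.lora_b = mx.zeros((rank, out_dims))

    def __call__(self, x):
        return self.linear(x) + self.scale * (x @ self.lora_a) @ self.lora_b

# Adapt more projections in each of the last `lora_layers` blocks
# (attribute names are hypothetical and model-dependent).
lora_layers = 16
for layer in model.model.layers[-lora_layers:]:
    layer.self_attn.q_proj = LoRALinear.from_linear(layer.self_attn.q_proj)
    layer.self_attn.v_proj = LoRALinear.from_linear(layer.self_attn.v_proj)
    layer.mlp.up_proj = LoRALinear.from_linear(layer.mlp.up_proj)
    layer.mlp.down_proj = LoRALinear.from_linear(layer.mlp.down_proj)
```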
I think it's a good idea to open issues in mlx for the missing grads. I'm still not certain that is the problem here though. Sorry I have been intending...
> Missing grad for mx.argpartition; in that case, are we still able to fine-tune the experts' MLP layers?

Yes, there can still be gradient to those layers. Did you try...
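Here is a tiny sketch (not the actual Phixtral code) of why: `mx.argpartition` only produces integer indices, so there is nothing to differentiate through there, but the loss still depends on the gate scores and the selected experts' outputs, so their weights get gradients:

```python
import mlx.core as mx
import mlx.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, dims: int = 8, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.gate = nn.Linear(dims, num_experts)
        self.experts = [nn.Linear(dims, dims) for _ in range(num_experts)]
        self.top_k = top_k

    def __call__(self, x):
        scores = self.gate(x)
        # Integer indices: no gradient flows through argpartition itself
        inds = mx.argpartition(-scores, kth=self.top_k - 1)[: self.top_k]
        weights = mx.softmax(scores[inds])
        # Gradient still reaches the gate (via scores[inds]) and the
        # selected experts (via their outputs)
        return sum(w * self.experts[i](x) for w, i in zip(weights, inds.tolist()))

moe = TinyMoE()
x = mx.random.normal((8,))

def loss_fn(model, x):
    return (model(x) ** 2).sum()

loss, grads = nn.value_and_grad(moe, loss_fn)(moe, x)
# `grads` has non-zero entries for the gate and for the two selected
# experts' weights, even though the top-k selection op has no gradient.
```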
Awesome, let me know how it goes!! I was just planning to try some MOE fine-tunes to see how it all works myself.
> python download_dataset.py WizardLM/WizardLM_evol_instruct_70k

@mzbac is that dataset a good one to try to see how well the MOE fine-tuning works?

@mzbac I tried LoRA fine-tuning Phixtral on the wizard dataset and it works pretty well as far as I can tell:

```
Iter 1: Val loss 1.716, Val took...
```
> the Phixtral doesn't have a noise linear layer in the implementation

Right, I did not train with that. It might be collapsing to one expert... I will see if...
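For reference, a noisy gate in the style of the original sparse-MoE paper looks roughly like this (a hypothetical sketch, not Phixtral's actual code): a second linear layer scales Gaussian noise added to the gate logits during training, which makes the top-k selection less likely to collapse onto the same experts:

```python
import mlx.core as mx
import mlx.nn as nn

class NoisyGate(nn.Module):
    """Gate logits with learned, input-dependent noise applied during training."""

    def __init__(self, dims: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(dims, num_experts, bias=False)
        self.noise = nn.Linear(dims, num_experts, bias=False)

    def __call__(self, x, training: bool = True):
        logits = self.gate(x)
        if training:
            # Per-expert noise scale, kept positive with softplus
            noise_scale = nn.softplus(self.noise(x))
            logits = logits + noise_scale * mx.random.normal(logits.shape)
        return logits
```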
So the model (as you've pointed out before) starts out only using two experts for every token, and it stays that way during LoRA fine-tuning. I checked the grad of...
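In case it's useful, this is roughly how I'd count expert usage from the gate logits (a hypothetical helper; how you collect `gate_logits` depends on the model implementation):

```python
import mlx.core as mx

def expert_usage(gate_logits: mx.array, top_k: int = 2) -> mx.array:
    """Count how many tokens route to each expert given (tokens, experts) logits."""
    num_experts = gate_logits.shape[-1]
    inds = mx.argpartition(-gate_logits, kth=top_k - 1, axis=-1)[..., :top_k]
    return mx.stack([(inds == e).sum() for e in range(num_experts)])

# A healthy router spreads tokens across experts; a collapsed one
# concentrates nearly all of the counts on one or two of them.
print(expert_usage(mx.random.normal((128, 4))))
```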