Awni Hannun
Hmm, some questions:

- Was it a quantized model that you fine-tuned or an fp16 model?
- Is the fused model even worse than the original baseline phi-2? Sometimes fusing...
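One quick way to check is to compare generations from the fused model against the unmodified base. A minimal sketch, assuming the fused weights were saved to `./fused_model` (the path and prompt are placeholders):

```
# Generate with the fused model (path is a placeholder)
python -m mlx_lm.generate --model ./fused_model --prompt "..." --max-tokens 100

# Generate with the original phi-2 for comparison
python -m mlx_lm.generate --model microsoft/phi-2 --prompt "..." --max-tokens 100
```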
That sounds awesome, let us know how it goes!
The `--model` flag should point to the original model (Hugging Face repo or local path) that you fine-tuned with.

1. Could you share the output of:

```
ls /Users/antoine/Documents/GitHub/EVD.COVID_ANALYSIS/EVD.COVID_ANALYSIS/Models.nosync/Mistral/Mistral-7B-Instruct-v0.2/...
```
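For example, a fuse invocation might look like this. This is just a sketch assuming the lora example's `fuse.py`; the adapter filename and exact flags may differ across versions:

```
# --model can be the Hugging Face repo id you fine-tuned from...
python fuse.py --model mistralai/Mistral-7B-Instruct-v0.2 --adapter-file adapters.npz

# ...or a local path to those same original weights
python fuse.py --model /path/to/Mistral-7B-Instruct-v0.2 --adapter-file adapters.npz
```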
Certainly, once the loss is `nan` the model adapters aren't going to work. Could you share a command you used to reproduce that?
Which Mixtral model are you using? Is it quantized, fp16, or bf16? I will try running on the WikiSQL example dataset (sketched below), but it may not reproduce there, which could...
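For reference, the run I have in mind is roughly the stock LoRA example. A sketch only: the repo id, data path, and flag names are assumed from the mlx-examples lora scripts and may vary by version:

```
# Fine-tune on the bundled WikiSQL data from the lora example directory
python lora.py --model mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --train \
    --data data \
    --iters 1000 \
    --batch-size 4
```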
> AttributeError: 'MixtralModel' object has no attribute 'astype'

Sorry, about that error above: what command is it from?
> Maybe unrelated, while I was fine-tuning the Gemma MoE model with bfloat16, I noticed that after a few iterations the loss became NaN. However, if I downgrade the MLX...
There was indeed a bug introduced between 0.3 and 0.4 which seems to have broken MoE training. Sorry about that! We'll try to do a better job testing for these...
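In the meantime, pinning MLX to the last pre-0.4 release should work around it. The version numbers here are illustrative; upgrade again once the fix lands:

```
# Pin to a pre-0.4 release as a temporary workaround
pip install "mlx==0.3.0"

# Later, pick up the fix
pip install -U mlx
```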
Mixtral is training fine for me now. I will post a log after it runs for 1000 iterations to be sure.
Seems to be working indeed:

```
Starting training..., iters: 3000
Iter 1: Val loss 2.480, Val took 0.732s
Iter 10: Train loss 2.363, Learning Rate 1.000e-05, It/sec 1.022, Tokens/sec 88.083, ...
```