Awni Hannun
Sure, we can add Phi to our LoRA example! @mzbac already did some great work to merge Phi into the generation example. So from there it should be pretty straightforward....
> I would love this. I tried unfreezing the model but that just leads to NaN loss.

@mzbac makes good points. LoRA fine-tuning is much more stable for these large...
You can do something like:

```python
module.update(tree_map(lambda p: p.astype(mx.float32), module.parameters()))
```
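Spelled out with the imports, and narrowed to specific submodules (e.g. the MoE gate/router layers that seem to be the trouble here), that could look like the sketch below; the `"gate"` substring match is just a guess at how the modules are named in your model:

```python
from mlx.utils import tree_map
import mlx.core as mx
import mlx.nn as nn

# Upcast only the router/gate linears to float32 and leave the rest of the
# model untouched. The "gate" name match is hypothetical; adjust it to however
# the modules are actually named in your model.
for name, m in model.named_modules():
    if "gate" in name and isinstance(m, nn.Linear):
        m.update(tree_map(lambda p: p.astype(mx.float32), m.parameters()))
```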
It shouldn't be 1/10th.. that probably means it's swapping :\. Unfortunately, fine-tuning in 32-bit precision is very memory hungry.. it's uncommon to use 32-bit even for pre-training with such large...
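(Back-of-envelope, assuming plain Adam: about 4 bytes/parameter for the weights, 4 for the gradients, and 8 for the two moment buffers, so roughly 16 bytes per parameter before activations. Even a 7B-parameter model is then on the order of 100 GB in full 32-bit.)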
People do float16 and bfloat16, but both cases (typically) require modifications to actually make full training work. bfloat16 is easier than float16, but it still often won't work with a...
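The cast itself is the same pattern as the float32 snippet above; the "modifications" are mostly about keeping the numerics stable (e.g. loss scaling if you go with float16). A minimal sketch, assuming `model` is the MLX module being trained:

```python
from mlx.utils import tree_map
import mlx.core as mx

# Cast every parameter to bfloat16. Note the cast alone doesn't guarantee
# stable training; float16 in particular usually also needs loss scaling.
model.update(tree_map(lambda p: p.astype(mx.bfloat16), model.parameters()))
```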
> I tried converting all the weights after the model is created to float16, but that didn't work.

What exactly do you mean by "didn't work"? In general that should...
@danilopeixoto does this command:

```
python -m mlx_lm.lora --train --model models/mixtral-8x7b-v0.1-8bit-64g/ --data datasets/chat-instruct/ --steps-per-report 1 --steps-per-eval 15 --save-every 15 --iters 500 --lora-layers 16 --batch-size 2
```

still produce NaN for...
> In addition, the experiment was using `lambda m: isinstance(m, nn.Linear)` as `linear_class_predicate`.

That's another good tidbit. I never tried with quantized gates. It may not work.
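One way to keep the gates unquantized is a stricter predicate along these lines (a sketch; the `!= 8` check assumes the Mixtral router, which has one output feature per expert):

```python
import mlx.nn as nn

# Quantize ordinary linear layers but skip the MoE gate/router, which has only
# 8 output features (one per Mixtral expert) and may not play well with
# quantization.
linear_class_predicate = (
    lambda m: isinstance(m, nn.Linear) and m.weight.shape[0] != 8
)
```

This gets passed to the quantization step in place of the default predicate; the exact call it plugs into depends on the MLX version.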
I see.. we recently fixed a bug in our quantized kernels (https://github.com/ml-explore/mlx/pull/677) which may be related to this, so maybe it will work in `0.3.0`. Just to be sure it...
@mzbac this is ready for review now, right?