Fusing adapters with llama3 causes bad performance
Hello,
I'm using the script below to fine-tune the llama3 model on a custom dataset of questions and answers, using the {"prompt": "", "completion": ""} format defined here.
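For illustration (these are made-up values, not my real data), each line of the dataset roughly has this shape:

{"prompt": "How much does plan X cost per month?", "completion": "Plan X costs 42 EUR per month."}

The script: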
#!/usr/bin/env bash

DATA_PATH=data
ADAPTERS_PATH=adapters
MODEL_NAME=meta-llama/Meta-Llama-3-8B
MODEL_PATH=models/mlx
LORA_CONFIG_PATH=lora_config.yaml
NAME="my-assistant"
ITERATIONS=1000

# Parse options
while [[ "$#" -gt 0 ]]; do
  case $1 in
    --name)
      if [[ -z "$2" ]] || [[ "$2" == --* ]]; then
        echo "Error: Model name cannot be empty"
        echo "Usage: fine-tune.sh --iter <number of iterations> --name <output model name>"
        exit 1
      fi
      NAME="$2"
      shift
      ;;
    --iter)
      if [[ "$2" -lt 100 || "$2" -gt 10000 ]]; then
        echo "Error: Iteration value must be an integer between 100 and 10000"
        exit 1
      fi
      ITERATIONS="$2"
      shift
      ;;
    *)
      echo "Unknown option: $1."
      echo "Usage: fine-tune.sh --iter <number of iterations> --name <output model name>"
      exit 1
      ;;
  esac
  shift
done

echo "Fine-tuning with $ITERATIONS iterations"
echo "Output model name: $NAME"

FINE_TUNED_MODEL_PATH=models/$NAME
GGUF_MODEL_PATH=models/$NAME.gguf

set -ex

# Install llama.cpp if needed
if [ ! -d "llama.cpp" ]; then
  git clone git@github.com:ggerganov/llama.cpp.git
  cd llama.cpp
  make -j 8
  cd ..
else
  echo "Directory llama.cpp found, skipping download and build"
fi

# Download & quantize the HuggingFace model to reduce the weights' memory footprint
if [ ! -d "$MODEL_PATH" ]; then
  echo "No models found, initiating quantization..."
  python -m mlx_lm.convert \
    --hf-path "$MODEL_NAME" \
    --mlx-path "$MODEL_PATH" \
    -q
else
  echo "Model found in $MODEL_PATH, skipping quantization"
fi

# Launch fine-tuning
python -m mlx_lm.lora \
  --data "$DATA_PATH" \
  --model "$MODEL_PATH" \
  --train \
  --iters "$ITERATIONS" \
  --config "$LORA_CONFIG_PATH"

# Merge the model and fine-tuned adapters
python -m mlx_lm.fuse \
  --model "$MODEL_PATH" \
  --adapter-path "$ADAPTERS_PATH" \
  --save-path "$FINE_TUNED_MODEL_PATH" \
  --de-quantize

# Generate a GGUF file to use the new model with Ollama
python llama.cpp/convert-hf-to-gguf.py "$FINE_TUNED_MODEL_PATH" \
  --outfile "$GGUF_MODEL_PATH" \
  --outtype q8_0

# Register the model with Ollama
ollama create "$NAME" -f Modelfile
I trained the model over 1000 iterations with the following config parameters:
# Number of validation batches, -1 uses the entire validation set.
val_batches: -1
# Adam learning rate.
learning_rate: 1e-5
# Number of training steps between loss reporting.
steps_per_report: 10
# Number of training steps between validations.
steps_per_eval: 50
# Save the model every N iterations.
save_every: 100
# Evaluate on the test set after training
test: true
# Number of test set batches, -1 uses the entire test set.
test_batches: 100
lora_parameters:
  # The layer keys to apply LoRA to.
  # These will be applied for the last lora_layers
  rank: 8
  scale: 20.0
  dropout: 0.0
The Modelfile is below:
FROM ./models/my-assistant.gguf
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ if .Prompt }}<|im_start|>user
{{ .Prompt }}<|im_end|>
{{ end }}<|im_start|>assistant
"""
PARAMETER stop "<|im_end|>"
PARAMETER stop "<|im_start|>"
PARAMETER temperature 0
During training I get the following message, but I don't know how I'm supposed to address it with the mlx_lm options:
No chat template is set for this tokenizer, falling back to a default class-level template. This is very error-prone, because models are often trained with templates different from the class default! Default chat templates are a legacy feature and will be removed in Transformers v4.43, at which point any code depending on them will stop working. We recommend setting a valid chat template before then to ensure that this model continues working without issues.
The final loss on the training set (1300 samples) is around 0.16 and on the validation set it is 0.26. When running the generate command:
python -m mlx_lm.generate --model models/mlx --adapter-path adapters --prompt "<question>"
I get very good results and nothing to complain about.
However, when I use the fused model generated by my script following this documentation, the performance is very bad and the answers contain random values when I ask for specific amounts the model was trained on. The performance is as bad as if I hadn't done any fine-tuning.
Did I do something wrong, or is the fusing process expected to degrade quality this much? Is there a way to export the model used by the mlx_lm.generate command to GGUF instead of relying on mlx_lm.fuse?
Thank you.
Facing pretty much the same problem on my end too with a different model (Mistral). #849
I'm not certain this is the problem so it would be good to validate it. But fusing can cause precision issues. In low precision, c = a + b can give very inexact results if a and b have very different magnitudes: for example, if a is big and b is small then c = a + b = a. In your case, if the adapters have small values and the original weight matrix has large values, then fusing can wipe out the adapters, effectively leaving you with the baseline model.
Now, I'm not sure that's happening. There are a couple of things you could do to check:
- Inspect the magnitudes of the weights and adapters
- Try fusing and running the model in higher precision (e.g. fp32) just as a test that it works.
- Using a larger scale sometimes (but not always) helps here also.
Another option is to avoid fusing entirely. There may be a way to run unfused models with llama.cpp (see https://github.com/ml-explore/mlx-examples/issues/816#issuecomment-2173341375). Or you could use MLX LM to run the fine-tuned model instead of using llama.cpp?
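To make the precision point concrete, here's a tiny sketch (made-up values, not taken from the fuse code) showing how a small update can be rounded away in float16 but survive in float32:

import mlx.core as mx

# A large base weight and a small LoRA-style update, as stand-ins.
w = mx.array([100.0], dtype=mx.float16)
delta = mx.array([0.01], dtype=mx.float16)

# In float16 the update is smaller than the spacing between representable
# values around 100, so it is rounded away entirely.
print((w + delta) - w)  # 0: the update was lost

# In float32 the same update survives the addition.
w32, d32 = w.astype(mx.float32), delta.astype(mx.float32)
print((w32 + d32) - w32)  # ~0.01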
This makes sense :) Thank you. Will have to look into that :)
Hey @Timelessprod, my question is a bit unrelated, so really sorry about that, but I just wanted to check if I am doing this right.
This is an example line of my jsonl dataset:
{"text": "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nThe question<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nThe answer<|end_of_text|>"}
And I am simply running this command in the terminal:
mlx_lm.lora \
  --train \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --data /Users/macstudiosct/projects/data_location \
  --batch-size 3 \
  --lora-layers -1 \
  --iters 10000 \
  --max-seq-length 8000
Please let me know if I am doing something wrong. I get okay results with this, but just adding --use-dora outputs rubbish.
It looks fine to me.
but just adding --use-dora outputs rubbish
What kind of rubbish? Does the loss go down or not really?
It looks fine to me.
but just adding --use-dora outputs rubbish
What kind of rubbish? Does the loss go down or not really?
Well, this is a bit embarrassing but I didn't record the loss properly and I had used an older version too.
I am training something else right now but will try with dora next and open an issue if I face the same problem. I just wanted to know if there is something to be considered while using dora or just --use-dora does the work
It goes without saying but thanks a lot for all your prompt help, @awni. You keep saving my beginner behind every time xD
I just wanted to know if there is something to be considered while using dora or just --use-dora does the work
It should work to do --use-dora, as in the training loss should go down. It may be possible to improve the results by twiddling with hyperparameters like learning rate, etc. But I would check that it's working first.
I'm not certain this is the problem so it would be good to validate it. But fusing can cause precision issues. In low precision, c = a + b can give very inexact results if a and b have very different magnitudes: for example, if a is big and b is small then c = a + b = a. In your case, if the adapters have small values and the original weight matrix has large values, then fusing can wipe out the adapters, effectively leaving you with the baseline model.
Now, I'm not sure that's happening. There are a couple of things you could do to check:
- Inspect the magnitudes of the weights and adapters
- Try fusing and running the model in higher precision (e.g. fp32) just as a test that it works.
- Using a larger scale sometimes (but not always) helps here also.
Another option is to avoid fusing entirely. There may be a way to run unfused models with llama.cpp (see #816 (comment)). Or you could use MLX LM to run the fine-tuned model instead of using llama.cpp?
Hello, thanks for the reply.
Do you recommend any tool to look at the weights and adapters, since the files are binary?
Isn't the quantization supposed to avoid those precision and magnitude problems? Why would it only happen at fuse time and not at inference when using the base model + adapters? I already tried to increase the scale from 8.0 to 20.0; would you recommend going even higher?
Also, due to current project constraints I need the model to be exported as a GGUF file (so I can easily use it with other tools like Ollama); that's why I'm doing the fuse here. I'll try to find another solution to avoid fusing altogether.
Thank you very much for your help!
Here's my loss when running the original script with 13k training samples, 3k validation samples (random split) over 1000 iterations with learning rate of 1e-6, rank of 16 and scale of 60.0.
Even though it looks quite good, I get a lot of incorrect information in the answers.
Do you recommend any tool to look at the weights and adapters, since the files are binary?
You can load them using mx.load. That will give you a dictionary of keys -> arrays. Then you can inspect them however you like.
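For example, something along these lines (a rough sketch; the adapter file name is an assumption and differs between mlx-lm versions, e.g. adapters.safetensors vs. adapters.npz):

import mlx.core as mx

# Load the trained adapters (file name depends on your mlx-lm version).
adapters = mx.load("adapters/adapters.safetensors")

# Print a simple magnitude summary for each adapter matrix.
for name, a in adapters.items():
    print(f"{name}: max |value| = {mx.abs(a).max().item():.6f}")

# The base model weights can be inspected the same way, e.g. with
# mx.load("models/mlx/model.safetensors"). Note that a quantized model
# stores packed weights plus scales/biases, so those raw arrays are not
# directly comparable to the adapters without dequantizing first.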
I already tried to increase the scale from 8.0 to 20.0; would you recommend going even higher?
Yea you can try even higher. Or possibly train for longer at the 20.0 scale
Even though it looks quite good, I get a lot of incorrect information in the answers.
That's unexpected. Does it work without fusing? How does the loss compare to a model fine-tuned with a smaller scale?
I just wanted to know if there is something to be considered while using dora or just --use-dora does the work
It should work to do --use-dora, as in the training loss should go down. It may be possible to improve the results by twiddling with hyperparameters like learning rate, etc. But I would check that it's working first.
@awni I believe there is a bug in the DoRA implementation that could be causing this. When the DoRALinear instance is initialized, the magnitude self.m is computed from the weights of a randomly initialized Linear layer instead of the weights of the adapted linear layer:
https://github.com/ml-explore/mlx-examples/blob/85dc76f6e0f2cf3ee3d84c211868a6856e163f3f/llms/mlx_lm/tuner/dora.py#L64-L78
When the linear layer is replaced in from_linear, the magnitude self.m isn't recomputed from the new weights:
https://github.com/ml-explore/mlx-examples/blob/85dc76f6e0f2cf3ee3d84c211868a6856e163f3f/llms/mlx_lm/tuner/dora.py#L21-L29
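To illustrate what I mean, here is a minimal sketch (simplified, not the actual mlx-examples code; I'm assuming the magnitude is the per-output-row norm of the weight matrix, as in the DoRA paper):

import mlx.core as mx
import mlx.nn as nn

class DoRALinearSketch(nn.Module):
    # Simplified stand-in for DoRALinear, only showing the magnitude handling.
    def __init__(self, input_dims: int, output_dims: int):
        super().__init__()
        self.linear = nn.Linear(input_dims, output_dims, bias=False)
        # Bug pattern: m is computed from freshly initialized (random) weights.
        self.m = mx.linalg.norm(self.linear.weight, axis=1)

    @classmethod
    def from_linear(cls, linear: nn.Linear):
        output_dims, input_dims = linear.weight.shape
        dora = cls(input_dims, output_dims)
        dora.linear = linear
        # Proposed fix: recompute the magnitude from the real pretrained
        # weights after swapping in the adapted linear layer.
        dora.m = mx.linalg.norm(linear.weight, axis=1)
        return dora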