
Need more documentation on using a custom dataset for fine-tuning with LoRA

Open · chimezie opened this issue 8 months ago · 20 comments

The Custom Data section of the LoRA README has helpful information about how to specify a subdirectory where train.jsonl, valid.jsonl, and test.jsonl files are expected. The dataset in data/ has JSON entries of the following form:

{"text": "table: 1-1000181-1\ncolumns: State/territory, Text/background colour, Format, Current slogan, Current series, Notes\nQ: Tell me what the notes are for South Australia \nA: SELECT Notes FROM 1-1000181-1 WHERE Current slogan = 'SOUTH AUSTRALIA'"}

Looking at lora.py, I can see how to specify the JSON key (which defaults to text), but it is not clear whether the instruction prompt for the custom data we provide must be in that format. Can we use a custom dataset with the Mistral model (for example) and use the Mistral prompt format to LoRA-train it? For example:

{"text": "<s>[INST] .. instruction .. [/INST] .. response .. </s>'"}

Or are there assumptions in lora.py that would break if we don't use the format below (from the data/ subdirectory)?

{"text": ".. context ..  \nQ: .. instruction .. \nA: .. output ..'"}

Looking at lora.py, and the loss function in particular, it is unclear whether we must stick with the prompt format in the example, whether the training method is supervised or unsupervised, and so on.

If this module is intended for general-purpose use, some additional documentation addressing these questions would greatly help in using it (or extending it) for LoRA training on various models, prompt formats, and datasets (for example, training on just completions, i.e. continued pretraining).
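For context, the underlying question is whether the loss assumes any prompt structure at all. Below is a rough MLX sketch of a plain next-token language-modeling loss over the whole text string; this is illustrative code written for this discussion, not necessarily the exact implementation in lora.py, and the model, tokens, and lengths arguments are hypothetical names.

    import mlx.core as mx
    import mlx.nn as nn

    def lm_loss(model, tokens, lengths):
        # tokens: (batch, seq) padded token ids; lengths: (batch,) true lengths.
        # Shift by one so the model predicts token t+1 from tokens 0..t.
        inputs, targets = tokens[:, :-1], tokens[:, 1:]
        logits = model(inputs)
        ce = nn.losses.cross_entropy(logits, targets, reduction="none")
        # Mask padding positions so they do not contribute to the loss.
        mask = mx.arange(targets.shape[1])[None, :] < (lengths[:, None] - 1)
        mask = mask.astype(ce.dtype)
        return (ce * mask).sum() / mask.sum()

A loss of this shape treats the entire string as one sequence to continue (unsupervised next-token prediction), so it does not itself care about Q:/A: markers or [INST] tags.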

chimezie avatar Dec 28 '23 02:12 chimezie

The custom dataset format requires a more detailed explanation.

lufred8341 avatar Dec 28 '23 17:12 lufred8341

I tried a fine-tune just now and got no results: the model + adapter gives the exact same responses as the base model at inference. I tried this format, but I have no idea if it was correct: {"text": "Q: What were the five measures taken by Caius Gracchus to...?\nA: Caius Gracchus implemented ..."}

USMCM1A1 avatar Dec 28 '23 19:12 USMCM1A1

Was the model actually training? Can you add the log here?

awni avatar Dec 28 '23 19:12 awni

The format of the text doesn't really matter. What matters more is that you are consistent in your dataset. So if you use a special keyword in the training set, make sure you use it in the test set and when generating text as well.

The only exception is that you shouldn't include <s> and </s>; that is already done by the tokenizer.
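For example, a minimal consistent setup could look like the sketch below, using the [INST] markers from the question above (the content and paths are illustrative, and the template itself is a choice, not something lora.py requires). Every line of train.jsonl, valid.jsonl, and test.jsonl is one JSON object with the same template and no <s> or </s>:

    {"text": "[INST] Who wrote De Rerum Natura? [/INST] The Roman poet Lucretius wrote it."}
    {"text": "[INST] Summarize Caius Gracchus' land reforms. [/INST] He renewed his brother's agrarian legislation ..."}

and generation then uses the same template:

    python lora.py --model <path-to-mlx-model> --adapter-file adapters.npz \
        --prompt "[INST] Who wrote De Rerum Natura? [/INST]"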

awni avatar Dec 28 '23 19:12 awni

Was the model actually training? Can you add the log here?

Where or how do I get the log file for LoRA training?

gladjoyhub avatar Dec 29 '23 16:12 gladjoyhub

I got this from LoRA training:

    Iter 1: Val loss 2.929, Val took 1494.344s
    libc++abi: terminating due to uncaught exception of type std::runtime_error: [malloc_or_wait] Unable to allocate 4352 bytes.

But the first time I ran LoRA training, the command prompt was showing training progress and the GPU was working hard, so I left the screen. When I came back, I found the terminal trying to run each line of the train.jsonl file. It's a Mac Studio M1 Max with 32GB.

gladjoyhub avatar Dec 29 '23 16:12 gladjoyhub

What model are you training? It looks like you ran out of memory...

By the log, I just meant the output of the python lora.py ....

awni avatar Dec 29 '23 17:12 awni

Was the model actually training? Can you add the log here?

Thanks for responding and also for being patient with me (English PhD here fumbling around and kind of lost).

The model I used was mistral-7B-v0.1. After running the torch-to-MLX conversion commands, I did a LoRA fine-tune. Log below:

    python lora.py --model /Users/me/mlx-examples/lora/mistral-7b-v01-mlx \
        --train \
        --iters 384

    Loading pretrained model
    Total parameters 7243.436M
    Trainable parameters 1.704M
    Loading datasets
    Training
    Iter 1: Val loss 1.777, Val took 101.933s
    Iter 10: Train loss 1.741, It/sec 0.162, Tokens/sec 235.847
    Iter 20: Train loss 1.788, It/sec 0.147, Tokens/sec 213.199
    Iter 30: Train loss 1.802, It/sec 0.149, Tokens/sec 202.434
    Iter 40: Train loss 1.808, It/sec 0.148, Tokens/sec 201.778
    Iter 50: Train loss 1.783, It/sec 0.151, Tokens/sec 205.573
    Iter 60: Train loss 1.747, It/sec 0.151, Tokens/sec 208.956
    Iter 70: Train loss 1.740, It/sec 0.150, Tokens/sec 214.980
    Iter 80: Train loss 1.759, It/sec 0.150, Tokens/sec 206.417
    Iter 90: Train loss 1.769, It/sec 0.152, Tokens/sec 219.480
    Iter 100: Train loss 1.763, It/sec 0.149, Tokens/sec 219.803
    Iter 110: Train loss 1.661, It/sec 0.152, Tokens/sec 211.097
    Iter 120: Train loss 1.699, It/sec 0.150, Tokens/sec 215.326
    Iter 130: Train loss 1.752, It/sec 0.150, Tokens/sec 210.700
    Iter 140: Train loss 1.668, It/sec 0.157, Tokens/sec 222.955
    Iter 150: Train loss 1.693, It/sec 0.149, Tokens/sec 205.185
    Iter 160: Train loss 1.660, It/sec 0.149, Tokens/sec 211.944
    Iter 170: Train loss 1.767, It/sec 0.150, Tokens/sec 216.436
    Iter 180: Train loss 1.686, It/sec 0.150, Tokens/sec 205.317
    Iter 190: Train loss 1.671, It/sec 0.153, Tokens/sec 216.670
    Iter 200: Train loss 1.634, It/sec 0.148, Tokens/sec 221.625
    Iter 200: Val loss 1.610, Val took 102.038s
    Iter 210: Train loss 1.652, It/sec 0.150, Tokens/sec 213.808
    Iter 220: Train loss 1.623, It/sec 0.149, Tokens/sec 209.382
    Iter 230: Train loss 1.618, It/sec 0.147, Tokens/sec 222.131
    Iter 240: Train loss 1.665, It/sec 0.152, Tokens/sec 210.733
    Iter 250: Train loss 1.651, It/sec 0.147, Tokens/sec 223.296
    Iter 260: Train loss 1.672, It/sec 0.150, Tokens/sec 220.985
    Iter 270: Train loss 1.646, It/sec 0.146, Tokens/sec 213.665
    Iter 280: Train loss 1.638, It/sec 0.150, Tokens/sec 223.153
    Iter 290: Train loss 1.604, It/sec 0.149, Tokens/sec 205.401
    Iter 300: Train loss 1.608, It/sec 0.149, Tokens/sec 213.435
    Iter 310: Train loss 1.666, It/sec 0.151, Tokens/sec 211.191
    Iter 320: Train loss 1.664, It/sec 0.152, Tokens/sec 215.018
    Iter 330: Train loss 1.681, It/sec 0.153, Tokens/sec 219.760
    Iter 340: Train loss 1.508, It/sec 0.150, Tokens/sec 221.116
    Iter 350: Train loss 1.644, It/sec 0.151, Tokens/sec 210.450
    Iter 360: Train loss 1.644, It/sec 0.152, Tokens/sec 219.091
    Iter 370: Train loss 1.573, It/sec 0.152, Tokens/sec 210.283
    Iter 380: Train loss 1.666, It/sec 0.139, Tokens/sec 206.593

This is how I called the LoRA adapters at inference:

    python lora.py --model /Users/me/mlx-examples/lora/mistral-7b-v01-mlx \
        --adapter-file /Users/me/mlx-examples/lora/adapters.npz \
        --num-tokens 250 \
        --prompt "Q: What was Augustus' greatest achievement? A: "

    Loading pretrained model
    Total parameters 7243.436M
    Trainable parameters 1.704M
    Loading datasets
    Generating

USMCM1A1 avatar Dec 29 '23 18:12 USMCM1A1

Thanks. It looks like it is training very, very slowly (the loss didn't change much over the 300+ iterations), not sure why. The train loss should go down faster. Were you using a custom dataset for this?

awni avatar Dec 29 '23 18:12 awni

What model are you training? It looks like you ran out of memory...

By the log, I just meant the output of the python lora.py ....

The model was mistral-7B-v0.1 from Hugging Face (the MLX version); RAM used was 24GB at most out of 32GB. But in train.jsonl, there is a line that's very long, like 2000 words.

gladjoyhub avatar Dec 29 '23 18:12 gladjoyhub

But in train.jsonl, there is a line that's very long, like 2000 words

Is that the default train.jsonl or a custom one? You should split those lines, otherwise they will consume a ton of memory. See the section on reducing memory use.
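If it helps, here is a rough sketch of one way to split over-long records before training; the word-based chunking and the 300-word limit are illustrative choices for this discussion, not anything built into lora.py.

    import json

    MAX_WORDS = 300  # illustrative limit; pick what fits your memory budget

    def split_text(text, max_words=MAX_WORDS):
        # Naive word-based chunking; splitting on sentence or paragraph
        # boundaries instead would keep each chunk more coherent.
        words = text.split()
        for i in range(0, len(words), max_words):
            yield " ".join(words[i:i + max_words])

    with open("train.jsonl") as src, open("train_split.jsonl", "w") as dst:
        for line in src:
            record = json.loads(line)
            for chunk in split_text(record["text"]):
                dst.write(json.dumps({"text": chunk}) + "\n")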

awni avatar Dec 29 '23 18:12 awni

Thanks, looks like it is training very very slowly (the loss didn't change much over the 300+ iterations), not sure why. The train loss should go down faster. Were you using a custom dataset for this?

Yes, I replaced the default with a custom train.jsonl file of Q/A pairs in this format:

{"text": "Q: "According to Lucretius, what is the role of religion in relation to the many evils it has persuaded men to commit, as stated in the line 'Tantum religio potuit suadere malorum'?"\nA: According to Lucretius, religion has played a significant role in leading humans to commit many evils, as expressed in the line 'Tantum religio potuit suadere malorum.' In his view, religion has caused humans to engage in harmful practices such as human sacrifices and offering hecatombs to gods that are modeled after human greed. He criticizes the fear instilled by religion, including the fear of lightning, thunder, death, hell, and other subterranean horrors depicted in Etruscan art and Oriental mysteries.\n\nLucretius attributes this negative influence of religion to the misconception that gods require sacrificial rituals and bloodshed from humans. Instead, he argues that true piety lies in a peaceful state of mind, which is not achieved through frequenting religious places or performing ritualistic acts, but rather by contemplating beauty, fostering friendships, and maintaining peace.\n\nIn Lucretius' perspective, the gods exist but reside far away from human concerns, indifferent to our sacrifices and prayers. They are not involved in the creation of the world or the causation of events, and he questions the notion that anyone would want to attribute such responsibilities to them."} {"text": "Q: What were the five measures taken by Caius Gracchus to garner support from the peasantry, army, proletariat, and businessmen, and how did these actions contribute to his political power in Rome?\nA: Caius Gracchus implemented five measures to garner support from different groups in Rome:\n\n1. Peasantry: To gain the support of the peasantry, Caius renewed the agrarian legislation of his brother, extending its application to state-owned land in the provinces. He restored the land board and personally attended to its operations. This helped him secure the backing of the peasantry by addressing their concerns about land ownership and farming rights."}

I had meant to use a batch size of 6 but clipped it off by accident. I'll add it and try again. Thanks.

USMCM1A1 avatar Dec 29 '23 18:12 USMCM1A1

What model are you training? It looks like you ran out of memory... By the log, I just meant the output of the python lora.py ....

The model was mistral-7B-v0.1 from Hugging Face (the MLX version); RAM used was 24GB at most out of 32GB. But in train.jsonl, there is a line that's very long, like 2000 words.

Weird, I just went into the train.jsonl file and there's no particularly long line. The longest is 337 words.

USMCM1A1 avatar Dec 29 '23 19:12 USMCM1A1

Is that the default train.jsonl or a custom one? You should split those lines, otherwise they will consume a ton of memory. See the section on reducing memory use.

By split them, do you mean split them into separate training data records? i.e.

{"text": ".. context ..  \nQ: .. instruction .. \nA: .. output portion 1 ..'"},
{"text": ".. context ..  \nQ: .. instruction .. \nA: .. output portion 2 ..'"},
{"text": ".. context ..  \nQ: .. instruction .. \nA: .. output portion N ..'"},

or do you mean split them within each training text string by inserting newline characters? i.e.

{"text": ".. context ..  \nQ: .. instruction .. \nA: .. output portion 1 .. \n .. output portion 2 .. output portion N '"},

chimezie avatar Dec 29 '23 19:12 chimezie

Oh fudge, should I have put a comma in between each record?

{"text": ".. context .. \nQ: .. instruction .. \nA: .. output portion 1 ..'"}, {"text": ".. context .. \nQ: .. instruction .. \nA: .. output portion 2 ..'"}, {"text": ".. context .. \nQ: .. instruction .. \nA: .. output portion N ..'"},

USMCM1A1 avatar Dec 29 '23 19:12 USMCM1A1

By split them, do you mean split them into separate training data records? i.e.

Like you have it in the first case. Split them into different lines in the JSONL file, each with its own text key.

awni avatar Dec 29 '23 19:12 awni

Oh fudge should I have put a comma in between each record?

No actually, I think it will break if they have commas.
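To spell it out, a valid train.jsonl is just one standalone JSON object per line, with no commas between lines and no surrounding array, e.g.:

    {"text": ".. context ..  \nQ: .. instruction .. \nA: .. output portion 1 .."}
    {"text": ".. context ..  \nQ: .. instruction .. \nA: .. output portion 2 .."}
    {"text": ".. context ..  \nQ: .. instruction .. \nA: .. output portion N .."}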

awni avatar Dec 29 '23 19:12 awni

But in train.jsonl, there is a line that's very long, like 2000 words

Is that the default train.jsonl or a custom one? You should split those lines, otherwise they will consume a ton of memory. See the section on reducing memory use.

It's a custom train.jsonl. Yes, your link solved the problem: --batch-size 1 --lora-layers 4
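For reference, a full training invocation with those memory-reducing flags looks roughly like this (the model path is illustrative):

    python lora.py --model <path-to-mlx-model> \
        --train \
        --batch-size 1 \
        --lora-layers 4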

Thanks Awni!

gladjoyhub avatar Dec 30 '23 06:12 gladjoyhub

<s> and </s>

But there are no <s> and </s> tokens in the official example.

592319702 avatar Feb 28 '24 06:02 592319702

<s> and </s> are special tokens that will be automatically included by the tokenizer during encoding: https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizer.encode.add_special_tokens
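A quick way to confirm what your particular tokenizer adds is to encode a sample and inspect the tokens; this sketch uses the standard Hugging Face transformers API, and the model id is just an example.

    from transformers import AutoTokenizer

    # Substitute the tokenizer you are actually fine-tuning with.
    tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

    ids = tok.encode("Q: example question\nA: example answer", add_special_tokens=True)
    print(tok.convert_ids_to_tokens(ids))
    # Mistral/Llama-style tokenizers typically prepend <s> here; whether </s>
    # is appended depends on the tokenizer's add_eos_token setting.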

mzbac avatar Feb 28 '24 06:02 mzbac