Bad dataset

Open abacaj opened this issue 2 years ago • 18 comments

If anyone is curious, here is my run on the Alpaca dataset using another decoder model (codegen-16B-nl). The dataset doesn't appear to be diverse; many answers are closely related. I believe a model trained on this dataset will not generalize well to new data.

The original Alpaca training script computes the loss on the label only, following a pattern similar to the one used in OPT-IML.

[image]

My run on codegen-16B-nl

[image: training loss curve, codegen-16B-nl]

Another user's run on LLaMA 7B

[image: training loss curve, LLaMA 7B]

Some more discussion: https://twitter.com/abacaj/status/1637310768780648448

abacaj avatar Mar 19 '23 06:03 abacaj

Hmm, I wonder if we could get better results using a higher-quality dataset like https://github.com/allenai/natural-instructions/tree/master/splits/default

Wingie avatar Mar 19 '23 08:03 Wingie

Yeah, I'm currently playing with LAION OIG small-chip2, but I'm really looking forward to the result of their Open-Assistant project, which will create a fully human-generated, refined, and curated instruction dataset.

kooshi avatar Mar 19 '23 13:03 kooshi

The dataset is the problem, I guess. Here are just two examples of input/output pairs that teach the model nothing but to hallucinate.

{
    "instruction": "Retrieve the last three transactions from the customer's account.",
    "input": "Customer ID: 12333",
    "output": "Transaction 1: USD"
},

{
    "instruction": "Transcribe the recording into text.",
    "input": "Recording about the history of Waterloo",
    "output": "The Battle of Waterloo, fought on June 18, 1815, was the last major battle of the Napoleonic Wars. It marked the end of Napoleon Bonaparte's rule. The battle was fought between Napoleon Bonaparte's French forces and an allied force of British, Dutch, Belgian and Prussian troops, led by the Duke of Wellington. It was fought near the town of Waterloo in the former Austrian Netherlands, now present-day Belgium. The battle was a defeat for Napoleon's forces and ended Napoleon's rule over France....
}

I'm sure there are a lot of bad instructions like these in there, which cause more problems than they contribute to proper learning.

baleksey avatar Mar 19 '23 15:03 baleksey

There is definitely a lot of bad data in the dataset. I went through and manually cleaned hundreds of issues resulting in the current "cleaned" dataset, but there are likely many more issues I missed.

I came across and fixed many examples that reference data on the internet, which would likely just cause hallucinations, for example:

{
   "instruction": "Describe what is shown in this painting in one sentence",
   "input": "https://works.bepress.com/jing-chen/3/download/",
   "output": "This painting depicts a mother embracing her child in a loving embrace, surrounded by a peaceful and tranquil atmosphere."
}

There were hundreds of similar examples in the original dataset.
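
For anyone who wants to scan for this class of problem themselves, here's a rough sketch (the file name and regex are illustrative assumptions, not part of any cleaning script in this repo):

import json
import re

# Flag examples whose instruction or input points at a URL, since the model
# cannot fetch the page and will likely hallucinate an answer.
URL_PATTERN = re.compile(r"https?://\S+")

with open("alpaca_data.json", encoding="utf-8") as f:
    examples = json.load(f)

suspect = [
    ex for ex in examples
    if URL_PATTERN.search(ex.get("instruction", "")) or URL_PATTERN.search(ex.get("input", ""))
]
print(f"{len(suspect)} of {len(examples)} examples reference a URL")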

@abacaj Are the training results you posted from the original dataset or the cleaned dataset?

gururise avatar Mar 19 '23 16:03 gururise

The runs posted are from the original dataset; unfortunately, I only discovered the cleaned dataset after posting on Twitter. I can probably do a run with a smaller model to see if it improves the loss. I personally don't know whether removing what appear to be bad samples will help.

To me it seems like there are very similar answers in the labels of the dataset (which are used to compute the loss), which could lead to overfitting and reduce the ability to generalize. Here are three examples I found. Maybe this isn't an issue, but it appears to contribute to the flat loss, i.e. the dataset is not diverse:

{
    "instruction": "Generate a list of 5 American cities with populations higher than 1 million.",
    "input": "",
    "output": "New York City, Los Angeles, Chicago, Houston, Phoenix."
},

{
    "instruction": "Name 5 US cities with populations over 1 million.",
    "input": "",
    "output": "The 5 US cities with populations over 1 million are: New York City, Los Angeles, Chicago, Houston, and Phoenix."
},

{
    "instruction": "Generate a list of four US cities with over one million population.",
    "input": "",
    "output": "New York City, Los Angeles, Chicago, Houston."
},
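
One rough way to surface near-duplicates like these is a simple string-similarity pass. A sketch (the threshold and file name are assumptions; a real pass over all 52K examples would want embeddings or MinHash rather than a quadratic scan):

import json
from difflib import SequenceMatcher

with open("alpaca_data.json", encoding="utf-8") as f:
    examples = json.load(f)

instructions = [ex["instruction"].lower() for ex in examples]

# Compare each instruction against earlier ones and record close matches.
# O(n^2), so sample or bucket by keywords first for the full dataset.
near_duplicates = []
for i, current in enumerate(instructions):
    for j in range(i):
        if SequenceMatcher(None, current, instructions[j]).ratio() > 0.85:
            near_duplicates.append((j, i))
            break

print(f"{len(near_duplicates)} instructions have a close earlier match")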

abacaj avatar Mar 19 '23 16:03 abacaj

Following our discussion on Twitter, here is a screenshot of my current alpaca-lora training run (losses are a bit higher because I'm masking out the instruction in the loss):

[screenshot: training loss curve]

I'm starting to drift towards the idea that we should probably abandon the Alpaca dataset entirely once we get a suitable SFT dataset from the Open-Assistant project, or at least diversify the seed prompts in the original repo.

tloen avatar Mar 19 '23 19:03 tloen

Looks better. We could probably improve quality by filtering duplicate instruction/answer pairs out of the dataset and keeping only the best ones.

I'm curious how you did the masking, because I did something similar in my run by applying IGNORE_INDEX to the labels up to the instruction prompt length.
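
For reference, a minimal sketch of that style of masking (the tokenizer call and prompt construction are simplified assumptions, not the exact code from either run):

import copy

IGNORE_INDEX = -100  # ignored by the cross-entropy loss in HF causal LM training

def build_example(tokenizer, prompt, response, max_len=512):
    # Tokenize the instruction prompt and the full sequence separately so we
    # know how many leading tokens belong to the prompt.
    prompt_ids = tokenizer(prompt, truncation=True, max_length=max_len)["input_ids"]
    full_ids = tokenizer(prompt + response + tokenizer.eos_token,
                         truncation=True, max_length=max_len)["input_ids"]

    labels = copy.deepcopy(full_ids)
    # Mask the prompt portion so the loss is computed on the response only.
    labels[: len(prompt_ids)] = [IGNORE_INDEX] * len(prompt_ids)
    return {"input_ids": full_ids, "labels": labels}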

Just realized your loss is still a bit of a flatline like my previous run; I think validation loss will show that it is overfitting.

abacaj avatar Mar 19 '23 20:03 abacaj

Maybe tangentially related, but @tloen I'm curious why you might want to leave typos in the dataset (per https://github.com/tloen/alpaca-lora/pull/32#issuecomment-1474454667)

samching avatar Mar 20 '23 01:03 samching

Not my place to respond, but I would say leaving typos in the prompt teaches the model that a typo should be read as what it was meant to be, and to respond accordingly.

teknium1 avatar Mar 20 '23 04:03 teknium1

Makes sense to me as well for the prompt; the outputs in the dataset should aim to be correct.

abacaj avatar Mar 20 '23 05:03 abacaj

I agree with that for sure.

teknium1 avatar Mar 20 '23 10:03 teknium1

LAION's dataset can be found at https://github.com/LAION-AI/Anh/tree/main/data in case anyone wants to give it a try for training!

Wingie avatar Mar 20 '23 11:03 Wingie

Interesting, it looks like 100K lines of User: | Assistant: input/output pairs, pulled from different dataset sources. I wonder if this represents the latest from these efforts?
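
If the format really is one User:/Assistant: exchange per line, converting it into the instruction/input/output shape used here might look roughly like this (the file names and the exact field layout of the LAION files are assumptions on my part):

import json

def convert_line(line):
    # Assumes a single-turn "User: ... Assistant: ..." exchange per line.
    if "Assistant:" not in line:
        return None
    user_part, assistant_part = line.split("Assistant:", 1)
    return {
        "instruction": user_part.replace("User:", "", 1).strip(),
        "input": "",
        "output": assistant_part.strip(),
    }

with open("anh_data.txt", encoding="utf-8") as f:
    records = [r for r in (convert_line(line) for line in f) if r]

with open("anh_alpaca_format.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)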

samching avatar Mar 20 '23 21:03 samching

I started a new effort to try to clean up the current Alpaca dataset: https://github.com/gururise/AlpacaDataCleaned

gururise avatar Mar 21 '23 18:03 gururise

I am working on putting together a FLAN dataset as well to upload to the HF hub.

Training 7B and 13B LLaMA models on OIG at bf16 with no LoRA. Will have those out soon.
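
For context, a bare-bones version of that kind of full bf16 fine-tune (no LoRA) with the Hugging Face Trainer might look like the sketch below; the checkpoint path, data file, and hyperparameters are placeholders, not the actual configuration:

import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "huggyllama/llama-7b"  # placeholder LLaMA checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # LLaMA has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Placeholder: a local OIG jsonl file, assumed to contain a "text" field.
raw = load_dataset("json", data_files="oig_small_chip2.jsonl", split="train")
tokenized = raw.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=1024),
    remove_columns=raw.column_names,
)

args = TrainingArguments(
    output_dir="llama-7b-oig-bf16",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    num_train_epochs=2,
    bf16=True,  # full bf16 training, no LoRA adapters
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()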

conceptofmind avatar Mar 22 '23 15:03 conceptofmind

On the question of leaving typos in the dataset (per #32 (comment)):

My intuition is that we should keep the training data scoped and focused. Correct all typos in the training data that does not cover the skill of correcting misspellings, and create more training prompts (there are some already) specifically focused on the progression below (a hypothetical example follows the list):

  1. Identifying misspelled input
  2. Correcting the spelling from context
  3. Understanding the corrected input
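
Purely as an illustration (this record is hypothetical, not taken from the dataset), such a prompt could look like:

{
    "instruction": "Correct the spelling mistakes in the question, then answer it.",
    "input": "Whos was the frist presdient of the Unted States?",
    "output": "Corrected question: Who was the first president of the United States? Answer: George Washington was the first president of the United States."
}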

claysauruswrecks avatar Mar 24 '23 04:03 claysauruswrecks

I've opened #152 to start the process of vendoring datasets in other repos.

I went through all the history for alpaca_data_cleaned.json in this repo to make sure the big fixes were in the vendored submodule.

Next, I will go through and improve the training prompts in @gururise's repo.

claysauruswrecks avatar Mar 24 '23 07:03 claysauruswrecks

Uploaded these so far for FLAN:

https://huggingface.co/datasets/conceptofmind/flan_niv2_zsopt
https://huggingface.co/datasets/conceptofmind/flan_cot_fsopt
https://huggingface.co/datasets/conceptofmind/flan_cot_zsopt
https://huggingface.co/datasets/conceptofmind/flan_cot_submix
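
If anyone wants to pull these down, they should load directly with the Hugging Face datasets library; a sketch (the "train" split name is an assumption):

from datasets import load_dataset

# Repo ids as listed above; split name assumed to be the default "train".
flan_niv2 = load_dataset("conceptofmind/flan_niv2_zsopt", split="train")
flan_cot = load_dataset("conceptofmind/flan_cot_submix", split="train")

print(flan_niv2)
print(flan_cot[0])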

conceptofmind avatar Mar 25 '23 02:03 conceptofmind