
Fix a few issues with the dataset

gururise opened this issue • 6 comments

Since the training dataset was generated with GPT-3, I noticed several issues when going through it. I have manually fixed the following:

  • Resolved empty outputs
  • Added a few CoT examples
  • Fixed a few empty code examples
  • Removed instructions asking to generate images
  • Resolved N/A outputs
  • Made empty inputs consistent (some used "N/A", others "None")
  • Fixed a few wrong answers

Hoping this slightly curated dataset will help produce better training results.
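
A minimal sketch of how these cases can be flagged programmatically, assuming the repo's alpaca_data.json with instruction/input/output fields (the set of "empty input" spellings below is a guess, not exhaustive):

```python
import json
from collections import Counter

with open("alpaca_data.json", encoding="utf-8") as f:
    data = json.load(f)

# Outputs that are blank or placeholder "N/A"-style answers.
empty_outputs = [ex for ex in data if not ex["output"].strip()]
na_outputs = [ex for ex in data if ex["output"].strip().lower() in {"n/a", "none"}]

# Tally the distinct "empty input" spellings so they can be normalized
# to a single form (this variant set is a guess).
EMPTY_VARIANTS = {"", "n/a", "none", "noinput", "<noinput>"}
input_variants = Counter(
    ex["input"].strip().lower()
    for ex in data
    if ex["input"].strip().lower() in EMPTY_VARIANTS
)

print(f"empty outputs: {len(empty_outputs)}")
print(f"N/A outputs: {len(na_outputs)}")
print(f"empty-input spellings: {dict(input_variants)}")
```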

gururise avatar Mar 17 '23 01:03 gururise

Very interesting — I hadn't realized there were so many holes in the data. Fixing them could improve the model quality significantly. Out of curiosity, how many examples did you view and was there any method to your approach?

tloen avatar Mar 17 '23 02:03 tloen

  • Fixed a few more issues.
  • Put the "visualization" tasks back with a standard response: "As a large language model, I am unable to generate visual data."

I noticed there are several tasks that expect the LLM to use data from URLs, many of which don't even exist. I've substituted equivalent data where available.
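
A quick way to surface these is to scan instructions and inputs for URLs. A rough sketch (the regex is deliberately simple, not a full URL matcher):

```python
import json
import re

URL_RE = re.compile(r"https?://\S+")

with open("alpaca_data.json", encoding="utf-8") as f:
    data = json.load(f)

# Collect every example whose instruction or input references a URL,
# so each one can be checked by hand for dead or hallucinated links.
flagged = [
    ex for ex in data
    if URL_RE.search(ex["instruction"]) or URL_RE.search(ex.get("input", ""))
]
print(f"{len(flagged)} examples reference a URL")
```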

gururise avatar Mar 17 '23 05:03 gururise

> Very interesting — I hadn't realized there were so many holes in the data. Fixing them could improve the model quality significantly. Out of curiosity, how many examples did you view and was there any method to your approach?

I only gave a cursory look and fixed the very obvious issues (i.e. inconsistent empty inputs, obviously wrong answers, blank outputs, etc.). I probably went through a few hundred examples manually.

I think I got most of the low-hanging fruit via searching for empty inputs and blank outputs. I did notice there are many instructions asking the LLM to reference online data to answer a question. These should probably be addressed in some manner.

gururise avatar Mar 17 '23 05:03 gururise

I'm not sure if this is the right place to ask, but I was thinking of crowdsourcing updates to each response in the training dataset, with functions to approve and review each line.

niclimcy avatar Mar 17 '23 08:03 niclimcy

I could contribute a simple system to accept/decline/upsert the entries.

(Imagine each card in this kanban board being one instruction -> answer pair.)

[screenshot: kanban board with one card per instruction -> answer pair]

Instead of a category, it would be a free-form text field containing the data from the original dataset, which a reviewer can edit.

[screenshot: card detail with a free-form, editable text field]

chris-aeviator avatar Mar 17 '23 08:03 chris-aeviator

Instead of providing generic answers like "As a large language model, I am unable to...", we could introduce a standardized set of tools that could improve the accuracy of certain types of responses, such as calculations, image generation, or code compilation. The model would propose tool calls and use their output instead of relying solely on its internal capabilities (which could be a significant limitation given the model size).

One can still detect the tool usage and replace it with a generic answer if necessary.
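
As a rough illustration of that detect-and-replace idea (the [CALC: ...] marker syntax below is entirely hypothetical, not an existing tool-use convention):

```python
import re

# Matches a hypothetical calculator tool call like "[CALC: 17 * 23]".
CALC_RE = re.compile(r"\[CALC:\s*([0-9+\-*/(). ]+)\]")

def resolve_tools(text: str, tools_enabled: bool = True) -> str:
    def run(match: re.Match) -> str:
        if not tools_enabled:
            # Fall back to a generic answer when the tool is unavailable.
            return "As a language model, I am unable to compute this."
        # Toy evaluator: the regex restricts input to arithmetic characters,
        # but eval should still not be used on untrusted text in production.
        return str(eval(match.group(1)))
    return CALC_RE.sub(run, text)

print(resolve_tools("The total is [CALC: 17 * 23]."))  # The total is 391.
```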

zkenda avatar Mar 17 '23 11:03 zkenda

To assist with this, I made an embedding space explorer (running the data through a transformer) for visualizing the instructions and outputs.

Training Data Instructions Latent Space: https://atlas.nomic.ai/map/alpaca_instructions
Training Data Outputs: https://atlas.nomic.ai/map/alpaca_outputs

For example, here is a link to a bunch of bad data points in the outputs: https://atlas.nomic.ai/map/d2139cc3-bc1c-441c-8d6f-3e6ffbbc2eda/838019ff-8fe2-42ba-809a-d86d2b98cd50/-18.11668742841587/-11.348087116836096/-20.88850316347706/-17.680468640801223/774455612
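
For a rough local approximation of such a map (Atlas builds these server-side, so the model choice and UMAP projection here are assumptions, not Atlas's actual pipeline):

```python
import json

import umap  # pip install umap-learn
from sentence_transformers import SentenceTransformer

with open("alpaca_data.json", encoding="utf-8") as f:
    data = json.load(f)

# Embed every output with a small sentence transformer.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode([ex["output"] for ex in data], show_progress_bar=True)

# Project to 2D so clusters of near-duplicate or degenerate outputs
# (e.g. "As an AI language model...") become visible when plotted.
coords = umap.UMAP(n_neighbors=15, min_dist=0.1).fit_transform(embeddings)
```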

AndriyMulyar avatar Mar 17 '23 18:03 AndriyMulyar

The original Stanford dataset is full of mistakes and holes. Another large issue I found was that many of the instructions hallucinated references to article URLs.

I made a best-effort first pass through the dataset to clean it up:

  • Resolved empty outputs
  • Resolved empty inputs ("no input", "", "n/a", etc.) for consistency
  • Added several CoT examples (from Google's FLAN paper)
  • Fixed a few empty code examples
  • Instructions to generate audio or images now default to a message stating that, as an LLM, it is unable to do this
  • Resolved N/A outputs
  • Fixed a few wrong answers
  • Did my best to either insert the actual text for URLs referring to articles or replace them with an alternate instruction
  • Removed several instructions asking the LLM to pull data from the internet
  • Removed extraneous escape/control characters from some answers

The patched dataset is much more consistent and no longer assumes the LLM can access the internet or view/generate visual data. It also now includes a few CoT training examples. I'd be interested to see how training goes on this updated dataset.

gururise avatar Mar 17 '23 19:03 gururise

I spent some time thinking about how to crowdsource dataset cleaning with minimal tooling. One way to do this is to create a separate repo with the following structure:

  • stanford_dataset.jsonl: a copy of the Alpaca dataset augmented with an id field for identification across versions
  • reviews: a folder of human-submitted data reviews
  • clean.py: a script or web interface that randomly samples unreviewed data points from stanford_dataset.jsonl for reviewing, then writes the edited or approved example to a new jsonl file in reviews
  • combine.py: a script that applies all the changes in reviews to the original dataset and outputs a new cleaned_dataset.jsonl (a sketch of this merge step follows below).
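
A minimal sketch of what combine.py could look like; the review schema (an action field with approve/edit/reject values) is an assumption, since nothing here pins one down:

```python
import json
from pathlib import Path

def load_jsonl(path):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def main():
    # Index the original dataset by the id field added for cross-version tracking.
    dataset = {row["id"]: row for row in load_jsonl("stanford_dataset.jsonl")}

    for review_file in sorted(Path("reviews").glob("*.jsonl")):
        for review in load_jsonl(review_file):
            if review["action"] == "reject":
                dataset.pop(review["id"], None)  # drop bad examples entirely
            elif review["action"] == "edit" and review["id"] in dataset:
                dataset[review["id"]].update(
                    {k: review[k] for k in ("instruction", "input", "output") if k in review}
                )
            # "approve" leaves the original example untouched.

    with open("cleaned_dataset.jsonl", "w", encoding="utf-8") as f:
        for row in dataset.values():
            f.write(json.dumps(row, ensure_ascii=False) + "\n")

if __name__ == "__main__":
    main()
```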

I suppose the utility of such an approach would depend on how many bad data points remain. In the meantime, I'll review the changes made so far and save a new "cleaned" dataset alongside the existing one.

tloen avatar Mar 17 '23 20:03 tloen

Would the dataset benefit from multi-turn prompt:response chains rather than just a single prompt -> response pair? i.e. Question -> Answer -> Follow-up Q -> Follow-up A
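
For concreteness, a multi-turn record might look something like this (the schema is purely hypothetical; the current Alpaca format has no notion of turns):

```python
# Hypothetical multi-turn record; field names are illustrative only.
example = {
    "instruction": "What is the capital of France?",
    "output": "The capital of France is Paris.",
    "followups": [
        {
            "instruction": "What is its population?",
            "output": "Paris proper has a population of roughly 2.1 million.",
        },
    ],
}
```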

teknium1 avatar Mar 17 '23 20:03 teknium1

> Would the dataset benefit from multi-turn prompt:response chains rather than just a single prompt -> response pair? i.e. Question -> Answer -> Follow-up Q -> Follow-up A

That's a lot of work to build. I'd hold out for that 22k dataset that LAION used to train SFT-1.

tloen avatar Mar 17 '23 20:03 tloen

Folded into f7044049ab7916d38cbc84d16b323a8ecb1c6b58. Thanks for your work!

tloen avatar Mar 17 '23 22:03 tloen

Looks like this just closed as I was typing, but there is a typo not too far into the file which I'm not sure is intentional or not.

"construciton" instead of "construction"

https://github.com/tloen/alpaca-lora/blob/main/alpaca_data_cleaned.json#L23

spAnser avatar Mar 17 '23 22:03 spAnser

8aecde83cd01f3c9a1a9526f057fb90018ee9cc9

tloen avatar Mar 17 '23 22:03 tloen

Although honestly we might want to leave typos in the instructions.

tloen avatar Mar 17 '23 22:03 tloen

Yeah it might be worth it idk.

spAnser avatar Mar 17 '23 22:03 spAnser

For prompts it seems like a good idea to keep typos.

teknium1 avatar Mar 17 '23 23:03 teknium1

People should really support LAION's open-assistant.io project: every person helping there will improve a fully curated, crowd-sourced, open-source instruction fine-tuning dataset, which in turn can be used for Alpaca fine-tuning.

underlines avatar Mar 19 '23 18:03 underlines

FYI, the dataset cleaning is ongoing. The latest cleaned dataset can be accessed here.

gururise avatar Mar 22 '23 18:03 gururise

> Instead of providing generic answers like "As a large language model, I am unable to...", we could introduce a standardized set of tools

Good idea. Meta is already working on this with Toolformer, and there are a few other efforts too, for example getting the model to control a web browser. They help, but not as much as you would expect at the moment (red is baseline, blue is with a calculator). Since it's a WIP, I would guess it's outside the scope of this repo for now.

[plot: red is baseline, blue is with a calculator]

wassname avatar Mar 22 '23 23:03 wassname