Fix a few issues with the dataset
Since the training dataset was generated with GPT-3, it contains several issues that I noticed while going through it. I have manually fixed the following:
- Resolved empty outputs
- Added a few CoT examples
- Fixed a few empty code examples
- Removed instructions asking to generate images
- Resolved N/A outputs
- Made empty inputs consistent (some used N/A, others used None)
- Fixed a few wrong answers
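The empty-input normalization above can be sketched as a small script. This is a minimal sketch, assuming the standard Alpaca fields (`instruction`, `input`, `output`); the set of empty-input markers is illustrative, not an exhaustive list of what appears in the dataset.

```python
# Values used interchangeably to mean "no input" in the GPT-3-generated
# dataset (assumed variants; extend as needed).
EMPTY_MARKERS = {"", "n/a", "none", "<noinput>", "no input"}

def normalize_empty_inputs(examples):
    """Rewrite inconsistent empty-input markers to one canonical empty string."""
    for ex in examples:
        if ex.get("input", "").strip().lower() in EMPTY_MARKERS:
            ex["input"] = ""
    return examples

sample = [
    {"instruction": "Give three tips for staying healthy.", "input": "N/A", "output": "..."},
    {"instruction": "Summarize the text.", "input": "Some text.", "output": "..."},
]
normalize_empty_inputs(sample)
```

After this pass, downstream prompt templates only need to branch on `input == ""` rather than on every marker variant.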
 
Hoping this slightly curated dataset will help produce better training results.
Very interesting — I hadn't realized there were so many holes in the data. Fixing them could improve the model quality significantly. Out of curiosity, how many examples did you view and was there any method to your approach?
- Fixed a few more issues
- Put the "visualization" tasks back with a standard response: "As a large language model, I am unable to generate visual data."
 
Noticed there are several tasks that expect the LLM to use data from URLs, many of which don't even exist. I've replaced them with equivalent data when available.
> Very interesting — I hadn't realized there were so many holes in the data. Fixing them could improve the model quality significantly. Out of curiosity, how many examples did you view and was there any method to your approach?
I only gave it a cursory look and fixed the very obvious issues (i.e. inconsistent empty inputs, obviously wrong answers, blank outputs, etc.). I probably went through a few hundred examples manually.
I think I got most of the low-hanging fruit via searching for empty inputs and blank outputs. I did notice there are many instructions asking the LLM to reference online data to answer a question. These should probably be addressed in some manner.
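One way to surface the instructions that reference online data would be a simple heuristic scan. This is a sketch only, assuming the standard Alpaca fields; the hint phrases are illustrative and would need tuning against the real dataset.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)
# Phrases hinting that the task expects live internet access (illustrative list).
ONLINE_HINTS = ("look up", "search the web", "from the internet", "the following article")

def needs_internet(example):
    """Flag examples whose instruction or input references a URL or online lookup."""
    text = (example["instruction"] + " " + example.get("input", "")).lower()
    return bool(URL_RE.search(text)) or any(hint in text for hint in ONLINE_HINTS)

sample = [
    {"instruction": "Summarize the article at https://example.com/post", "input": ""},
    {"instruction": "Translate 'hello' into French.", "input": ""},
]
flags = [needs_internet(ex) for ex in sample]
print(flags)  # [True, False]
```

The flagged subset could then be queued for manual rewriting rather than fixed inline.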
I'm not sure if this is the right place to ask, but I was thinking of crowdsourcing updates to each response in the training dataset, with functions to approve and review each line.
I could contribute a simple system to accept/decline/upsert the entries.
(Imagine each card in this kanban board being one instruction -> answer pair.)

Instead of a category, it would be a free-form text field containing the data from the original dataset, which a reviewer can edit.

Instead of providing generic answers like "As a large language model, I am unable to..." we could introduce a standardized set of tools, that could potentially improve the accuracy of certain types of responses, such as calculations, image generation, or code compilation. The model should propose tools and use their output instead of relying solely on the language model's internal capabilities (which could be a big limitation considering the model size).
One could still detect the tool usage and replace it with a generic answer if necessary.
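The detect-and-replace fallback could look something like this. Note the tool-call syntax here is entirely hypothetical; a real dataset would first need a standardized format for tool calls.

```python
import re

# Hypothetical tool-call syntax, e.g. "[TOOL:render_image(a red cube)]";
# the actual format would need to be agreed on for the dataset.
TOOL_CALL_RE = re.compile(r"\[TOOL:(?P<name>\w+)\((?P<args>[^)]*)\)\]")

def replace_tool_calls(output, task="perform this task"):
    """Fall back to a generic answer when no tool runtime is available."""
    if TOOL_CALL_RE.search(output):
        return f"As a large language model, I am unable to {task}."
    return output

answer = replace_tool_calls("[TOOL:render_image(a red cube)]", task="generate visual data")
print(answer)  # As a large language model, I am unable to generate visual data.
```

Keeping the tool call in the raw data and rewriting it at export time preserves both options: train with tools, or train with the generic refusal.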
To assist with this, I made an embedding space explorer (running the data through a transformer) for visualizing the instructions and outputs.
Training Data Instructions Latent Space: https://atlas.nomic.ai/map/alpaca_instructions
Training Data Outputs: https://atlas.nomic.ai/map/alpaca_outputs
For example, here is a link to a bunch of bad data points in the outputs: https://atlas.nomic.ai/map/d2139cc3-bc1c-441c-8d6f-3e6ffbbc2eda/838019ff-8fe2-42ba-809a-d86d2b98cd50/-18.11668742841587/-11.348087116836096/-20.88850316347706/-17.680468640801223/774455612
The original Stanford dataset is full of mistakes and holes. Another large issue I found was that many of the instructions hallucinated references to article URLs.
I made my best effort first pass through the dataset to clean it up:
- Resolved empty outputs
- Resolved empty inputs (no input, n/a, etc.) for consistency
- Added several CoT examples (from Google's FLAN paper)
- Fixed a few empty code examples
- Instructions to generate audio or images now default to a message stating that, as an LLM, it cannot do this
- Resolved N/A outputs
- Fixed a few wrong answers
- Did my best to either insert the actual text for URLs referring to articles, or replace them with an alternate instruction
- Removed several instructions asking the LLM to pull data from the internet
- Removed extraneous escape/control characters in some answers
 
The patched dataset is much more consistent and no longer assumes the LLM can access the internet or view/generate visual data. It also now has a few CoT training examples. Would be interested to see how training goes on this updated dataset.
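One of the mechanical steps above, stripping extraneous control characters, can be sketched with the standard library alone. This is a minimal sketch of one reasonable policy (keep newlines and tabs, drop other Unicode `Cc` characters), not necessarily the exact filtering used in the cleanup.

```python
import unicodedata

def strip_control_chars(text):
    """Drop control characters (Unicode category Cc) except newline and tab."""
    return "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )

cleaned = strip_control_chars("Hello\x08 world\x00!\nNext line")
print(repr(cleaned))  # 'Hello world!\nNext line'
```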
I spent some time thinking about how to crowdsource dataset cleaning with minimal tooling. One way to do this is to create a separate repo with the following structure:
- `stanford_dataset.jsonl`: a copy of the Alpaca dataset augmented with an `id` field for identification across versions
- `reviews`: a folder of human-submitted data reviews
- `clean.py`: a script or web interface that randomly samples unreviewed data points from `stanford_dataset.jsonl` for reviewing, then writes the edited or approved example to a new `jsonl` file in `reviews`
- `combine.py`: a script that applies all the changes in `reviews` to the original dataset, and outputs a new `cleaned_dataset.jsonl`
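The `combine.py` step could be as simple as a last-write-wins merge keyed on the `id` field. This is a sketch under the assumptions of the proposal (JSONL files, an `id` field added to every record); the file and folder names come from the structure described above.

```python
import glob
import json

def combine(dataset_path="stanford_dataset.jsonl", reviews_dir="reviews",
            out_path="cleaned_dataset.jsonl"):
    """Apply all reviewed examples on top of the original dataset."""
    # Load the original dataset keyed on the added `id` field.
    examples = {}
    with open(dataset_path) as f:
        for line in f:
            ex = json.loads(line)
            examples[ex["id"]] = ex
    # Apply human reviews in filename order; later reviews win on conflict.
    for path in sorted(glob.glob(f"{reviews_dir}/*.jsonl")):
        with open(path) as f:
            for line in f:
                review = json.loads(line)
                examples[review["id"]] = review
    # Write the merged result.
    with open(out_path, "w") as f:
        for ex in examples.values():
            f.write(json.dumps(ex) + "\n")
```

Since reviews are append-only files, the same merge can be re-run whenever new reviews land.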
I suppose the utility of such an approach would depend on how many bad data points remain. In the meantime, I'll review the changes made so far and save a new "cleaned" dataset alongside the existing one.
Would the dataset benefit from multi-turn prompt:response chains rather than just a single prompt -> response? i.e. Question -> Answer -> Follow-up Q -> Follow-up A
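A multi-turn record could extend the single-turn Alpaca schema along these lines. The field names here are illustrative, not an established format for this dataset.

```python
# One possible multi-turn record extending the single-turn Alpaca schema;
# field names are illustrative, not an established format.
example = {
    "conversation": [
        {"role": "user", "content": "Name a sorting algorithm."},
        {"role": "assistant", "content": "Merge sort."},
        {"role": "user", "content": "What is its time complexity?"},
        {"role": "assistant", "content": "O(n log n) in the worst case."},
    ]
}

# Build the training prompt for the second assistant turn by
# concatenating all earlier turns.
prompt = "\n".join(f"{t['role']}: {t['content']}" for t in example["conversation"][:3])
print(prompt)
```

Each assistant turn then yields one training target, conditioned on everything before it.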
That's a lot of work to build. I'd hold out for that 22k dataset that LAION used to train SFT-1.
Folded into f7044049ab7916d38cbc84d16b323a8ecb1c6b58. Thanks for your work!
Looks like this just closed as I was typing, but there is a typo not too far into the file which I'm not sure is intentional or not:
construciton instead of construction
https://github.com/tloen/alpaca-lora/blob/main/alpaca_data_cleaned.json#L23
8aecde83cd01f3c9a1a9526f057fb90018ee9cc9
Although honestly we might want to leave typos in the instructions.
Yeah it might be worth it idk.
for prompts it seems a good idea to keep typos
People should really support LAION's open-assistant.io project, because every person helping there will improve a fully curated, crowd-sourced, open-source instruction fine-tuning dataset, which in turn can be used for alpaca fine-tuning.
FYI, the dataset cleaning is ongoing. The latest cleaned dataset can be accessed here.
> Instead of providing generic answers like "As a large language model, I am unable to..." we could introduce a standardized set of tools
Good idea. Meta is already working on this with Toolformer, and there are a few other efforts too, for example getting the model to control a web browser. They help, but not as much as you would expect at the moment (red is baseline, blue is with a calculator). Since it's a WIP, I would guess it's outside the scope of this repo for now.
