alpaca-lora
Implement automatic checking of the Alpaca data and suggest changes
Description: This pull request adds a new feature that integrates OpenAI's API, specifically GPT-3.5-turbo and GPT-4, to check the correctness of data from Alpaca. The code reads a JSON file containing Alpaca data, sends it to the OpenAI API, and receives the model's response on whether the data is correct or not, along with a reason for the judgment.
Note that this code is intended to help a human filter rows that may need cleaning or improvement.
Main changes:
- Add new dependencies: openai, pandas, and tqdm.
- Define a Config class to read and store parameters from a configuration file.
- Create functions to read Alpaca data from a JSON file and store it as a pandas DataFrame.
- Implement a function, openai_gpt, that sends a prompt to the OpenAI GPT API and returns the response, handling potential exceptions and retries.
- Generate prompts for each row of the DataFrame and send them to the OpenAI API.
- Save the model's responses and other relevant information to a CSV file.
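The retry handling and per-row prompt generation described above could be sketched as follows. This is a minimal illustration, not the PR's exact code: the function names, the prompt template, and the retry policy are assumptions, and the callable passed to `call_with_retries` stands in for the actual OpenAI API call.

```python
import time


def call_with_retries(fn, max_retries=3, backoff=2.0):
    """Call fn(), retrying on exceptions with a linear backoff.

    In the real script, fn would wrap the OpenAI chat-completion call;
    here it is any zero-argument callable so the sketch is self-contained.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries, surface the error
            time.sleep(backoff * (attempt + 1))


def build_prompt(row):
    """Build a correctness-check prompt from one Alpaca row (dict-like)."""
    return (
        "Given the instruction, input and output below, answer 'yes' if the "
        "output is a correct response to the instruction, or 'no' with a "
        "short reason if it is not.\n"
        f"Instruction: {row['instruction']}\n"
        f"Input: {row['input']}\n"
        f"Output: {row['output']}"
    )
```

In the script, `build_prompt` would be applied to each row of the DataFrame and the resulting prompt sent through the retry wrapper.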
This markdown table describes the format of the data at the end of the process:

| Column | Description |
|---|---|
| instruction | The instruction given for the specific Alpaca row |
| input | The input data provided |
| output | The output data generated |
| response_gpt | The response from the OpenAI GPT model (yes/no and explanation) |
| model | The GPT model used (currently GPT-3, GPT-3.5, or GPT-4) |
| gpt_check | A boolean indicating whether the GPT model agrees with the output (True/False) |
| reason | The reason given by the GPT model for its judgment |
The final data is saved in a CSV file with this structure, which includes the original Alpaca data, the model's responses, and the extracted boolean value and reason for the judgment.
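Extracting the `gpt_check` boolean and `reason` columns from the raw `response_gpt` text could look like this. This is a sketch under the assumption that the model answers with a leading "yes" or "no" followed by an explanation; the PR's actual parsing may differ.

```python
def parse_response(response_gpt):
    """Split a 'yes/no, explanation' model response into (gpt_check, reason).

    gpt_check is True when the model answered 'yes' (the data looks correct);
    the remainder of the response is kept as the reason.
    """
    text = response_gpt.strip()
    lowered = text.lower()
    gpt_check = lowered.startswith("yes")
    # Skip past the leading yes/no token, then strip separators
    # so only the explanation remains.
    head = 3 if gpt_check else (2 if lowered.startswith("no") else 0)
    reason = text[head:].lstrip(" .,:;-")
    return gpt_check, reason
```

The two returned values would then be written to the `gpt_check` and `reason` columns of the output CSV.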
Testing: The code has been tested with various Alpaca data samples, and the results were saved successfully in the output CSV file.
I have generated a .csv with the first 1K rows of alpaca_data.json using GPT-3.5. This is an example of the table generated after checking those rows.
The application checks the output and gives an explanation when there is a disagreement.
This is really nice. I might run it starting from the back of the dataset and see if I can update some more data as well.
@tloen, please let me know if this is relevant here. The main goal is to improve the quality of the data and measure the impact of this "automatic" task on the model's performance. Thanks.
Sorry for the slow response, lots flying around on my end. I'm happy to merge the changes to the dataset — thanks for your work!
The OpenAI script is impressive, but I think it's beyond the scope of this repo. There are better supervised fine-tuning datasets in the pipeline and I don't want to encourage people to put in too much duplicate effort, particularly when it means spending on OpenAI.
I understand. I had my doubts as well, but this was a good opportunity to check whether an automatic script could work. If you know where it could be useful, I will happily share it with other projects so more people can use it, as it seems to be a useful tool. Thanks.
Yeah, it's a very clever idea. I've also wondered about whether a model like Alpaca itself could be used to clean its own training data for subsequent runs. If so, it would make bootstrapping even easier!
Maybe so, using several attempts for the same input and taking the most common answer. In my script I used temperature = 0, but I think temperature > 0 could be useful as well.
Should I close this PR? Please let me know, so I can clean up the open PRs in the project.
Cheers, Jose
@josemlopez Would be happy to take your PR here: https://github.com/gururise/AlpacaDataCleaned