alpaca-lora
Implement automatic checking of the Alpaca data and suggest changes
Description: This pull request adds a new feature that integrates OpenAI's API, specifically GPT-3.5-turbo and GPT-4, to check the correctness of data from Alpaca. The code reads a JSON file containing Alpaca data, sends it to the OpenAI API, and receives the model's response on whether the data is correct or not, along with a reason for the judgment.
Note that this code is intended to help a human filter rows that may need cleaning or improvement.
Main changes:
- Add new dependencies: openai, pandas, and tqdm.
- Define a Config class to read and store parameters from a configuration file.
- Create functions to read Alpaca data from a JSON file and store it as a pandas DataFrame.
- Implement a function, openai_gpt, that sends a prompt to the OpenAI GPT API and returns the response, handling potential exceptions and retries.
- Generate prompts for each row of the DataFrame and send them to the OpenAI API.
- Save the model's responses and other relevant information to a CSV file.
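The retry handling and per-row prompt generation described above could be sketched as follows. This is a minimal illustration, not the PR's exact code: the function names, the prompt template, and the retry policy are assumptions, and the callable passed to `call_with_retries` stands in for the actual OpenAI API call.

```python
import time


def call_with_retries(fn, max_retries=3, backoff=2.0):
    """Call fn(), retrying on exceptions with a linear backoff.

    In the real script, fn would wrap the OpenAI chat-completion call;
    here it is any zero-argument callable so the sketch is self-contained.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries, surface the error
            time.sleep(backoff * (attempt + 1))


def build_prompt(row):
    """Build a correctness-check prompt from one Alpaca row (dict-like)."""
    return (
        "Given the instruction, input and output below, answer 'yes' if the "
        "output is a correct response to the instruction, or 'no' with a "
        "short reason if it is not.\n"
        f"Instruction: {row['instruction']}\n"
        f"Input: {row['input']}\n"
        f"Output: {row['output']}"
    )
```

In the script, `build_prompt` would be applied to each row of the DataFrame and the resulting prompt sent through the retry wrapper.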
This markdown table describes the format of the data at the end of the process:

| Column | Description |
|---|---|
| instruction | The instruction given for the specific Alpaca row |
| input | The input data provided |
| output | The output data generated |
| response_gpt | The response from the OpenAI GPT model (yes/no and explanation) |
| model | The GPT model used (currently GPT-3, GPT-3.5, or GPT-4) |
| gpt_check | A boolean indicating whether the GPT model agrees with the output (True/False) |
| reason | The reason given by the GPT model for its judgment |
The final data is saved in a CSV file with this structure, which includes the original Alpaca data, the model's responses, and the extracted boolean value and reason for the judgment.
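Extracting the `gpt_check` boolean and `reason` columns from the raw `response_gpt` text could look like this. This is a sketch under the assumption that the model answers with a leading "yes" or "no" followed by an explanation; the PR's actual parsing may differ.

```python
def parse_response(response_gpt):
    """Split a 'yes/no, explanation' model response into (gpt_check, reason).

    gpt_check is True when the model answered 'yes' (the data looks correct);
    the remainder of the response is kept as the reason.
    """
    text = response_gpt.strip()
    lowered = text.lower()
    gpt_check = lowered.startswith("yes")
    # Skip past the leading yes/no token, then strip separators
    # so only the explanation remains.
    head = 3 if gpt_check else (2 if lowered.startswith("no") else 0)
    reason = text[head:].lstrip(" .,:;-")
    return gpt_check, reason
```

The two returned values would then be written to the `gpt_check` and `reason` columns of the output CSV.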
Testing: The code has been tested with various Alpaca data samples, and the results were saved successfully in the output CSV file.
I have generated a .csv with the first 1K rows of alpaca_data.json using GPT-3.5. This is an example of the table generated after checking those rows.
The application checks the output and gives an explanation when there is a disagreement.
This is really nice. I might run it starting from the back of the dataset and see if I can update some more data as well.
@tloen, please let me know if this is relevant here. The main goal is to improve the quality of the data and measure the impact of this "automatic" task on the model's performance. Thanks.
Sorry for the slow response, lots flying around on my end. I'm happy to merge the changes to the dataset — thanks for your work!
The OpenAI script is impressive, but I think it's beyond the scope of this repo. There are better supervised fine-tuning datasets in the pipeline and I don't want to encourage people to put in too much duplicate effort, particularly when it means spending on OpenAI.
I understand. I had my doubts as well, but this was a good opportunity to check whether an automatic script could work. If you know where it could be useful, I will happily share it with other projects so more people can use it, as it seems to be a useful tool. Thanks.
Yeah, it's a very clever idea. I've also wondered about whether a model like Alpaca itself could be used to clean its own training data for subsequent runs. If so, it would make bootstrapping even easier!
Maybe so, using several attempts for the same input and taking the most common answer. In my script I used temperature = 0, but I think temperature > 0 could be useful as well.
Should I close this PR? Please let me know, so I can clean up the open PRs in the project.
Cheers, Jose
@josemlopez Would be happy to take your PR here: https://github.com/gururise/AlpacaDataCleaned