Some cleaning of the first 1K rows in alpaca_data_cleaned

Open josemlopez opened this issue 2 years ago • 2 comments

This is a partial cleaning of the first 1K rows, using the tool from https://github.com/tloen/alpaca-lora/pull/62 to identify potential errors in the dataset. In total, the tool found around 90 potential errors in the first 1K rows.

josemlopez avatar Mar 19 '23 08:03 josemlopez

great initiative!

How much did cleaning the first 1K rows with GPT cost, and which model did you use? There is a considerable price difference between 3.5 and 4.

kesar avatar Mar 19 '23 09:03 kesar

The price was really low. Yesterday I ran thousands of rows through it several times (I can't give the exact number because I was testing and debugging), and it cost $1.29 in total. I think the whole dataset will cost less than $50 with GPT-3.5. Checking the rows with GPT-3.5 is cheap but less precise; when the check is done with GPT-4, the answers are really good. The real task is to manually check all the flagged rows. It would be great to have a Gradio application that shows the flagged rows so they can be edited. If I have the time, I'll do it.

josemlopez avatar Mar 19 '23 10:03 josemlopez

Very nice. Did you use GPT-3.5? Would using GPT-4 cost significantly more?

gururise avatar Mar 19 '23 19:03 gururise

Regarding GPT-4 prices: it is 30x the price for the completion (which is the heavy part here) and 10x the price for the prompt. For the 50K samples, I think it would be around $1.5K (not an exact number, of course).
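The estimate above can be written out as a small calculation. This is only a minimal sketch, assuming the multipliers quoted in this comment (30x for completion, 10x for prompt) and the "less than $50" GPT-3.5 figure from earlier in the thread. The function name `estimate_gpt4_cost` and the `completion_share` parameter are hypothetical, introduced for illustration, and the real split between prompt and completion spend was not measured here.

```python
# Rough cost scaling from GPT-3.5 to GPT-4.
# Assumption: every figure is a rough estimate quoted in this thread,
# not a measured token count or an official price.
GPT35_DATASET_COST_USD = 50.0   # upper bound quoted for the whole ~50K-row dataset
COMPLETION_MULTIPLIER = 30.0    # GPT-4 vs GPT-3.5 price ratio for completion, as quoted
PROMPT_MULTIPLIER = 10.0        # GPT-4 vs GPT-3.5 price ratio for prompt, as quoted


def estimate_gpt4_cost(gpt35_cost_usd: float, completion_share: float) -> float:
    """Scale a GPT-3.5 cost to GPT-4.

    completion_share is the (hypothetical) fraction of the GPT-3.5 spend
    that went to completion tokens; the rest is treated as prompt spend.
    """
    if not 0.0 <= completion_share <= 1.0:
        raise ValueError("completion_share must be between 0 and 1")
    prompt_share = 1.0 - completion_share
    multiplier = (completion_share * COMPLETION_MULTIPLIER
                  + prompt_share * PROMPT_MULTIPLIER)
    return gpt35_cost_usd * multiplier


if __name__ == "__main__":
    # Show how the estimate moves with the assumed completion share.
    for share in (0.0, 0.5, 1.0):
        cost = estimate_gpt4_cost(GPT35_DATASET_COST_USD, share)
        print(f"completion share {share:.0%}: ~${cost:,.0f}")
```

Under these assumptions the $1.5K figure corresponds to the case where essentially all of the spend is completion; if a larger part of the spend were prompt, the same inputs would give a lower number.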

josemlopez avatar Mar 19 '23 19:03 josemlopez

OK, I'll close this pull request, then. If anyone is interested, I'll continue the cleaning in https://github.com/josemlopez/check-with-gpt and measure the impact of the different dataset improvements on performance.

josemlopez avatar Mar 20 '23 09:03 josemlopez