alpaca-lora
Some cleaning of the first 1K rows in alpaca_data_cleaned
This is a partial cleaning of the first 1K rows, using the tool in this pull request to identify potential errors in the dataset: https://github.com/tloen/alpaca-lora/pull/62. In total, the tool found around 90 potential errors in the first 1K rows.
great initiative!
How much did it cost to clean the first 1K rows using GPT? And which model did you use, since there is a considerable price difference between 3.5 and 4?
The price was really low. Yesterday I ran thousands of rows through it several times (I can't give the exact number because I was testing and debugging) and it cost $1.29 in total. I think the whole dataset will cost less than $50 with GPT-3.5. Checking the rows with GPT-3.5 is cheap but less precise; when the check is done with GPT-4 the answers are really good. The real task is to manually check all the marked rows. It would be great to have a Gradio application that shows the marked rows so they can be edited. If I have the time I'll do it.
Very nice. Did you use GPT-3.5? Would using GPT-4 cost significantly more?
Regarding GPT-4 prices: it is x30 the price for the completion (which is the heavy part here) and x10 the price for the prompt. For the 50K samples, I think it would be around $1.5K (not an exact number, of course).
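For anyone who wants to redo this estimate with their own numbers, here is a rough sketch of the arithmetic. The x10 and x30 multipliers are the ones mentioned above; the GPT-3.5 base price of $0.002 per 1K tokens and the per-row token counts are assumptions for illustration only, not measurements from this dataset, and `estimate_cost` is a hypothetical helper, not part of the tool.

```python
def estimate_cost(n_rows, prompt_tokens, completion_tokens,
                  prompt_price_per_1k, completion_price_per_1k):
    """Return the estimated API cost in USD for checking n_rows rows.

    prompt_tokens / completion_tokens are average tokens per row.
    Prices are in USD per 1K tokens.
    """
    prompt_cost = n_rows * prompt_tokens / 1000 * prompt_price_per_1k
    completion_cost = n_rows * completion_tokens / 1000 * completion_price_per_1k
    return prompt_cost + completion_cost


# Assumption: GPT-3.5 price per 1K tokens (same for prompt and completion).
GPT35_PRICE_PER_1K = 0.002

# Multipliers quoted in this thread for GPT-4 relative to GPT-3.5.
GPT4_PROMPT_MULTIPLIER = 10
GPT4_COMPLETION_MULTIPLIER = 30

# Hypothetical workload: replace these with token counts measured on real rows.
N_ROWS = 50_000
PROMPT_TOKENS_PER_ROW = 200
COMPLETION_TOKENS_PER_ROW = 300

gpt35_cost = estimate_cost(
    N_ROWS, PROMPT_TOKENS_PER_ROW, COMPLETION_TOKENS_PER_ROW,
    GPT35_PRICE_PER_1K, GPT35_PRICE_PER_1K,
)
gpt4_cost = estimate_cost(
    N_ROWS, PROMPT_TOKENS_PER_ROW, COMPLETION_TOKENS_PER_ROW,
    GPT35_PRICE_PER_1K * GPT4_PROMPT_MULTIPLIER,
    GPT35_PRICE_PER_1K * GPT4_COMPLETION_MULTIPLIER,
)

print(f"GPT-3.5 estimate: ${gpt35_cost:,.2f}")
print(f"GPT-4 estimate:   ${gpt4_cost:,.2f}")
```

Because the completion is priced much higher than the prompt on GPT-4, the estimate is most sensitive to `COMPLETION_TOKENS_PER_ROW`, so it is worth measuring that value on a small sample before running the full dataset.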

OK, I'll close this pull request, then. If anyone is interested, the repo is https://github.com/josemlopez/check-with-gpt. I'll continue the cleaning there, to measure the impact of the different dataset improvements on performance.