Teknium
Awesome. I hope to contribute
> My thought is the following:
>
> `alpaca_data_cleaned.json`: this should only contain fixes for issues with obvious errors. It will serve as the **base** cleaned alpaca dataset.
>
> ...
Regarding the above: when running that prompt through the GPT-4 API, you get data such as this, already formatted into JSON like the original training set, afaik: `[{"instruction": "Rewrite...`
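If you want to sanity-check that the output actually parses, here is a minimal sketch, assuming the pre-1.0 `openai` Python client and that the model returns the list as bare JSON text (`GENERATION_PROMPT` is just a stand-in for the prompt discussed above, not the real thing):

```
import json
import openai  # pre-1.0 client; set openai.api_key before calling

# Hypothetical stand-in for the generation prompt discussed above.
GENERATION_PROMPT = "..."

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": GENERATION_PROMPT}],
)

# If the model returns the list as bare JSON text, this yields a list of
# {"instruction", "input", "output"} dicts in the original Alpaca layout.
records = json.loads(response["choices"][0]["message"]["content"])
print(records[0]["instruction"])
```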
> Nice, here are some generated examples that might mesh well with yours:
>
> ```
> Certainly! Here are example prompts that an expert prompt engineer with the mentioned...
> ```
> > The one issue I have with this is that I think all new datasets should conform to the Alpaca dataset's format, i.e., with just an instruction, input, and response...
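Since that schema keeps coming up, here is a minimal conformance check, a sketch that assumes the usual Alpaca JSON layout where the response field is literally named "output":

```
import json

# The three keys used by the original alpaca_data.json release.
REQUIRED_KEYS = {"instruction", "input", "output"}

def conforms(path):
    """Return True if every record carries exactly the Alpaca keys."""
    with open(path) as f:
        data = json.load(f)
    return all(isinstance(r, dict) and set(r) == REQUIRED_KEYS for r in data)

print(conforms("alpaca_data_cleaned.json"))
```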
Another dataset has been produced for code-generation instruction tuning: https://github.com/sahil280114/codealpaca
And the Guanaco dataset here, which is basically the Alpaca set rebuilt with GPT-3.5 instead of Davinci: https://github.com/IntoThatGoodNight/Guanaco-Dataset
> And the Guanaco dataset here, which is basically the Alpaca set rebuilt with GPT-3.5 instead of Davinci: https://github.com/IntoThatGoodNight/Guanaco-Dataset

Interesting. Do you know if the same Stanford `seed_tasks.jsonl`...
Yet another generated dataset to keep an eye on: https://github.com/vaguenebula/AlpacaDataReflect
It used GPT-3.5 (I believe) to critique each response in the Alpaca dataset.
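That critique pass is easy to reproduce in miniature. Here is a hedged sketch of the idea, again on the pre-1.0 `openai` client; the critique prompt below is an assumption on my part, so check the repo for the real pipeline:

```
import json
import openai  # pre-1.0 client; set openai.api_key before calling

with open("alpaca_data_cleaned.json") as f:
    records = json.load(f)

critiques = []
for r in records[:10]:  # small slice, just for illustration
    # Hypothetical critique prompt; not necessarily what the repo uses.
    prompt = (
        f"Instruction: {r['instruction']}\n"
        f"Input: {r['input']}\n"
        f"Response: {r['output']}\n\n"
        "Critique the response above for correctness and completeness."
    )
    reply = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    critiques.append(reply["choices"][0]["message"]["content"])
```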
Here's an 800k GPT-3.5-turbo dataset (and LoRA): https://github.com/nomic-ai/gpt4all