Teknium

Results: 81 comments by Teknium

> My thought is the following: > > `alpaca_data_cleaned.json` this should only contain fixes for issues with obvious errors. It will serve as the **base** cleaned alpaca dataset. > >...

Regarding the above: when running that prompt through the GPT-4 API, you get data such as this, already formatted as JSON like the original training set, afaik. [{"instruction": "Rewrite...
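Since the API returns the records as a JSON string, a minimal sketch of loading and sanity-checking such a response might look like this (the helper name and the abbreviated sample record are hypothetical, not from the actual generation run):

```python
import json

def parse_generated_entries(raw: str) -> list:
    """Parse a model response expected to be a JSON list of
    Alpaca-style records. Raises if the payload is not a list."""
    entries = json.loads(raw)
    if not isinstance(entries, list):
        raise ValueError("expected a JSON list of records")
    return entries

# Hypothetical sample shaped like the snippet above (content abbreviated):
raw = ('[{"instruction": "Rewrite the sentence.", '
       '"input": "he go home", "output": "He goes home."}]')
entries = parse_generated_entries(raw)
```

In practice the model may wrap the JSON in extra prose, so a production version would need to extract the bracketed span first.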

> Nice, here are some generated examples that might mesh well with yours: > > ``` > Certainly! Here are example prompts that an expert prompt engineer with the mentioned...

> > The one issue I have with this is that I think all new datasets should conform to alpaca dataset's format, i.e., with just an instruction, input, and response...
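The conformance check described above can be sketched as follows. This assumes the keys used in `alpaca_data.json` itself, where the response field is named `output` (the comment says "response", but the file uses `output`):

```python
# Keys of an Alpaca-format record, per the original alpaca_data.json.
ALPACA_KEYS = {"instruction", "input", "output"}

def conforms_to_alpaca(record: dict) -> bool:
    """True if the record has exactly the Alpaca keys, all strings."""
    return (set(record) == ALPACA_KEYS
            and all(isinstance(record[k], str) for k in ALPACA_KEYS))

# Hypothetical examples:
good = {"instruction": "Summarize.", "input": "Long text...", "output": "Short."}
bad = {"prompt": "Summarize.", "completion": "Short."}
```

A new dataset could be filtered with a check like this before merging it into the base cleaned set.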

Another dataset has been produced for code-generation instruction tuning: https://github.com/sahil280114/codealpaca

And the Guanaco dataset here, which is basically the Alpaca set rebuilt with GPT-3.5 instead of Davinci: https://github.com/IntoThatGoodNight/Guanaco-Dataset

> > And guanaco dataset here which is basically rebuilt alpaca set but with gpt3.5 instead of davinci: https://github.com/IntoThatGoodNight/Guanaco-Dataset > > Interesting. Do you know if the same Stanford `seed_tasks.jsonl`...

Yet another generated dataset to keep an eye on: https://github.com/vaguenebula/AlpacaDataReflect It uses GPT-3.5 (I believe) to critique each response in the Alpaca dataset.

Here's an 800k GPT-3.5-turbo dataset (and LoRA): https://github.com/nomic-ai/gpt4all