Teknium

Results: 81 comments by Teknium

> My thought is the following: > > `alpaca_data_cleaned.json` this should only contain fixes for issues with obvious errors. It will serve as the **base** cleaned alpaca dataset. > >...

Regarding the above: when running that prompt through the GPT-4 API, you get data such as this, already formatted as JSON like the original training set, afaik. [{"instruction": "Rewrite...
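Since the API returns the records as a JSON string, a minimal sketch of loading and sanity-checking such a response might look like this (the helper name and the abbreviated sample record are hypothetical, not from the actual generation run):

```python
import json

def parse_generated_entries(raw: str) -> list:
    """Parse a model response expected to be a JSON list of
    Alpaca-style records. Raises if the payload is not a list."""
    entries = json.loads(raw)
    if not isinstance(entries, list):
        raise ValueError("expected a JSON list of records")
    return entries

# Hypothetical sample shaped like the snippet above (content abbreviated):
raw = ('[{"instruction": "Rewrite the sentence.", '
       '"input": "he go home", "output": "He goes home."}]')
entries = parse_generated_entries(raw)
```

In practice the model may wrap the JSON in extra prose, so a production version would need to extract the bracketed span first.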

> Nice, here are some generated examples that might mesh well with yours: > > ``` > Certainly! Here are example prompts that an expert prompt engineer with the mentioned...

> > The one issue I have with this is that I think all new datasets should conform to alpaca dataset's format, i.e., with just an instruction, input, and response...
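The conformance check described above can be sketched as follows. This assumes the keys used in `alpaca_data.json` itself, where the response field is named `output` (the comment says "response", but the file uses `output`):

```python
# Keys of an Alpaca-format record, per the original alpaca_data.json.
ALPACA_KEYS = {"instruction", "input", "output"}

def conforms_to_alpaca(record: dict) -> bool:
    """True if the record has exactly the Alpaca keys, all strings."""
    return (set(record) == ALPACA_KEYS
            and all(isinstance(record[k], str) for k in ALPACA_KEYS))

# Hypothetical examples:
good = {"instruction": "Summarize.", "input": "Long text...", "output": "Short."}
bad = {"prompt": "Summarize.", "completion": "Short."}
```

A new dataset could be filtered with a check like this before merging it into the base cleaned set.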

Another dataset has been produced for code-generation instruction tuning: https://github.com/sahil280114/codealpaca

And the Guanaco dataset here, which is basically the Alpaca set rebuilt with GPT-3.5 instead of Davinci: https://github.com/IntoThatGoodNight/Guanaco-Dataset

> > And guanaco dataset here which is basically rebuilt alpaca set but with gpt3.5 instead of davinci: https://github.com/IntoThatGoodNight/Guanaco-Dataset > > Interesting. Do you know if the same Stanford `seed_tasks.jsonl`...

Yet another generated dataset to keep an eye on: https://github.com/vaguenebula/AlpacaDataReflect It uses GPT-3.5 (I believe) to critique each response in the Alpaca dataset.

Here's an 800k GPT-3.5-turbo dataset (and LoRA): https://github.com/nomic-ai/gpt4all