
πŸ‘©πŸ€πŸ€– A curated list of datasets for large language models (LLMs), RLHF and related resources (continually updated)

πŸ‘©πŸ€πŸ€– awesome-llm-datasets

This repository is a collection of useful links related to datasets for large language models (LLMs) and Reinforcement Learning from Human Feedback (RLHF).

It includes a variety of open datasets, as well as tools, pre-trained models, and research papers that can help researchers and developers work with LLMs and RLHF from a data perspective.

Follow and star for the latest and greatest links related to datasets for LLMs and RLHF.

Table of Contents

  1. πŸ“¦ Datasets
    1. πŸ“š For pre-training
      1. 2023
      2. Before 2023
    2. πŸ—£οΈ For instruction-tuning
    3. πŸ‘©πŸ€πŸ€– For RLHF & Alignment
    4. βš–οΈ For evaluation
    5. πŸ‘½ For other purposes
  2. 🦾 Models and their datasets
  3. 🧰 Tools and methods
  4. πŸ“” Papers

Datasets

For pre-training

2023

RedPajama Data:

A 1.2-trillion-token dataset in English:

| Dataset       | Token Count  |
|---------------|--------------|
| CommonCrawl   | 878 Billion  |
| C4            | 175 Billion  |
| GitHub        | 59 Billion   |
| Books         | 26 Billion   |
| ArXiv         | 28 Billion   |
| Wikipedia     | 24 Billion   |
| StackExchange | 20 Billion   |
| Total         | 1.2 Trillion |

Also includes code for data preparation, deduplication, tokenization, and visualization.

Created by Ontocord.ai, MILA QuΓ©bec AI Institute, ETH DS3Lab, UniversitΓ© de MontrΓ©al, Stanford Center for Research on Foundation Models (CRFM), the Stanford Hazy Research group, and LAION.
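
As a hedged usage sketch, the snippet below streams a few records from what is assumed to be the RedPajama sample dataset on the Hugging Face Hub; the dataset ID (togethercomputer/RedPajama-Data-1T-Sample) and the "text" field name are assumptions, so verify them against the dataset card before running.

```python
# Minimal sketch: stream a handful of RedPajama records with Hugging Face `datasets`.
# The dataset ID and the "text" field are assumptions; check the Hub card for the
# exact name, available configs, and any trust_remote_code requirements.
from datasets import load_dataset

ds = load_dataset(
    "togethercomputer/RedPajama-Data-1T-Sample",  # assumed sample subset, not the full 1.2T-token corpus
    split="train",
    streaming=True,  # avoid downloading the whole dataset up front
)

for i, record in enumerate(ds):
    print(record["text"][:200])  # each record is assumed to carry a raw "text" field
    if i >= 2:
        break
```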

Before 2023

For instruction-tuning

For RLHF & Alignment

For evaluation

For other purposes

Models and their datasets

LLaMA

Overview: A collection of foundation models released by Meta AI, ranging in size from 7B to 65B parameters.

License: Non-commercial bespoke (model), GPL-3.0 (code)

πŸ“ Release blog post πŸ“„ arXiv publication πŸƒ Model card

Vicuna

Overview: A 13B parameter open source chatbot fine-tuned from LLaMA on ~70k user-shared ChatGPT conversations; it maintains 92% of ChatGPT's performance and outperforms LLaMA and Alpaca.

License: Non-commercial bespoke license (model), Apache 2.0 (code).

πŸ“¦ Repo

πŸ“ Release blog post

πŸ”— ShareGPT dataset

πŸ€— Models

πŸ€– Gradio demo
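
For reference, below is a hedged sketch of a ShareGPT-style conversation record of the kind Vicuna-style fine-tuning pipelines consume; the field names (id, conversations, from, value) are assumptions based on common ShareGPT exports, not an official schema.

```python
import json

# Hypothetical ShareGPT-style record: one multi-turn conversation per entry.
# Field names are assumptions; verify them against the actual ShareGPT dump you use.
record = {
    "id": "example-0001",
    "conversations": [
        {"from": "human", "value": "Explain RLHF in one sentence."},
        {"from": "gpt", "value": "RLHF fine-tunes a language model with a reward "
                                 "signal derived from human preference judgments."},
    ],
}

# Fine-tuning code typically flattens the turns into a single prompt/response string.
print(json.dumps(record, indent=2))
```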

Dolly 2.0

Overview: A fully open source 12B parameter instruction-following LLM, fine-tuned on a human-generated instruction dataset licensed for research and commercial use.

License: CC BY-SA 3.0 (model), CC BY-SA 3.0 (dataset), Apache 2.0 (code).

πŸ“¦ Repo

πŸ“ Release blog post

πŸ€— Models
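
A hedged loading sketch: the snippet assumes the underlying instruction dataset is published as databricks/databricks-dolly-15k on the Hugging Face Hub with instruction, context, response, and category fields; confirm both against the dataset card.

```python
from datasets import load_dataset

# Assumed dataset ID and field names; check the dataset card before relying on them.
dolly = load_dataset("databricks/databricks-dolly-15k", split="train")

example = dolly[0]
print(example["instruction"])
print(example["context"])   # may be empty for closed-book tasks
print(example["response"])
print(example["category"])  # e.g. open_qa, brainstorming, summarization
```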

LLaVA

Overview: A multi-modal LLM that combines a vision encoder and Vicuna for general-purpose visual and language understanding, with capabilities similar to GPT-4.

License: Non-commercial bespoke (model), CC BY-NC 4.0 (dataset), Apache 2.0 (code).

πŸ“¦ Repo

πŸ“ Project homepage

πŸ“„ arXiv publication

πŸ€— Dataset & models

πŸ€– Gradio demo

StableLM

Overview: A suite of small (3B and 7B parameter) LLMs trained on a new 1.5-trillion-token dataset built on The Pile.

License: CC BY-SA 4.0 (models).

πŸ“¦ Repo

πŸ“ Release blog post

πŸ€— Models

πŸ€– Gradio demo
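
As a hedged usage sketch, the snippet below generates text from one of the released checkpoints with Hugging Face transformers; the model ID stabilityai/stablelm-base-alpha-7b is an assumption about the Hub name, so verify it (and the hardware requirements) before running.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint name on the Hugging Face Hub; substitute the actual ID if it differs.
model_id = "stabilityai/stablelm-base-alpha-7b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("The Pile is a dataset that", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```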

Alpaca

Overview: A partially open source instruction-following model fine-tuned from LLaMA; it is smaller and cheaper than GPT-3.5 yet performs similarly.

License: Non-commercial bespoke (model), CC BY-NC 4.0 (dataset), Apache 2.0 (code).

πŸ“ Release blog post

πŸ€— Dataset
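
For reference, below is a hedged sketch of the Alpaca-style instruction record and prompt template; the instruction/input/output fields follow the format of the released Alpaca data, but the template wording here is an approximation of the one in the Alpaca repo, not a verbatim copy.

```python
# Alpaca-style instruction record with instruction/input/output fields.
# The prompt template below approximates the one used by the Alpaca repo; treat it as a sketch.
record = {
    "instruction": "Summarize the following passage in one sentence.",
    "input": "Large language models are trained on web-scale text corpora...",
    "output": "LLMs learn broad language ability from massive web-scale text.",
}

PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that provides "
    "further context. Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)

print(PROMPT_WITH_INPUT.format(**record) + record["output"])
```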

Tools and methods

Papers