
πŸ‘©πŸ€πŸ€– A curated list of datasets for large language models (LLMs), RLHF and related resources (continually updated)

πŸ‘©πŸ€πŸ€– awesome-llm-datasets

This repository is a collection of useful links related to datasets for large language models (LLMs) and Reinforcement Learning from Human Feedback (RLHF).

It includes a variety of open datasets, as well as tools, pre-trained models, and research papers that can help researchers and developers work with LLMs and RLHF from a data perspective.

Follow and star for the latest and greatest links related to datasets for LLMs and RLHF.

Table of Contents

  1. πŸ“¦ Datasets
    1. πŸ“š For pre-training
      1. 2023
      2. Before 2023
    2. πŸ—£οΈ For instruction-tuning
    3. πŸ‘©πŸ€πŸ€– For RLHF & Alignment
    4. βš–οΈ For evaluation
    5. πŸ‘½ For other purposes
  2. 🦾 Models and their datasets
  3. 🧰 Tools and methods
  4. πŸ“” Papers

Datasets

For pre-training

2023

RedPajama Data:

A 1.2-trillion-token dataset in English:

| Dataset       | Token Count  |
|---------------|--------------|
| CommonCrawl   | 878 Billion  |
| C4            | 175 Billion  |
| GitHub        | 59 Billion   |
| Books         | 26 Billion   |
| ArXiv         | 28 Billion   |
| Wikipedia     | 24 Billion   |
| StackExchange | 20 Billion   |
| Total         | 1.2 Trillion |

Also includes code for data preparation, deduplication, tokenization, and visualization.

Created by Ontocord.ai, MILA QuΓ©bec AI Institute, ETH DS3Lab, UniversitΓ© de MontrΓ©al, Stanford Center for Research on Foundation Models (CRFM), the Stanford Hazy Research group, and LAION.
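
As a hedged usage sketch, the snippet below streams a few records from what is assumed to be the RedPajama sample dataset on the Hugging Face Hub; the dataset ID (togethercomputer/RedPajama-Data-1T-Sample) and the "text" field name are assumptions, so verify them against the dataset card before running.

```python
# Minimal sketch: stream a handful of RedPajama records with Hugging Face `datasets`.
# The dataset ID and the "text" field are assumptions; check the Hub card for the
# exact name, available configs, and any trust_remote_code requirements.
from datasets import load_dataset

ds = load_dataset(
    "togethercomputer/RedPajama-Data-1T-Sample",  # assumed sample subset, not the full 1.2T-token corpus
    split="train",
    streaming=True,  # avoid downloading the whole dataset up front
)

for i, record in enumerate(ds):
    print(record["text"][:200])  # each record is assumed to carry a raw "text" field
    if i >= 2:
        break
```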

Before 2023

For instruction-tuning

For RLHF & Alignment

For evaluation

For other purposes

Models and their datasets

LLaMA

Overview: A collection of foundation models released by Meta AI, ranging in size from 7B to 65B parameters.

License: Non-commercial bespoke (model), GPL-3.0 (code)

πŸ“ Release blog post πŸ“„ arXiv publication πŸƒ Model card

Vicuna

Overview: A 13B parameter open source chatbot fine-tuned from LLaMA on ~70k user-shared ChatGPT conversations; it maintains 92% of ChatGPT's performance and outperforms LLaMA and Alpaca.

License: Non-commercial bespoke license (model), Apache 2.0 (code).

πŸ“¦ Repo

πŸ“ Release blog post

πŸ”— ShareGPT dataset

πŸ€— Models

πŸ€– Gradio demo
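
For reference, below is a hedged sketch of a ShareGPT-style conversation record of the kind Vicuna-style fine-tuning pipelines consume; the field names (id, conversations, from, value) are assumptions based on common ShareGPT exports, not an official schema.

```python
import json

# Hypothetical ShareGPT-style record: one multi-turn conversation per entry.
# Field names are assumptions; verify them against the actual ShareGPT dump you use.
record = {
    "id": "example-0001",
    "conversations": [
        {"from": "human", "value": "Explain RLHF in one sentence."},
        {"from": "gpt", "value": "RLHF fine-tunes a language model with a reward "
                                 "signal derived from human preference judgments."},
    ],
}

# Fine-tuning code typically flattens the turns into a single prompt/response string.
print(json.dumps(record, indent=2))
```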

Dolly 2.0

Overview: A fully open source 12B parameter instruction-following LLM, fine-tuned on a human-generated instruction dataset licensed for research and commercial use.

License: CC BY-SA 3.0 (model), CC BY-SA 3.0 (dataset), Apache 2.0 (code).

πŸ“¦ Repo

πŸ“ Release blog post

πŸ€— Models
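
A hedged loading sketch: the snippet assumes the underlying instruction dataset is published as databricks/databricks-dolly-15k on the Hugging Face Hub with instruction, context, response, and category fields; confirm both against the dataset card.

```python
from datasets import load_dataset

# Assumed dataset ID and field names; check the dataset card before relying on them.
dolly = load_dataset("databricks/databricks-dolly-15k", split="train")

example = dolly[0]
print(example["instruction"])
print(example["context"])   # may be empty for closed-book tasks
print(example["response"])
print(example["category"])  # e.g. open_qa, brainstorming, summarization
```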

LLaVA

Overview: A multi-modal LLM that combines a vision encoder and Vicuna for general-purpose visual and language understanding, with capabilities similar to GPT-4.

License: Non-commercial bespoke (model), CC BY-NC 4.0 (dataset), Apache 2.0 (code).

πŸ“¦ Repo

πŸ“ Project homepage

πŸ“„ arXiv publication

πŸ€— Dataset & models

πŸ€– Gradio demo

StableLM

Overview: A suite of small (3B and 7B parameter) LLMs trained on a new 1.5-trillion-token dataset built on The Pile.

License: CC BY-SA 4.0 (models).

πŸ“¦ Repo

πŸ“ Release blog post

πŸ€— Models

πŸ€– Gradio demo
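
As a hedged usage sketch, the snippet below generates text from one of the released checkpoints with Hugging Face transformers; the model ID stabilityai/stablelm-base-alpha-7b is an assumption about the Hub name, so verify it (and the hardware requirements) before running.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint name on the Hugging Face Hub; substitute the actual ID if it differs.
model_id = "stabilityai/stablelm-base-alpha-7b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("The Pile is a dataset that", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```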

Alpaca

Overview: A partially open source instruction-following model fine-tuned from LLaMA; it is smaller and cheaper than GPT-3.5 yet performs similarly.

License: Non-commercial bespoke (model), CC BY-NC 4.0 (dataset), Apache 2.0 (code).

πŸ“ Release blog post

πŸ€— Dataset
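
For reference, below is a hedged sketch of the Alpaca-style instruction record and prompt template; the instruction/input/output fields follow the format of the released Alpaca data, but the template wording here is an approximation of the one in the Alpaca repo, not a verbatim copy.

```python
# Alpaca-style instruction record with instruction/input/output fields.
# The prompt template below approximates the one used by the Alpaca repo; treat it as a sketch.
record = {
    "instruction": "Summarize the following passage in one sentence.",
    "input": "Large language models are trained on web-scale text corpora...",
    "output": "LLMs learn broad language ability from massive web-scale text.",
}

PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that provides "
    "further context. Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)

print(PROMPT_WITH_INPUT.format(**record) + record["output"])
```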

Tools and methods

Papers