Dataset: RedPajama

Open: iAdanos opened this issue 2 years ago • 4 comments

RedPajama is an open dataset containing more than 1.2 trillion tokens - https://www.together.xyz/blog/redpajama. It has a permissive license and a lot of data, so it could bring a lot of knowledge into the project. It would also make it possible to switch from a LLaMA-based model to a custom one or, for example, a Dolly-based one.

GitHub: https://github.com/togethercomputer/RedPajama-Data
Hugging Face: https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T
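
A minimal sketch of how the corpus could be streamed from the Hugging Face Hub with the `datasets` library; the configuration name, the `streaming` flag and the `"text"` field are assumptions based on the dataset card, not something verified here:

```python
# Sketch: stream a slice of RedPajama-Data-1T instead of downloading the full
# ~1.2T-token corpus. Configuration name and record fields are assumptions.
from datasets import load_dataset

ds = load_dataset(
    "togethercomputer/RedPajama-Data-1T",
    "arxiv",            # one of the per-source configurations (assumed name)
    split="train",
    streaming=True,     # iterate lazily rather than materialising data on disk
)

for i, record in enumerate(ds):
    print(record["text"][:200])  # each record is expected to expose raw text
    if i >= 2:
        break
```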

iAdanos · Apr 20 '23 15:04

This dataset would be useful for pretraining rather than instruction tuning. Pretraining is very expensive and requires huge amounts of compute, which OA cannot currently commit, so we are exclusively finetuning from existing models.

olliestanley · Apr 20 '23 15:04

The RedPajama project by Together and several other organizations is supposed to have, according to their article, 3 components:

  1. Pretraining data
  2. Base models
  3. Instruction tuning data and models to make them usable and safe

Pretraining data cannot be used directly, for the reason stated by Oliver. I think what we should hope for are high-quality base models (which should be the next asset) that could be finetuned for Open Assistant, either to replace LLaMA or at least to provide another open-source option besides Pythia.

I am skeptical of the third asset because it's unclear what "safe" would imply here (the word appears exactly once in the whole article). Alignment with one organization's perception of safety often reduces utility and gives up neutrality. I am a lot more optimistic about the "Base models" asset, and I truly hope they will be a viable base for Open Assistant in the near future.

Aspie96 · Apr 20 '23 18:04

Yes, agreed. We can definitely look at their models and maybe the instruction data when those are available; my comment only applies to the pretraining text corpus.

olliestanley · Apr 20 '23 19:04

This dataset could indeed also be helpful during SFT to potentially reduce or delay overfitting; the small sample in particular seems interesting. Its distribution is close to the LLaMA training set, so mixing it in would amount to continuing training with a simple language-modelling objective.
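
A rough sketch of what mixing the small RedPajama sample into an SFT data stream could look like, using `datasets.interleave_datasets`; the SFT file name, the mixing ratio and the shared `"text"` column are illustrative assumptions, not settings used by the OA trainer:

```python
# Sketch: blend a small share of plain language-modelling data from the
# RedPajama 1T sample into an instruction-tuning stream to delay overfitting.
# File name, column names and mixing probabilities are assumptions.
from datasets import load_dataset, interleave_datasets

sft_ds = load_dataset("json", data_files="oa_sft_examples.jsonl", split="train")  # hypothetical SFT data
lm_ds = load_dataset("togethercomputer/RedPajama-Data-1T-Sample", split="train")

# interleave_datasets needs matching schemas, so keep only a shared "text" column.
sft_ds = sft_ds.remove_columns([c for c in sft_ds.column_names if c != "text"])
lm_ds = lm_ds.remove_columns([c for c in lm_ds.column_names if c != "text"])

# Roughly 90% of examples come from the SFT data, 10% from the plain LM corpus.
mixed = interleave_datasets([sft_ds, lm_ds], probabilities=[0.9, 0.1], seed=42)
```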

In general, for pre-training my impression is that RedPajama ideally should have included more code (e.g. it has a smaller code share than LLaMA, according to the numbers I saw) ...

andreaskoepf · Apr 21 '23 16:04