
Release of Training Data

Open • stabilize-ai opened this issue 2 years ago • 1 comment

Hi, could you please release the training data as well, to enable further research into the model's behavior? Other projects, such as EleutherAI's Pythia, have done so, which has helped attract more interest in and usage of those models.

stabilize-ai • Jun 29 '23 04:06

Sorry, we are not able to release the training data. Most of it can be found in https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T, https://pile.eleuther.ai/ and https://huggingface.co/datasets/wikipedia. We used https://github.com/google-research/text-to-text-transfer-transformer#c4 to get more C4 data.

tianxie-9 • Jun 29 '23 17:06