xgen
Release of Training Data
Hi, could you please release the training data too, to enable further research into the model's behavior? Other projects, such as EleutherAI's Pythia, have done so, which has helped those models attract more interest and usage.
Sorry, we are not able to release the training data. Most of it can be found in https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T, https://pile.eleuther.ai/, and https://huggingface.co/datasets/wikipedia. We used https://github.com/google-research/text-to-text-transfer-transformer#c4 to get more C4 data.