cerebros-core-algorithm-alpha
add-dataset-sources-to-acknowledgements
#When the text generation POC is complete and ready to merge, add credits for the training data sources to readme.md
Datasets used so far:
- https://archive.org/details/holy-bible-king-james-version-without-chapters-verses-footnotes_202307
Candidates under consideration for phase I training:
- https://huggingface.co/datasets/PleIAs/common_corpus/viewer/default/train?f%5Blanguage%5D%5Bvalue%5D=%27English%27&row=10
- https://arxiv.org/html/2506.01732v1#S3
- https://huggingface.co/datasets/HuggingFaceTB/smoltalk2/viewer/Mid/OpenThoughts3_1.2M?row=0
- https://huggingface.co/datasets/swiss-ai/apertus-pretrain-gutenberg