cerebros-core-algorithm-alpha

Published 20 hours ago


add-dataset-sources-to-acknowledgements

Open david-thrower opened this issue 2 months ago • 0 comments

# When the text generation POC is complete and ready to merge, add credits for the training data sources to readme.md

Datasets being used thus far:

https://archive.org/details/holy-bible-king-james-version-without-chapters-verses-footnotes_202307

Candidates that are being considered (for phase I training):

https://huggingface.co/datasets/PleIAs/common_corpus/viewer/default/train?f%5Blanguage%5D%5Bvalue%5D=%27English%27&row=10
https://arxiv.org/html/2506.01732v1#S3
https://huggingface.co/datasets/HuggingFaceTB/smoltalk2/viewer/Mid/OpenThoughts3_1.2M?row=0
https://huggingface.co/datasets/swiss-ai/apertus-pretrain-gutenberg
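
One way the credits could be generated once the POC is ready is sketched below. This is only an illustration: the function name `render_acknowledgements`, the `## Acknowledgements` heading, and the wording of the section are assumptions, not anything that exists in the repository. Only the dataset already in use is passed in; a candidate would be added to the list if and when it is actually adopted for training.

```python
# Hypothetical helper for drafting a readme.md acknowledgements section.
# The heading text and function name are illustrative assumptions.

# The one dataset the issue lists as being used so far.
DATASETS_IN_USE = [
    "https://archive.org/details/holy-bible-king-james-version-without-chapters-verses-footnotes_202307",
]


def render_acknowledgements(sources):
    """Return a Markdown section crediting each training data source URL."""
    if not sources:
        raise ValueError("at least one data source is required")
    lines = ["## Acknowledgements", "", "Training data sources:", ""]
    # Angle brackets turn each bare URL into a Markdown autolink.
    lines += [f"- <{url}>" for url in sources]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_acknowledgements(DATASETS_IN_USE))
```

Keeping the sources in a plain list means the acknowledgements can be regenerated whenever a candidate dataset is promoted to actual use, rather than edited by hand.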

Sep 20 '25 00:09 david-thrower

Labels

kind/documentation

triage/high-priority

kind/legal

audience/technical
