cerebros-core-algorithm-alpha

add-dataset-sources-to-acknowledgements

Open david-thrower opened this issue 2 months ago • 0 comments

When the text generation POC is complete and ready to merge, add credits for the training data sources to readme.md.

Datasets being used thus far:

  • https://archive.org/details/holy-bible-king-james-version-without-chapters-verses-footnotes_202307

Candidates under consideration for phase I training:

  • https://huggingface.co/datasets/PleIAs/common_corpus/viewer/default/train?f%5Blanguage%5D%5Bvalue%5D=%27English%27&row=10
  • https://arxiv.org/html/2506.01732v1#S3
  • https://huggingface.co/datasets/HuggingFaceTB/smoltalk2/viewer/Mid/OpenThoughts3_1.2M?row=0
  • https://huggingface.co/datasets/swiss-ai/apertus-pretrain-gutenberg

david-thrower, Sep 20 '25 00:09