cosmopedia
Is the prompt used for scoring educational content part of this repo? Did you use Mixtral to score/classify the content, or was a dedicated classifier trained?
https://github.com/huggingface/cosmopedia/blob/main/deduplication/deduplicate_dataset.py
```
2024-02-22 14:17:57.759 | INFO | datatrove.executor.slurm:launch_job:216 - Launching dependency job "mh3"
2024-02-22 14:17:57.759 | INFO | datatrove.executor.slurm:launch_job:216 - Launching dependency job "mh2"
2024-02-22 14:17:57.759 | INFO | datatrove.executor.slurm:launch_job:216...
```
Wow, this is super cool work, and thanks for open-sourcing everything!! I wonder whether Cosmopedia tried incorporating code data as seeds and rephrasing it into high-quality data? We did...
Awesome work 🙂 Is there any plan to release the training code for cosmo-1b? Or at least details about what existing repos and framework tools were used?
Couldn't reach 'HuggingFaceTB/web_clusters' on the Hub (ConnectionError). How can I solve this problem?
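For reference, a minimal sketch of loading the dataset with the standard `datasets` API, assuming the default `train` split; the retry/offline hints in the comments are general Hub troubleshooting steps, not something confirmed by the authors:

```python
from datasets import load_dataset

# Minimal sketch: pull the clusters dataset straight from the Hub.
# A ConnectionError usually means the Hub was unreachable at that moment;
# retrying later, or downloading once and then relying on the local cache
# (e.g. with HF_HUB_OFFLINE=1), are the usual workarounds.
ds = load_dataset("HuggingFaceTB/web_clusters", split="train")
print(ds)
```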
Thank you for sharing. Common benchmarks like MMLU are typically run in a 5-shot setting to measure a model's in-context learning capabilities. Can you explain why the MMLU evaluations here use a zero-shot...
Can you share the prompt you used to produce the code-scoring data? I want to build a C and C++ dataset for pre-training using this prompt. Of course, if you have already...
Thanks for your great work! From https://github.com/huggingface/cosmopedia/tree/main/evaluation#benchmark-evaluation, is this the exact command you used for evaluation? I found that most of the tasks are 0-shot, which is inconsistent with the...
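To make the 0-shot vs. 5-shot distinction concrete, here is a toy sketch; it is not the repo's actual lighteval configuration, and the question text and the `build_prompt` helper are invented purely for illustration:

```python
# Toy illustration of zero-shot vs. few-shot prompting (not the repo's
# evaluation code); the question and the worked examples are made up.
def build_prompt(question: str, fewshot_examples: list[str]) -> str:
    """Prepend k solved examples to the question; k = 0 gives a zero-shot prompt."""
    return "\n\n".join(fewshot_examples + [question])

question = "Q: Which gas makes up most of Earth's atmosphere?\nA."
examples = [f"Q: example question {i}\nA. example answer {i}" for i in range(5)]

zero_shot = build_prompt(question, [])        # 0-shot: just the bare question
five_shot = build_prompt(question, examples)  # 5-shot: five worked examples first

print(zero_shot)
print("---")
print(five_shot)
```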