Results 13 cosmopedia issues

Is the prompt used for educational-content scoring part of this repo? Did you use Mixtral to score/classify content, or was a dedicated classifier trained?

https://github.com/huggingface/cosmopedia/blob/main/deduplication/deduplicate_dataset.py
```
2024-02-22 14:17:57.759 | INFO | datatrove.executor.slurm:launch_job:216 - Launching dependency job "mh3"
2024-02-22 14:17:57.759 | INFO | datatrove.executor.slurm:launch_job:216 - Launching dependency job "mh2"
2024-02-22 14:17:57.759 | INFO | datatrove.executor.slurm:launch_job:216...
```

Wow, this is super cool work, and thanks for open-sourcing everything! I wonder whether Cosmopedia has tried using code data as seeds and rephrasing it into high-quality data. We did...

Awesome work 🙂 Is there any plan to release the training code for cosmo-1b? Or at least details about what existing repos and framework tools were used?

Couldn't reach 'HuggingFaceTB/web_clusters' on the Hub (ConnectionError). How can I solve this problem?

Thank you for sharing. Some common benchmarks like MMLU typically use a 5-shot setting to measure a model's in-context learning capabilities. Can you explain why MMLU evaluations use a zero-shot...

Can you share the prompt you used to produce the code-scoring data? I want to build a C and C++ dataset for pre-training using this prompt. Of course, if you have already...

Thanks for your great work! From https://github.com/huggingface/cosmopedia/tree/main/evaluation#benchmark-evaluation, is this the exact command you are using for evaluation? I ask because I found that most of them are 0-shot, which is inconsistent with the...