Release CodeR-Pile dataset on Hugging Face

Open NielsRogge opened this issue 7 months ago • 1 comment

Hi @545999961 🤗

I'm Niels, and I work on the open-source team at Hugging Face. I discovered your work through Hugging Face's daily papers, where it was featured: https://huggingface.co/papers/2505.12697. The paper page lets people discuss your paper and find its artifacts (your models, for instance). You can also claim the paper as yours, which makes it show up on your public HF profile, and add GitHub and project page URLs.

It's great to see the pre-trained model released on Hugging Face. Would you like to also host the CodeR-Pile dataset there? Hosting on Hugging Face gives the dataset more visibility and makes it easier to discover. We can add tags to the dataset card so that people find it when filtering https://huggingface.co/datasets, link it to the paper page, etc.

It would be awesome to make the dataset available on 🤗, so that people can do:

from datasets import load_dataset

dataset = load_dataset("your-hf-org-or-username/your-dataset")

See here for a guide: https://huggingface.co/docs/datasets/loading.

Besides that, there's the dataset viewer, which lets people quickly explore the first few rows of the data in the browser.

Let me know if you're interested or need any help with this!

Kind regards,

Niels

NielsRogge May 20 '25 14:05

Thank you for your suggestion. We will be releasing our training scripts and data in the future.

545999961 May 22 '25 10:05