Nathan Cooper
Follow existing work in the data documentation space, such as https://arxiv.org/abs/1803.09010 and https://arxiv.org/abs/2201.07311. We will base our documentation on the template from Hugging Face: https://github.com/huggingface/datasets/blob/main/templates/README.md
We should follow a process similar to the BigScience workshop's dataset processing. It includes many tools that are ready for us to use, such as data deduplication, both exact match...
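To make the exact-match case concrete, here is a minimal sketch of hash-based deduplication. It is not the BigScience implementation: the `normalize` helper and the choice of SHA-256 over whitespace-collapsed text are assumptions made for illustration.

```python
import hashlib


def normalize(text: str) -> str:
    # Hypothetical normalization step: collapse runs of whitespace so that
    # documents differing only in spacing hash to the same value.
    return " ".join(text.split())


def dedup_exact(docs):
    """Keep the first occurrence of each document, comparing by content hash."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique
```

Storing digests rather than full texts keeps memory use roughly constant per document, which matters for large scraped corpora.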
## Mailing Lists

Dataset URL -
* [git](https://git-scm.com/community)
* [python](https://www.python.org/community/lists/)
* [mailing list archives](https://www.mail-archive.com/)

Does the dataset exist in a scraped format? No

## Description

In general. (Almost) every...
## GitHub Issues

Dataset URL - [here](https://huggingface.co/datasets/lewtun/github-issues)

Does the dataset exist in a scraped format? URL if Yes - [here](https://huggingface.co/datasets/lewtun/github-issues) (only for the HF datasets repository)

## Description

GitHub Issues...
## Discourse Forums

Dataset URL - [here](https://meta.discourse.org/t/listing-of-all-discourse-forums/113669)

Does the dataset exist in a scraped format? No

## Description

Discourse is a self-hosting platform for communities to create discussions around...
Questions:
* How do we want to store data in an intermediate format before moving it to the lm_dataformat that uses json lists?
* Do we even want an intermediate...
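As one possible answer to the first question, an intermediate format could be plain JSON Lines, with one record per line. The sketch below uses only the standard library; the `text` and `meta` field names are hypothetical, and nothing here reflects the lm_dataformat API.

```python
import json


def write_jsonl(records, fh):
    # One JSON object per line; ensure_ascii=False keeps non-ASCII text readable.
    for record in records:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(fh):
    # Yield records lazily so large files need not fit in memory.
    for line in fh:
        line = line.strip()
        if line:
            yield json.loads(line)
```

A line-oriented format like this can be appended to during scraping and streamed during later conversion.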
## Zulip Discussions

Dataset URL - [here](https://coq.gitlab.io/zulip-archive/)

Does the dataset exist in a scraped format? No

## Description

Zulip is a real-time chat application for self-hosting or cloud based...
## Gitter Discussions

Dataset URL - [here](https://gitter.im/explore/tags/?action=explore&source=homepage)

Does the dataset exist in a scraped format? No

## Description

Gitter is a chat and networking platform that helps to manage,...
This issue focuses on collecting ideas and formalizing the postprocessing steps and formatting of data instances for datasets in different categories, e.g., forums, articles, books, etc. Initial draft of postprocessing:...
# What does this PR do?

The perceiver resampler in idefics2 initializes its latents as ones, but they should be initialized randomly from a Gaussian distribution (see this implementation:...
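The difference between the two initializations can be illustrated with a small standard-library sketch. This is not the idefics2 code: `init_latents`, its parameters, and the default standard deviation are hypothetical. One likely reason the all-ones scheme is undesirable is that every latent vector starts out identical, so the latents are indistinguishable from one another at initialization.

```python
import random


def init_latents(n_latents, dim, mode="gaussian", std=1.0, seed=0):
    # Hypothetical helper: returns an n_latents x dim matrix as nested lists.
    if mode == "ones":
        return [[1.0] * dim for _ in range(n_latents)]
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(dim)] for _ in range(n_latents)]


def all_rows_identical(latents):
    # True when every latent vector equals the first one.
    return all(row == latents[0] for row in latents)
```

With `mode="ones"` every row is the same vector, while Gaussian sampling gives each latent a distinct starting point.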