data_tooling
Tools for managing datasets for governance and training.
updates:
- [github.com/pre-commit/pre-commit-hooks: v4.2.0 → v4.6.0](https://github.com/pre-commit/pre-commit-hooks/compare/v4.2.0...v4.6.0)
- [github.com/asottile/pyupgrade: v2.32.1 → v3.15.2](https://github.com/asottile/pyupgrade/compare/v2.32.1...v3.15.2)
- [github.com/psf/black: 22.3.0 → 24.4.2](https://github.com/psf/black/compare/22.3.0...24.4.2)
- [github.com/Lucas-C/pre-commit-hooks: v1.1.14 → v1.5.5](https://github.com/Lucas-C/pre-commit-hooks/compare/v1.1.14...v1.5.5)
- [github.com/shellcheck-py/shellcheck-py: v0.8.0.4 → v0.10.0.1](https://github.com/shellcheck-py/shellcheck-py/compare/v0.8.0.4...v0.10.0.1)
The hardcoded wiki dates are outdated. This PR updates the dates and stores the value in a variable (`DEFAULT_WIKI_DATE`) so it can be easily changed as necessary.
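As a minimal sketch of the pattern described here (not the PR's actual code), a single module-level default can replace dates scattered through the script. The date value, the helper name `wiki_dump_url`, and the URL layout are assumptions chosen for illustration.

```python
# Hypothetical default; the real value used in the repository may differ.
DEFAULT_WIKI_DATE = "20220301"


def wiki_dump_url(lang: str, date: str = DEFAULT_WIKI_DATE) -> str:
    """Build a dump URL for one language edition (illustrative helper).

    The URL layout is assumed to follow the public Wikimedia dumps convention.
    """
    return (
        f"https://dumps.wikimedia.org/{lang}wiki/{date}/"
        f"{lang}wiki-{date}-pages-articles.xml.bz2"
    )


# Callers rely on the default, or override it for a specific dump.
default_url = wiki_dump_url("en")
pinned_url = wiki_dump_url("fr", date="20230101")
```

With this layout, moving to a newer dump means editing one constant rather than every call site.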
Hi, we are very inspired by this great work and are in the process of cleaning our data. However, if we understand correctly, the `remove_non_printing_characters` normalization step is not used...
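For readers unfamiliar with this kind of normalization, the sketch below shows one common way to strip non-printing characters. It is an illustration only, not necessarily how the repository implements the step: the function name `strip_non_printing` is hypothetical, and the choice of the C0/C1 control ranges plus DEL is an assumption.

```python
import re

# Assumed definition of "non-printing": C0 controls (U+0000-U+001F),
# DEL (U+007F) and C1 controls (U+0080-U+009F).
# Note that this range also covers tab and newline.
NON_PRINTING_RE = re.compile(r"[\x00-\x1f\x7f-\x9f]")


def strip_non_printing(text: str) -> str:
    """Return `text` with every character in the assumed ranges removed."""
    return NON_PRINTING_RE.sub("", text)


cleaned = strip_non_printing("clean\x00 me\x7f up")
```

Because the ranges include `\n` and `\t`, a pipeline that needs to preserve line structure would have to exclude those characters from the pattern.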
Hello, we are using this resource to filter pretraining data for our current project, and we would love to know if and how it should be cited. Thanks :)
- uid: hal_archives_ouvertes - type: primary - description: - name: HAL archives ouvertes - description: HAL is an open archive where authors can deposit scholarly documents from all academic fields....
- uid: african_union_website - type: primary - description: - name: African Union website - description: The African Union (AU) is a continental body consisting of the 55 member states that...
As discussed with @thomasw21, this PR adds basic Slurm and Python scripts to compute an intermediary metadata dataset and some statistics for the Pseudo Crawl dataset.
- uid: bloom_library - type: primary - description: - name: Bloom Library - description: SIL International’s innovative Bloom software eases the process of bookmaking so that more people can participate...
- uid: multilingual_knowledge_questions_answers - type: processed - description: - name: Multilingual Knowledge Questions & Answers - description: MKQA is an open-domain question answering evaluation set comprising question-answer pairs aligned across...
- uid: galileo_open_learning_materials - type: primary - description: - name: Galileo Open Learning Materials - description: GALILEO Open Learning Materials brings together open educational resources throughout the University System of...