data_tooling

Tools for managing datasets for governance and training.

Results 100 data_tooling issues

updates:
- [github.com/pre-commit/pre-commit-hooks: v4.2.0 → v4.6.0](https://github.com/pre-commit/pre-commit-hooks/compare/v4.2.0...v4.6.0)
- [github.com/asottile/pyupgrade: v2.32.1 → v3.15.2](https://github.com/asottile/pyupgrade/compare/v2.32.1...v3.15.2)
- [github.com/psf/black: 22.3.0 → 24.4.2](https://github.com/psf/black/compare/22.3.0...24.4.2)
- [github.com/Lucas-C/pre-commit-hooks: v1.1.14 → v1.5.5](https://github.com/Lucas-C/pre-commit-hooks/compare/v1.1.14...v1.5.5)
- [github.com/shellcheck-py/shellcheck-py: v0.8.0.4 → v0.10.0.1](https://github.com/shellcheck-py/shellcheck-py/compare/v0.8.0.4...v0.10.0.1)

The hardcoded wiki dates are outdated. This updates the dates and stores the value in a variable (`DEFAULT_WIKI_DATE`) so it can be easily changed as necessary.
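The general pattern described here, replacing a date repeated in several places with one named constant, might look roughly like the sketch below. Only the name `DEFAULT_WIKI_DATE` comes from the description; the date value, the `wiki_dump_name` helper, and the identifier format are hypothetical placeholders, not the repository's actual code.

```python
from typing import Optional

# Hypothetical placeholder value; the date actually used in the PR is not shown here.
DEFAULT_WIKI_DATE = "20220301"


def wiki_dump_name(language: str, date: Optional[str] = None) -> str:
    """Build an illustrative dump identifier of the form '<date>.<language>'.

    Falls back to DEFAULT_WIKI_DATE when no date is given, so the date
    only needs to be updated in one place.
    """
    return f"{date or DEFAULT_WIKI_DATE}.{language}"
```

Callers that need a specific dump can still pass `date=...` explicitly, while everything else follows the single default.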

Hi, we are greatly inspired by this great work and are in the process of cleaning our data. However, if we understand correctly, the `remove_non_printing_characters` normalization step is not used...

Hello, we are using this resource to filter pretraining data for our current project, and we would love to know if and how it should be cited. Thanks :)

- uid: hal_archives_ouvertes
- type: primary
- description:
  - name: HAL archives ouvertes
  - description: HAL is an open archive where authors can deposit scholarly documents from all academic fields....

data catalog

- uid: african_union_website
- type: primary
- description:
  - name: African Union website
  - description: The African Union (AU) is a continental body consisting of the 55 member states that...

data catalog

As discussed with @thomasw21, this PR adds basic Slurm and Python scripts to compute an intermediary metadata dataset and some statistics for the Pseudo Crawl dataset.

- uid: bloom_library
- type: primary
- description:
  - name: Bloom Library
  - description: SIL International’s innovative Bloom software eases the process of bookmaking so that more people can participate...

data catalog
need custodian permission
language modeling script

- uid: multilingual_knowledge_questions_answers
- type: processed
- description:
  - name: Multilingual Knowledge Questions & Answers
  - description: MKQA is an open-domain question answering evaluation set comprising question-answer pairs aligned across...

data catalog
language modeling script

- uid: galileo_open_learning_materials
- type: primary
- description:
  - name: Galileo Open Learning Materials
  - description: GALILEO Open Learning Materials brings together open educational resources throughout the University System of...

data catalog
data format
language modeling script