data_tooling
Tools for managing datasets for governance and training.
Subsets of The Pile:
- pubmed
- ubuntu_irc
- europarl
- hacker_news
- nih_exporter
Subset of The Pile. FreeLaw: Good as-is. I have acquired permission to use this from the organization that owns the data (reported by @StellaAthena).
- uid: theses_on_line
- type: primary
- description:
  - name: Theses on Line
  - description: Created in 2001, TEL (Theses-on-Line) is dedicated to the self-archiving of theses and HDRs (accreditations...

- uid: libre_commons
- type: primary
- description:
  - name: LibreCommons
  - description: LibreCommons hosts curated Open Educational Resources from all 14 LibreTexts libraries in one convenient location. LibreCommons, the...

Source: [Masader Project](https://arbml.github.io/masader/)
- uid: talaa
- entry: https://arbml.github.io/masader/card.html?54
- Link: https://github.com/saidziani/Arabic-News-Article-Classification
- License: unknown
- Year: 2015
- Language: ar
- Dialect: ar-MSA: (Arabic (Modern Standard Arabic))
- ...

Source: [Masader Project](https://arbml.github.io/masader/)
- uid: arabic_online_commentary
- entry: https://arbml.github.io/masader/card.html?39
- Link: https://github.com/sjeblee/AOC
- License: unknown
- Year: 2011
- Language: ar
- Dialect: other
- Domain: news articles
- ...

Source: [Masader Project](https://arbml.github.io/masader/)
- uid: osian
- entry: https://arbml.github.io/masader/card.html?25
- Link: http://oujda-nlp-team.net/en/corpora/osian-corpus/
- License: CC BY-NC 4.0
- Year: 2019
- Language: ar
- Dialect: other
- Domain: news...

- uid: wikihow_vietnamese_human_instructions
- type: processed
- description:
  - name: wikiHow Vietnamese Human Instructions
  - description: Step-by-step instructions in Vietnamese extracted from wikiHow and decomposed into a formal graph representation...

- uid: vicon_visim400
- type: processed
- description:
  - name: Vietnamese Datasets for Evaluating Semantic Models of (Dis-)Similarity and Relatedness (ViCon and ViSim-400)
  - description: This dataset consists of two...

- uid: ahotsak
- type: primary
- description:
  - name: ahotsak
  - description: Catalogue of Basque Oral Heritage, interviews with elderly people about their experiences.
- homepage: https://ahotsak.eus/
- validated:...
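
The catalogue entries above share a simple `- key: value` layout, so the flat fields can be read into a dictionary for filtering by license, language, or year. The following is a minimal sketch, not the repository's own tooling: `parse_flat_entry` is a hypothetical helper, and the sample string is an abbreviated copy of the `osian` entry with the truncated fields left out. It only handles flat fields and does not reconstruct the nested `description` blocks.

```python
import re


def parse_flat_entry(text: str) -> dict[str, str]:
    """Parse a '- key: value' catalogue entry into a dict (hypothetical helper).

    Keys are lower-cased; anything without a colon (such as a trailing
    '...') is skipped rather than guessed at.
    """
    fields: dict[str, str] = {}
    # A hyphen surrounded by whitespace separates fields; hyphens inside
    # values such as 'CC BY-NC 4.0' have no surrounding whitespace and survive.
    for part in re.split(r"\s+-\s+", text.strip()):
        key, sep, value = part.partition(":")
        if not sep:
            continue
        fields[key.strip().lower()] = value.strip()
    return fields


# Abbreviated from the osian entry above; truncated fields are omitted.
sample = (
    "uid: osian - entry: https://arbml.github.io/masader/card.html?25 "
    "- License: CC BY-NC 4.0 - Year: 2019 - Language: ar"
)
entry = parse_flat_entry(sample)
print(entry["uid"], entry["license"], entry["year"])
```

Because the split pattern accepts any whitespace around the hyphen, the same helper works whether an entry is written on one line or as one field per line.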