Language Resources and Tools
Datasets and scripts for basic natural language and speech processing.
This is not an official Google product.
Natural Languages
Directory | Language Available |
---|---|
af | Afrikaans |
bn | Bengali / Bangla |
hi_ur | Hindi & Urdu |
is | Icelandic |
jv | Javanese |
km | Khmer |
lo | Lao |
my | Burmese / Myanmar |
ne | Nepali |
si | Sinhala |
su | Sundanese |
xh | Xhosa |
zu | Zulu |
Tools
We include a few tools for working with the natural language datasets. These tools are written in C++ and Python and are built with Bazel. To compile and use them, install a recent version of Bazel (release 0.4.5 or later is required).
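As a rough illustration of the version requirement above, the following Python sketch checks whether a Bazel installation is at least release 0.4.5. It is not part of this repository: the function names are hypothetical, and it assumes that `bazel version` prints a line of the form `Build label: X.Y.Z`.

```python
import re
import subprocess

# Minimum Bazel release named in this README.
MIN_BAZEL = (0, 4, 5)


def parse_build_label(version_output):
    """Return the (major, minor, patch) tuple from `bazel version` output.

    Assumes the output contains a line such as "Build label: 0.4.5".
    """
    match = re.search(r"Build label:\s*(\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError("no 'Build label: X.Y.Z' line found")
    return tuple(int(part) for part in match.groups())


def meets_minimum(version_output, minimum=MIN_BAZEL):
    """True if the reported Bazel release is at least `minimum`."""
    # Tuples of ints compare numerically, so 0.10.0 ranks above 0.4.5.
    return parse_build_label(version_output) >= minimum


def installed_bazel_output():
    """Run `bazel version` and return its standard output (hypothetical helper)."""
    result = subprocess.run(
        ["bazel", "version"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

With Bazel on the `PATH`, `meets_minimum(installed_bazel_output())` would report whether the installed release is recent enough. Comparing integer tuples rather than raw strings avoids ranking `0.10.0` below `0.4.5`.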
Open-Sourced Audio Data
Resource | Link |
---|---|
Sinhala TTS recordings (~3K) | https://www.openslr.org/30/ |
TTS recordings for four South African languages (af, st, tn, xh) | https://www.openslr.org/32/ |
Large Javanese ASR training data set (~185K) | https://www.openslr.org/35/ |
Large Sundanese ASR training data set (~220K) | https://www.openslr.org/36/ |
High-quality TTS data for Bengali languages | https://www.openslr.org/37/ |
High-quality TTS data for Javanese | https://www.openslr.org/41/ |
High-quality TTS data for Khmer | https://www.openslr.org/42/ |
High-quality TTS data for Nepali | https://www.openslr.org/43/ |
High-quality TTS data for Sundanese | https://www.openslr.org/44/ |
Large Sinhala ASR training data set | https://www.openslr.org/52/ |
Large Bengali ASR training data set | https://www.openslr.org/53/ |
Large Nepali ASR training data set | https://www.openslr.org/54/ |
Crowdsourced high-quality Argentinian Spanish speech data set | https://www.openslr.org/61/ |
Crowdsourced high-quality Malayalam multi-speaker speech data set | https://www.openslr.org/63/ |
Crowdsourced high-quality Marathi multi-speaker speech data set | https://www.openslr.org/64/ |
Crowdsourced high-quality Tamil multi-speaker speech data set | https://www.openslr.org/65/ |
Crowdsourced high-quality Telugu multi-speaker speech data set | https://www.openslr.org/66/ |
Data set which contains recordings of Catalan | https://www.openslr.org/69 |
Crowdsourced high-quality Nigerian English speech data set | https://www.openslr.org/70 |
Crowdsourced high-quality Chilean Spanish speech data set | https://www.openslr.org/71 |
Crowdsourced high-quality Colombian Spanish speech data set | https://www.openslr.org/72 |
Crowdsourced high-quality Peruvian Spanish speech data set | https://www.openslr.org/73 |
Crowdsourced high-quality Puerto Rico Spanish speech data set | https://www.openslr.org/74 |
Crowdsourced high-quality Venezuelan Spanish speech data set | https://www.openslr.org/75 |
Crowdsourced high-quality Basque speech data set | https://www.openslr.org/76 |
Crowdsourced high-quality Galician speech data set | https://www.openslr.org/77 |
Crowdsourced high-quality Gujarati multi-speaker speech data set | https://www.openslr.org/78 |
Crowdsourced high-quality Kannada multi-speaker speech data set | https://www.openslr.org/79 |
Crowdsourced high-quality Burmese speech data set | https://www.openslr.org/80 |
Data set which contains male and female recordings of English from various dialects of the UK and Ireland | https://www.openslr.org/83 |
Crowdsourced high-quality Yoruba speech data set | https://www.openslr.org/86 |
Other reading resources
SLTU 2016 Tutorial - https://sites.google.com/site/sltututorial/overview
Publications
- Crowdsourcing Latin American Spanish for Low-Resource Text-to-Speech
- Open-source Multi-speaker Corpora of the English Accents in the British Isles
- Open-Source High Quality Speech Datasets for Basque, Catalan and Galician
- Text Normalization for Bangla, Khmer, Nepali, Javanese, Sinhala, and Sundanese TTS Systems
- FonBund: A Library for Combining Cross-lingual Phonological Segment Data
- Building Open Javanese and Sundanese Corpora for Multilingual Text-to-Speech
- Rapid development of TTS corpora for four South African languages
- Building Statistical Parametric Multi-speaker Synthesis for Bangladeshi Bangla
License
Unless otherwise noted, all original files are licensed under the Apache License, Version 2.0.
Where specifically noted, some datasets are licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0).
The directory third_party/ contains third-party works, which we are including under the respective licenses of the upstream projects. See third_party/README.md for further details.