MaSS - Multilingual corpus of Sentence-aligned Spoken utterances

This is the repository for the CMU multilingual speech extension data set presented in the paper entitled MaSS: A Large and Clean Multilingual Corpus of Sentence-aligned Spoken Utterances Extracted from the Bible.

Data

For copyright reasons, we are not allowed to share the audio files. However, we provide the extraction pipeline below, which can also be applied to new languages of interest. Inside the dataset folder, for each language we provide:

Alignment textgrids (from Maus forced aligner)
Final textual output and segments textgrids
Mel Filterbank Spectrograms (as used in the paper's experiments)

Pipeline

1) Downloading audio chapters from bible.is.

1.1. The audio files used in our work are available at the following links:

1.2. The audio files were converted from multi-channel to single-channel and force-aligned using this script.

1.3. The raw chapter text files are no longer available for download from the website, so we provide them at dataset/LANGUAGE/raw_txt/. For new languages, chapter text files can be extracted from this webpage. These chapter-level .txt files should be placed in the same folder as the audio files.
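As an illustration of the channel conversion in step 1.2, here is a minimal sketch that downmixes a multi-channel file to mono by averaging the channels. It is not the repository's script: it assumes the audio has already been decoded to 16-bit PCM WAV, and the function names `downmix_to_mono` and `convert_wav_to_mono` are hypothetical.

```python
import array
import sys
import wave


def downmix_to_mono(samples, n_channels):
    """Average interleaved 16-bit PCM samples across channels."""
    if n_channels < 1:
        raise ValueError("n_channels must be at least 1")
    if len(samples) % n_channels != 0:
        raise ValueError("sample count is not a multiple of n_channels")
    mono = array.array("h")
    for i in range(0, len(samples), n_channels):
        frame = samples[i:i + n_channels]  # one sample per channel
        mono.append(sum(frame) // n_channels)
    return mono


def convert_wav_to_mono(src_path, dst_path):
    """Read a 16-bit PCM WAV file and write a single-channel copy."""
    with wave.open(src_path, "rb") as src:
        if src.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        n_channels = src.getnchannels()
        rate = src.getframerate()
        samples = array.array("h")
        samples.frombytes(src.readframes(src.getnframes()))
    # WAV data is little-endian; array uses the machine's byte order.
    if sys.byteorder == "big":
        samples.byteswap()
    mono = downmix_to_mono(samples, n_channels)
    if sys.byteorder == "big":
        mono.byteswap()
    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)
        dst.setframerate(rate)
        dst.writeframes(mono.tobytes())
```

Averaging keeps the result within the 16-bit range without clipping; other downmix strategies (for example, keeping only one channel) are equally possible.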

2) Aligning the data with Maus forced aligner

For the covered languages, we make available the output from the Maus forced aligner in LANGUAGE/maus_textgrid/. For new languages, please check the Website.

3) Obtaining speech alignment on a verse level

For each language, the audio files were sliced into verses using the chapter text files from 1.3 and the textgrids generated in step 2. More details are available here.
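The slicing idea in this step can be sketched as follows. This is a simplified illustration, not the repository's code: it assumes the verse boundaries have already been read from the textgrids as (verse ID, start, end) tuples in seconds, and the function name `slice_by_intervals` and the verse IDs in the usage example are hypothetical.

```python
def slice_by_intervals(samples, sample_rate, intervals):
    """Cut a mono sample sequence into one segment per verse.

    intervals: iterable of (verse_id, start_seconds, end_seconds).
    Returns a dict mapping each verse_id to its slice of samples.
    """
    segments = {}
    for verse_id, start_s, end_s in intervals:
        if end_s <= start_s:
            raise ValueError(f"empty or inverted interval for {verse_id}")
        # Convert times in seconds to sample indices.
        start = int(round(start_s * sample_rate))
        end = int(round(end_s * sample_rate))
        segments[verse_id] = samples[start:end]
    return segments


# Usage with toy data: 10 samples at 2 samples per second.
toy = list(range(10))
verses = slice_by_intervals(toy, 2, [("v1", 0.0, 2.0), ("v2", 2.0, 5.0)])
```

Each returned segment can then be written to its own audio file, giving one file per verse.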

4) ID equivalence across languages

To translate the IDs into English, we provide the simple Python script below.

python3 scripts/fetch_data.py <language folder> <output folder> <language code>

5) Generate a CSV file listing the verses available for each language

Use this script to generate a CSV file listing the verses available for each language. As not all the verses of a given language exist in every other language, this CSV file can be used to get a list of verses common to all languages.
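To illustrate how such a listing yields the verses common to all languages, here is a minimal sketch based on set intersection. The CSV layout (one row per verse, one 0/1 column per language), the function names, and the verse IDs are assumptions made for this example and may differ from what the repository's script produces.

```python
import csv
import io


def common_verses(verses_by_language):
    """Return the sorted verse IDs present in every language."""
    verse_sets = [set(v) for v in verses_by_language.values()]
    if not verse_sets:
        return []
    return sorted(set.intersection(*verse_sets))


def availability_rows(verses_by_language):
    """Build table rows: a header, then one row per verse with 0/1 flags."""
    languages = sorted(verses_by_language)
    all_verses = sorted(set().union(*verses_by_language.values()))
    rows = [["verse_id"] + languages]
    for verse in all_verses:
        flags = [int(verse in set(verses_by_language[lang])) for lang in languages]
        rows.append([verse] + flags)
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


# Usage with made-up verse IDs for two languages.
example = {"en": ["b.01.001", "b.01.002"], "fr": ["b.01.002", "b.01.003"]}
shared = common_verses(example)
table = rows_to_csv(availability_rows(example))
```

A verse is common to all languages exactly when every flag in its row is 1, which is what `common_verses` computes directly.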

Paper Experiments

The speech-to-speech retrieval baseline model proposed in the paper is available here.

Citation

If you use this corpus in your experiments, please use the following BibTeX entry for citation:

@inproceedings{zanon-boito-etal-2020-mass,
    title = {{M}a{SS}: {A} {L}arge and {C}lean {M}ultilingual {C}orpus of {S}entence-aligned {S}poken {U}tterances {E}xtracted from the {B}ible},
    author = {Zanon Boito*, Marcely and Havard*, William and Garnerin, Mahault and Le Ferrand, Éric and Besacier, Laurent},
    booktitle = {Proceedings of the 12th Language Resources and Evaluation Conference},
    month = may,
    year = {2020},
    address = {Marseille, France},
    publisher = {European Language Resources Association},
    url = {https://aclanthology.org/2020.lrec-1.799},
    pages = {6486--6493},
    language = {English},
    isbn = {979-10-95546-34-4},
}

Team and Contact

The people behind the project:

Marcely ZANON BOITO
William N. HAVARD
Mahault GARNERIN
Eric Le FERRAND
Laurent BESACIER

You can contact them at [email protected]
