DUAL-textless-SQA
This repository is the official implementation of the paper DUAL: Discrete Spoken Unit Adaptive Learning for Textless Spoken Question Answering, and the official release of the Natural Multi-speakers Spoken Question Answering (NMSQA) dataset.
Installation
Model
Dataset
Download the NMSQA dataset
Data Preparation for Original Dataset
Preprocessed data link (including passage merging and unit-level labels, updated with question code): [link]
- Directory format
  - train
  - dev
  - test
- Files
  - For train and dev split
    - `{split}-answer-span.csv`: answer time span in seconds
    - `meta-{split}.csv`: the duration, speaker, and transcription of each utterance
    - `{split}-textgrid.tar.gz`: forced alignment of each utterance
    - `{split}_audio.tar.gz`: utterance waveform files
    - `{split}_hash2question.json`: maps the hash value to the question id
  - For test split
    - `lxt_sqa.tar.gz`: contains all audio files in `audio` and transcriptions
    - `meta-lxt.csv`: the duration, speaker, and transcription of each utterance
    - `test/test-SQuAD/test-SQuAD-answer-span.csv`: the answer span in the test-SQuAD split
    - `test/test-OOD/test-OOD-answer-span.csv`: the answer span in the test-OOD split
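For a quick look at these files, here is a minimal sketch using pandas. Paths and the exact CSV headers are assumptions; check the extracted archives for the actual layout.

```python
# Minimal sketch: inspect the train split metadata and answer spans.
# Directory layout and column contents are assumptions; adapt as needed.
import pandas as pd

split = "train"
meta = pd.read_csv(f"{split}/meta-{split}.csv")          # duration, speaker, transcription
spans = pd.read_csv(f"{split}/{split}-answer-span.csv")  # answer time spans in seconds

print(meta.head())
print(spans.head())
```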
NOTE: Currently each spoken passage is split into segments of utterances. For the standard QA task, you should merge the segments back into whole passages; see the merging sketch after the list below. The suffix `-1`, `-2`, ..., `-n` is the segment number within a specific passage.
- Speech Content Encoder
  Please see details in `speeech-content-encoder`.
- Pre-process the QA labels (for the train and dev splits)

  python code_answer.py
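As noted above, the per-utterance segments can be merged back into whole passages. A minimal sketch, assuming waveform files named `<passage_id>-<n>.wav` in the extracted audio directory (naming and paths are assumptions; adapt to the actual layout):

```python
# Minimal sketch: merge per-utterance segment wavs back into whole passages.
# Assumes files named <passage_id>-<segment_number>.wav under the extracted
# {split}_audio directory; adjust names and paths to the actual release.
import re
from collections import defaultdict
from pathlib import Path

import torch
import torchaudio

audio_dir = Path("train_audio")          # from train_audio.tar.gz
out_dir = Path("train_audio_merged")
out_dir.mkdir(exist_ok=True)

groups = defaultdict(list)
for wav in audio_dir.glob("*.wav"):
    m = re.match(r"(.+)-(\d+)$", wav.stem)  # split "<passage_id>-<n>"
    if m:
        groups[m.group(1)].append((int(m.group(2)), wav))

for passage_id, segs in groups.items():
    segs.sort()  # order segments by their -n suffix
    waves, rates = zip(*(torchaudio.load(str(p)) for _, p in segs))
    merged = torch.cat(waves, dim=1)  # concatenate along the time axis
    torchaudio.save(str(out_dir / f"{passage_id}.wav"), merged, rates[0])
```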
Parquet & Hugging Face dataset format
The dataset basically follows the same file format as the original SQuAD, with the following extra fields:
{
    "id": same as SQuAD,
    "title": same as SQuAD,
    "context": same as SQuAD,
    "question": same as SQuAD,
    "answers": {
        "answer_start": same as SQuAD,
        "audio_full_answer_end": [],        audio answer end positions in seconds
        "audio_full_answer_start": [],      audio answer start positions in seconds
        "audio_full_neg_answer_end": [],    end positions in seconds of spans that use the same words but are not the correct answer
        "audio_full_neg_answer_start": [],  start positions in seconds of those negative spans
        "audio_segment_answer_end": [],     answer end positions relative to the segment audio
        "audio_segment_answer_start": [],   answer start positions relative to the segment audio
        "text": same as SQuAD
    },
    "content_segment_audio_path": segment audio path,
    "content_full_audio_path": complete audio path,
    "content_audio_sampling_rate": audio sampling rate,
    "content_audio_speaker": audio speaker,
    "content_segment_text": segment text,
    "content_segment_normalized_text": normalized text used for generating the audio,
    "question_audio_path": question audio path,
    "question_audio_sampling_rate": audio sampling rate,
    "question_audio_speaker": audio speaker,
    "question_normalized_text": normalized text used for generating the audio
}
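A minimal sketch of loading the Parquet release with the Hugging Face `datasets` library; the file names below are placeholders for the downloaded Parquet files:

```python
# Minimal sketch: load the Parquet release with Hugging Face `datasets`.
from datasets import load_dataset

ds = load_dataset(
    "parquet",
    data_files={"train": "train.parquet", "dev": "dev.parquet"},  # placeholder paths
)

sample = ds["train"][0]
print(sample["question_audio_path"])
print(sample["answers"]["audio_full_answer_start"])  # answer starts in seconds
```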
Training
python train.py --exp_name [exp name] --config baseline.yaml
Evaluation
python evaluate.py --data_dir [data dir path] --model_path [model checkpoint dir] --output_dir [output dir path] --out_fname [output name]
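The metrics below are Frame-level F1 (FF1) and Audio Overlapping Score (AOS), both computed between predicted and ground-truth time spans. For intuition only, a minimal sketch of span-overlap metrics in this spirit; the official numbers come from `evaluate.py` above, whose exact formulation may differ:

```python
# Minimal sketch of span-overlap metrics over (start, end) time spans in seconds.
def overlap(pred, gold):
    """Length of the intersection of two (start, end) spans."""
    return max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))

def frame_f1(pred, gold):
    """F1 between spans, treating time frames as the prediction unit."""
    inter = overlap(pred, gold)
    if inter == 0.0:
        return 0.0
    precision = inter / (pred[1] - pred[0])
    recall = inter / (gold[1] - gold[0])
    return 2 * precision * recall / (precision + recall)

def aos(pred, gold):
    """Audio Overlapping Score: intersection over union of the two spans."""
    inter = overlap(pred, gold)
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0

print(frame_f1((1.0, 3.0), (2.0, 4.0)))  # 0.5
print(aos((1.0, 3.0), (2.0, 4.0)))       # 1/3
```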
Results
| Discrete unit | PLM | dev FF1 | dev AOS | test FF1 | test AOS |
|---|---|---|---|---|---|
| HuBERT-64 | Longformer | 47.8 | 42.4 | 39.0 | 33.0 |
| HuBERT-128 | Longformer | 54.2 | 48.5 | 56.0 | 49.1 |
| HuBERT-512 | Longformer | 55.0 | 49.6 | 17.3 | 12.5 |
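The "Discrete unit" column refers to HuBERT features quantized by k-means, where 64/128/512 is the number of clusters. A minimal sketch of that quantization step, using torchaudio's HuBERT bundle and scikit-learn as stand-ins for the repo's `speeech-content-encoder` pipeline (the layer index, paths, and per-utterance fitting are assumptions for illustration):

```python
# Minimal sketch: discretize speech into HuBERT units via k-means clustering.
import torch
import torchaudio
from sklearn.cluster import MiniBatchKMeans

bundle = torchaudio.pipelines.HUBERT_BASE
model = bundle.get_model().eval()

waveform, sr = torchaudio.load("passage.wav")  # placeholder path
waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

with torch.inference_mode():
    features, _ = model.extract_features(waveform)  # list of per-layer features
feats = features[6].squeeze(0)  # intermediate layer; the index is an assumption

# In practice the k-means codebook is fit on features from the whole corpus,
# not a single utterance; this only illustrates the API.
kmeans = MiniBatchKMeans(n_clusters=128).fit(feats.numpy())  # HuBERT-128
units = kmeans.predict(feats.numpy())  # one unit id per ~20 ms frame
print(units[:20])
```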
Contact
Guan-Ting Lin (Email: [email protected])
Eric Lam (Email: [email protected])
Citation
@inproceedings{lin22c_interspeech,
author={Guan-Ting Lin and Yung-Sung Chuang and Ho-Lam Chung and Shu-wen Yang and Hsuan-Jui Chen and Shuyan Annie Dong and Shang-Wen Li and Abdelrahman Mohamed and Hung-yi Lee and Lin-shan Lee},
title={{DUAL: Discrete Spoken Unit Adaptive Learning for Textless Spoken Question Answering}},
year=2022,
booktitle={Proc. Interspeech 2022},
pages={5165--5169},
doi={10.21437/Interspeech.2022-612}
}