GigaSpeech
Large, modern dataset for speech recognition
Hi sir, does GigaSpeech provide a glm file, like swbd's en20000405_hub5.glm, containing the transcript filtering rules? I notice there are some rules in the gigaspeech_scoring.py file. But do you have the...
Fixing the bug mentioned in https://github.com/SpeechColab/GigaSpeech/issues/103
Hello, the download failed because the network connection dropped while the audio data was downloading. How can I resume the download from the point where it stopped?
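Resuming an interrupted transfer is usually done with an HTTP `Range` request. The following is a generic sketch of that technique, not the behavior of the GigaSpeech download scripts: the function names are hypothetical, and it assumes the files are served over plain HTTP(S) by a server that supports byte ranges.

```python
import os
import urllib.request


def resume_headers(local_path):
    """Return HTTP headers asking the server to skip bytes already on disk."""
    done = os.path.getsize(local_path) if os.path.exists(local_path) else 0
    return {"Range": f"bytes={done}-"} if done > 0 else {}


def resume_download(url, local_path, chunk_size=1 << 20):
    """Continue downloading `url` into `local_path` from where it stopped."""
    request = urllib.request.Request(url, headers=resume_headers(local_path))
    with urllib.request.urlopen(request) as response:
        # 206 means the server honoured the Range header; 200 means it is
        # sending the whole file again, so overwrite instead of appending.
        mode = "ab" if response.status == 206 else "wb"
        with open(local_path, mode) as out:
            while True:
                chunk = response.read(chunk_size)
                if not chunk:
                    break
                out.write(chunk)
```

If the partial file is already complete, a server will typically answer with status 416, which `urlopen` raises as an `HTTPError`; a caller would need to treat that case as "nothing left to download".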
GigaSpeech dataset is now available on HuggingFace Hub.

---

### Highlights of GigaSpeech on HuggingFace

* easy to use (a two-liner in python)
* Smoother and faster downloading from US...
Hi, As mentioned in the README, GigaSpeech contains "33,000+ hours for unsupervised/semi-supervised learning". I am trying to use these unlabeled data, and I have already downloaded the XL subset. But...
Hello, I saw that `sample_rate=16000` in `GigaSpeech.Json` does not match the rate reported for the opus file, `SR=48000`:
```
ffmpeg -i /workspace/datasets/GigaSpeech_corpus/audio/podcast/P0001/POD0000000001.opus
ffmpeg version 4.3 Copyright (c) 2000-2020 the FFmpeg developers...
```
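A 48000 Hz reading is what ffmpeg typically reports for Opus streams, since the Ogg Opus format is defined around a 48 kHz decode rate regardless of the rate of the original audio. Assuming the metadata's `sample_rate=16000` is the rate intended for training, one option is to resample when decoding. A minimal sketch that only builds the ffmpeg command (the helper name and the output file name are hypothetical):

```python
import shlex


def ffmpeg_resample_cmd(src, dst, sample_rate=16000, channels=1):
    """Build an ffmpeg command that decodes `src` and resamples it into `dst`."""
    return [
        "ffmpeg",
        "-i", src,                 # input file, e.g. an .opus segment
        "-ar", str(sample_rate),   # output sample rate in Hz
        "-ac", str(channels),      # number of output channels
        dst,
    ]


cmd = ffmpeg_resample_cmd("POD0000000001.opus", "POD0000000001.wav")
print(shlex.join(cmd))
```

Passing `cmd` to `subprocess.run(cmd, check=True)` would execute the conversion on a machine where ffmpeg is installed.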
See the discussion here: https://github.com/SpeechColab/PySpeechColab/pull/2 We should make changes to https://github.com/SpeechColab/GigaSpeech/blob/main/utils/download_meta.sh to include the json->jsonl conversion. The conversion command is `jq -c '.audios[]' GigaSpeech.json > GigaSpeech.jsonl` See examples here for...
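For environments without `jq`, the same conversion can be sketched in Python. This assumes, as the `jq` filter above implies, that the top-level JSON object holds a list under the `audios` key; the function name is hypothetical. Unlike a streaming parser, `json.load` reads the whole file into memory, which may matter for a metadata file of this size.

```python
import json


def json_to_jsonl(src_path, dst_path, key="audios"):
    """Write each element of the list stored under `key` as one JSON line."""
    with open(src_path, encoding="utf-8") as src:
        records = json.load(src)[key]
    with open(dst_path, "w", encoding="utf-8") as dst:
        for record in records:
            # Compact separators keep each record on a single short line.
            line = json.dumps(record, ensure_ascii=False, separators=(",", ":"))
            dst.write(line + "\n")
    return len(records)
```

A call such as `json_to_jsonl("GigaSpeech.json", "GigaSpeech.jsonl")` returns the number of records written, which makes it easy to check the result against the expected number of audio entries.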
Hi, is there an official number for the final number of words in the test set for scoring? WeNet results say there are 19928 sentences and 390656 words: https://github.com/wenet-e2e/wenet/tree/main/examples/gigaspeech/s0 Kaldi...
We should provide scoring scripts (e.g., for normalization) so that results from different toolkits are comparable.
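To illustrate why a shared normalization step matters, here is a minimal sketch of one. It is not the rule set in `gigaspeech_scoring.py`: the tag lists below are assumptions about the transcript markup, and the function names are hypothetical. Toolkits that filter different tags before scoring could end up with different sentence and word counts for the same test set.

```python
# Assumed markup tags; adjust to match the actual transcripts.
PUNCTUATION_TAGS = {"<COMMA>", "<PERIOD>", "<QUESTIONMARK>", "<EXCLAMATIONPOINT>"}
GARBAGE_TAGS = {"<SIL>", "<MUSIC>", "<NOISE>", "<OTHER>"}
FILTERED_TAGS = PUNCTUATION_TAGS | GARBAGE_TAGS


def normalize(text):
    """Uppercase the transcript and drop punctuation and garbage tags."""
    words = [w for w in text.upper().split() if w not in FILTERED_TAGS]
    return " ".join(words)


def count_sentences_and_words(transcripts):
    """Count non-empty normalized sentences and the words they contain."""
    kept = [t for t in (normalize(x) for x in transcripts) if t]
    return len(kept), sum(len(t.split()) for t in kept)
```

Applying the same `normalize` function to both references and hypotheses before computing word error rate is what keeps results from different toolkits comparable.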