GigaSpeech

Large, modern dataset for speech recognition

Results 18 GigaSpeech issues

Hi sir, does GigaSpeech provide a glm file, like the swbd en20000405_hub5.glm, containing the transcript filtering rules? I noticed there are some rules in the gigaspeech_scoring.py file. But do you have the...

documentation

Fixing the bug mentioned in https://github.com/SpeechColab/GigaSpeech/issues/103

Hello, the download failed because the network connection dropped while the audio data was downloading. How can I resume the download from the point where it stopped?

documentation

The GigaSpeech dataset is now available on the HuggingFace Hub. Highlights of GigaSpeech on HuggingFace: easy to use (a two-liner in Python); smoother and faster downloading from US...

documentation

Hi, as mentioned in the README, GigaSpeech contains "33,000+ hours for unsupervised/semi-supervised learning". I am trying to use this unlabeled data, and I have already downloaded the XL subset. But...

documentation

Hello, I saw that `sample_rate=16000` in `GigaSpeech.json` does not match the one in the opus file, `SR=48000`:

```
ffmpeg -i /workspace/datasets/GigaSpeech_corpus/audio/podcast/P0001/POD0000000001.opus
ffmpeg version 4.3 Copyright (c) 2000-2020 the FFmpeg developers...
```

documentation

See the discussion here: https://github.com/SpeechColab/PySpeechColab/pull/2 We should change https://github.com/SpeechColab/GigaSpeech/blob/main/utils/download_meta.sh to include the JSON-to-JSONL conversion. The conversion command is `jq -c '.audios[]' GigaSpeech.json > GigaSpeech.jsonl`. See examples here for...

documentation
enhancement
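
For readers without `jq`, the JSON-to-JSONL conversion mentioned in the issue above can be approximated in Python. This is only a sketch: the function name `json_to_jsonl` is made up for illustration, and it assumes the metadata file has a top-level `audios` array (as the `jq` filter `.audios[]` implies). It also loads the whole file into memory, which may be a problem for very large metadata files.

```python
import json


def json_to_jsonl(src_path, dst_path):
    """Write each element of the top-level "audios" array as one JSON line.

    Rough equivalent of: jq -c '.audios[]' src > dst
    Returns the number of lines written.
    """
    with open(src_path, encoding="utf-8") as src:
        meta = json.load(src)  # assumption: the file fits in memory

    with open(dst_path, "w", encoding="utf-8") as dst:
        for audio in meta["audios"]:
            # Compact separators mimic jq's -c (compact) output.
            line = json.dumps(audio, ensure_ascii=False, separators=(",", ":"))
            dst.write(line + "\n")

    return len(meta["audios"])
```

Usage would look like `json_to_jsonl("GigaSpeech.json", "GigaSpeech.jsonl")`, after which each line of the output can be parsed independently with `json.loads`.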

Hi, is there an official figure for the total number of words in the test set used for scoring? WeNet results say there are 19928 sentences and 390656 words: https://github.com/wenet-e2e/wenet/tree/main/examples/gigaspeech/s0 Kaldi...

documentation

We should provide scoring scripts (e.g., for normalization) so that results from different toolkits are comparable.

documentation
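
To illustrate the kind of normalization the scoring issue above asks for, here is a minimal sketch of a transcript normalizer applied to both reference and hypothesis text before computing word error rate. The tag and filler lists below are assumptions chosen for illustration, not the project's official rules; the authoritative lists belong in the repository's own scoring script.

```python
# Illustrative token sets (assumptions, not the official GigaSpeech rules).
PUNCTUATION_TAGS = {"<COMMA>", "<PERIOD>", "<QUESTIONMARK>", "<EXCLAMATIONPOINT>"}
NON_SPEECH_TAGS = {"<SIL>", "<MUSIC>", "<NOISE>", "<OTHER>"}
FILLERS = {"UH", "UM", "EH", "MM", "HM", "AH", "HUH", "HA", "ER"}

DROPPED = PUNCTUATION_TAGS | NON_SPEECH_TAGS | FILLERS


def normalize(text):
    """Upper-case the text and drop punctuation tags, non-speech tags, and fillers."""
    kept = [token for token in text.upper().split() if token not in DROPPED]
    return " ".join(kept)


def word_count(texts):
    """Count words across normalized transcripts, e.g. to compare test-set sizes."""
    return sum(len(normalize(t).split()) for t in texts)
```

Running every toolkit's output through one shared function of this shape is what makes the results comparable: differences in word counts or error rates then reflect the recognizer rather than each toolkit's own text cleanup.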