speech-datasets
Various speech datasets made available to the public
Hello, when I run the [following](https://github.com/revdotcom/speech-datasets#steps-to-download-from-lfs):

```
cd earnings22
git lfs pull
```

I get these errors:

```
batch response: This repository is over its data quota. Account responsible for LFS...
```
Starting at line `10876` in `4341191.nlp` the labels for every field except `token` seem to be shifted down by one. For example, the token `uh-` here is tagged as `1649`...
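A minimal sketch of how such a shift could be undone once the affected rows are loaded, assuming "shifted down by one" means the labels belonging to a token sit on the row below it. The function name `realign_labels`, the `token` key, and the list-of-dicts representation are illustrative assumptions, not part of the repository's tooling; reading the `.nlp` file into dicts is left to the caller.

```python
def realign_labels(rows, start, token_field="token"):
    """Return a copy of `rows` in which, from index `start` onward, each row
    keeps its own token but takes every other field from the row below it.

    Assumption: labels were shifted DOWN by one, so the correct labels for
    row i are found on row i + 1. The last row has no row below it and is
    left unchanged, so it still needs manual review.
    """
    fixed = [dict(row) for row in rows]  # shallow copies, input is not mutated
    for i in range(start, len(rows) - 1):
        for key, value in rows[i + 1].items():
            if key != token_field:
                fixed[i][key] = value
    return fixed
```

The zero-based `start` index has to be derived from the reported line number by hand, since it depends on whether the file has a header row.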
`earnings21/earnings21-file-metadata.csv` seems to disagree with the output of `lhotse prepare earnings21` from https://github.com/lhotse-speech/lhotse/blob/master/lhotse/recipes/earnings21.py, namely in the duration and sample count.
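One way to pin down such a disagreement is to compare the durations listed in the metadata CSV against durations measured independently. The sketch below assumes hypothetical column names (`file_id`, `duration_sec`) and a caller-supplied mapping of measured durations; the real CSV header and the way durations are obtained (for example, from the lhotse manifests) may differ.

```python
import csv
import io


def find_duration_mismatches(metadata_csv_text, measured,
                             id_col="file_id", dur_col="duration_sec",
                             tolerance=0.01):
    """Return {file_id: (listed, measured)} for files whose listed duration
    differs from the measured one by more than `tolerance` seconds.

    `id_col` and `dur_col` are assumed column names, not the actual header.
    Files missing from `measured` are skipped.
    """
    mismatches = {}
    for row in csv.DictReader(io.StringIO(metadata_csv_text)):
        file_id = row[id_col]
        if file_id not in measured:
            continue
        listed = float(row[dur_col])
        if abs(listed - measured[file_id]) > tolerance:
            mismatches[file_id] = (listed, measured[file_id])
    return mismatches
```

The same comparison applies to sample counts by pointing `dur_col` at the relevant column and setting `tolerance` to 0.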
Hi there, are there any normalization files for the earnings-22 dataset? If so, could you please share them with me? Thanks in advance.
https://github.com/revdotcom/speech-datasets/blob/1852d8e8f79745415e17ed294f1de0f884513465/earnings21/transcripts/nlp_references/4363614.nlp#L2-L44 It seems the transcript there has some issues, as quoted. E.g. `` for a company's name, `` for a person's name. This can be checked against [here](https://seekingalpha.com/article/4363614-banco-santander-mexico-s-bsmx-ceo-hector-grisi-on-q2-2020-results-earnings-call-transcript).
Hi! Would you consider making the audio and transcriptions for the podcast dataset mentioned [in your blogpost](https://www.rev.com/blog/the-podcast-challenge-testing-rev-ais-speech-recognition-accuracy) available in this repository? Thanks!
Earnings21:
- Fix file 4341191 labels that are shifted off by one
- Resolves #35

Earnings22:
- Fixed casing label of numerics from `UC`/`CA`/`LC` to `N/A`
- Fixed preparation error...
License in Earnings21 says:

> The transcripts and associated text files that are used for alignment in this directory are licensed under a [Creative Commons Attribution-ShareAlike 4.0 International](https://creativecommons.org/licenses/by-sa/4.0/) license.

What...