About gigaspeech glm file

Open CuiMingyu opened this issue 2 years ago • 2 comments

Hi sir,

Does GigaSpeech provide a GLM file, like the swbd en20000405_hub5.glm, containing the transcript filtering rules?

I notice there are some rules in the gigaspeech_scoring.py file, but do you have a GLM file covering all the rules?

Thanks a lot!

CuiMingyu, Sep 17 '22 08:09

An example of the swbd GLM: [image]

CuiMingyu, Sep 17 '22 08:09

The short answer is: yes and no.

Actually, this is such a good question that I'm going to keep this thread open indefinitely for documentation purposes. Here is the long answer:

On the No side: we don't provide a GLM within GigaSpeech because we don't want to complicate the evaluation process with overly complex sub-systems (such as text normalization (TN) and context-dependent language rewriting). That way, downstream research toolkits can integrate and adopt GigaSpeech easily.

And as you mentioned, we do provide a very simple script containing our recommended text post-processing (see the discussion in https://github.com/SpeechColab/GigaSpeech/issues/24). It should provide a reliable apples-to-apples basis for academic comparisons.
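To give a rough idea of what "very simple text post-processing" means in practice, here is a minimal sketch. It is not copied from gigaspeech_scoring.py: the function name `normalize` and the tag and filler lists are illustrative assumptions, so check the script itself for the actual rules.

```python
# Illustrative lists only (assumptions); the real script may use different ones.
PUNCTUATION_TAGS = {"<COMMA>", "<PERIOD>", "<QUESTIONMARK>", "<EXCLAMATIONPOINT>"}
NON_SPEECH_TAGS = {"<SIL>", "<MUSIC>", "<NOISE>", "<OTHER>"}
FILLERS = {"UH", "UM", "EH", "AH"}


def normalize(text: str) -> str:
    """Upper-case the text, then drop tags and filler words before scoring."""
    drop = PUNCTUATION_TAGS | NON_SPEECH_TAGS | FILLERS
    words = [w for w in text.upper().split() if w not in drop]
    return " ".join(words)
```

Applying the same function to both reference and hypothesis before computing WER is what keeps the comparison consistent across toolkits.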

On the Yes side: to treat ASR benchmarking more seriously, closer to real-life ASR scenarios, we developed a universal benchmarking platform that contains modules such as:

  • production-grade TN (based on NeMo)
  • a sophisticated evaluation tool (supporting GLM and more, going even beyond the NIST tooling)

They live in our Leaderboard project repo, where you can already find a GLM file containing hundreds of rewriting rules for English in general, not limited to GigaSpeech. You're welcome to help us improve it; it's an asset for the entire speech community.

Here is a glance at dummy output from the scoring tool: [screenshot, 2022-09-17]

As you can see, the raw form WE ARE is transformed to WE'RE as the result of the GLM rule WE'RE <-> WE ARE, so that it matches the reference on the fly. We even tag these alternative expansions with # and align them neatly, so that error analysis becomes crystal clear.
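For readers unfamiliar with the idea, the on-the-fly matching can be sketched roughly as follows. This is a simplified illustration, not the Leaderboard tool's algorithm: the `RULES` table, the function names, and the greedy search are all assumptions, and the real GLM file syntax and the # tagging are not reproduced here.

```python
from typing import Iterator, List, Tuple

# Hypothetical rule table; each pair is interchangeable in both directions.
RULES: List[Tuple[Tuple[str, ...], Tuple[str, ...]]] = [
    (("WE'RE",), ("WE", "ARE")),
    (("IT'S",), ("IT", "IS")),
]


def edit_distance(a: List[str], b: List[str]) -> int:
    """Word-level Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (wa != wb)))
        prev = cur
    return prev[-1]


def variants(words: List[str]) -> Iterator[List[str]]:
    """Yield every sequence obtained by applying one rule once, in either direction."""
    for left, right in RULES:
        for src, dst in ((left, right), (right, left)):
            n = len(src)
            for i in range(len(words) - n + 1):
                if tuple(words[i:i + n]) == src:
                    yield words[:i] + list(dst) + words[i + n:]


def match_to_reference(hyp: str, ref: str) -> str:
    """Greedily rewrite hyp with rules for as long as that lowers its distance to ref."""
    h, r = hyp.split(), ref.split()
    best = edit_distance(h, r)
    improved = True
    while improved:
        improved = False
        for cand in variants(h):
            d = edit_distance(cand, r)
            if d < best:
                h, best, improved = cand, d, True
                break
    return " ".join(h)
```

With this sketch, `match_to_reference("WE ARE GOING HOME", "WE'RE GOING HOME")` rewrites the hypothesis to the contracted form, so the difference is not counted as an error. A hypothesis that already matches the reference is left untouched, because no rewrite lowers its distance.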

dophist, Sep 17 '22 12:09