benchmarkstt
Ground truth materials
Two freely available candidate datasets have already been identified; more are welcome:
- Mozilla Common Voice https://voice.mozilla.org/en
  CC-0 license
- Openslr resources http://openslr.org/resources.php
  Each resource has its own license, ranging from "unrestricted" to "CC-BY-NC-ND 3.0".
  Remark: Some of the Openslr data is likely to have been used to train various STT systems, so results on it may not always be a fair indicator.
Open questions:
- Which ground truth materials might we use for evaluating vendors' solutions? Will we build our own dataset? Or both?
- How will we include these resources in our product?
Suggestion: the user of our benchmark should have the choice of which data to use (with a sensible default, following the 'Extensibility', 'Specific over generic' and 'Pragmatism' principles).
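One way to picture this suggestion is a small dataset registry with a default choice. The following is only a sketch under stated assumptions: the `Dataset` class, the `DATASETS` registry, the registry keys, and the functions `register_dataset` and `select_dataset` are all hypothetical names and not part of any existing API. Which dataset makes a sensible default is still an open question, so the default below is a placeholder.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Dataset:
    """Hypothetical description of a ground truth dataset."""
    name: str
    url: str
    license: str


# Hypothetical registry; the keys are illustrative only.
DATASETS = {
    "common-voice": Dataset(
        "Mozilla Common Voice", "https://voice.mozilla.org/en", "CC-0"
    ),
    "openslr": Dataset(
        "Openslr resources",
        "http://openslr.org/resources.php",
        "varies per resource",
    ),
}

# Assumption: placeholder default, the actual choice is undecided.
DEFAULT_DATASET = "common-voice"


def register_dataset(key: str, dataset: Dataset) -> None:
    """Add a user-supplied dataset (extensibility)."""
    if key in DATASETS:
        raise ValueError(f"Dataset {key!r} is already registered")
    DATASETS[key] = dataset


def select_dataset(key: Optional[str] = None) -> Dataset:
    """Return the requested dataset, or the default when none is given."""
    chosen = DEFAULT_DATASET if key is None else key
    if chosen not in DATASETS:
        raise ValueError(
            f"Unknown dataset {chosen!r}; available: {sorted(DATASETS)}"
        )
    return DATASETS[chosen]
```

With a layout like this, a user who passes nothing gets the default, a user who names a dataset gets that one, and an in-house dataset could be added through `register_dataset` without changing the benchmark itself.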