Ground truth materials

Open. MikeSmithEU opened this issue 5 years ago • 1 comment

Two freely available candidate datasets have already been identified; more are welcome:

  1. Mozilla Common Voice https://voice.mozilla.org/en
    CC-0 license
  2. Openslr resources http://openslr.org/resources.php
    Each resource has its own license, ranging from "unrestricted" to "CC-BY-NC-ND 3.0".
    Remark: some of the Openslr data has likely been used to train various STT systems, so results on it may not always be a fair indicator.

Open questions:

  • Which ground truth materials should we use to evaluate vendors' solutions: existing datasets, a dataset we build ourselves, or both?
  • How will we include these resources in our product?

MikeSmithEU, Apr 03 '19 11:04

Suggestion: the user of our benchmark should have the choice of which data to use (with a sensible default, following the 'Extensibility', 'Specific over generic' and 'Pragmatism' principles).
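
To make the suggestion concrete, here is a minimal sketch of what a user-selectable dataset with a sensible default could look like. Everything in it is an assumption for illustration: the names `DatasetInfo`, `register_dataset`, `get_dataset` and `DEFAULT_DATASET` are hypothetical and not part of any existing benchmarkstt API, the choice of Common Voice as the default is arbitrary, and the built-in loaders are empty placeholders rather than real download code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, Optional, Tuple

# One sample is (path to audio file, reference transcript).
Sample = Tuple[str, str]
Loader = Callable[[], Iterator[Sample]]


@dataclass(frozen=True)
class DatasetInfo:
    """Hypothetical description of one ground truth dataset."""
    name: str
    license: str
    loader: Loader


# Hypothetical registry; users extend it instead of editing the core code.
_REGISTRY: Dict[str, DatasetInfo] = {}

# Assumed default, picked only for the sake of the example.
DEFAULT_DATASET = "common-voice"


def register_dataset(name: str, license: str, loader: Loader) -> None:
    """Add a dataset (or replace one with the same name) in the registry."""
    _REGISTRY[name] = DatasetInfo(name=name, license=license, loader=loader)


def get_dataset(name: Optional[str] = None) -> DatasetInfo:
    """Return the requested dataset, or the default when no name is given."""
    key = DEFAULT_DATASET if name is None else name
    if key not in _REGISTRY:
        known = ", ".join(sorted(_REGISTRY))
        raise KeyError(f"unknown dataset {key!r}; known datasets: {known}")
    return _REGISTRY[key]


def _placeholder_loader() -> Iterator[Sample]:
    """Stand-in for real loading code; yields no samples."""
    return iter(())


# The two candidates named in this issue, with placeholder loaders.
register_dataset("common-voice", "CC-0", _placeholder_loader)
register_dataset("openslr", "varies per resource", _placeholder_loader)
```

A user who wants their own material would call `register_dataset("in-house", "proprietary", my_loader)` and then `get_dataset("in-house")`, while `get_dataset()` with no argument falls back to the default. Keeping the license next to each dataset is deliberate, since the Openslr resources each carry their own license.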

MikeSmithEU, Apr 03 '19 11:04