
Add Model harness with WER score and tests (STT)

Open Blaizzy opened this issue 9 months ago • 0 comments

Description

We need to implement a model harness for evaluating Speech-to-Text (STT) models that calculates Word Error Rate (WER) as the primary performance metric, along with comprehensive tests.

Requirements

  • Implement a model harness that can load and evaluate any STT model in our system
  • Calculate WER (Word Error Rate) as the primary metric
  • Support additional metrics where appropriate (CER, BLEU, etc.)
  • Provide test utilities to generate synthetic audio for testing edge cases
  • Create benchmark test suite with standard datasets (LibriSpeech, Common Voice, etc.)
  • Support different audio formats and sampling rates
  • Generate comprehensive reports with per-utterance and aggregate scores
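For reference, WER is conventionally defined as the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words, and CER is the same ratio over characters. The following is a minimal sketch of both, not the harness implementation. The function names (`wer`, `cer`, `corpus_wer`) and the normalization (lowercasing and whitespace splitting only) are assumptions made for illustration; a real harness would likely also strip punctuation and normalize numerals.

```python
from typing import Iterable, Sequence, Tuple


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def _words(text: str) -> list:
    # Assumed normalization: lowercase and split on whitespace only.
    return text.lower().split()


def wer(reference: str, hypothesis: str) -> float:
    """Per-utterance word error rate."""
    ref = _words(reference)
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, _words(hypothesis)) / len(ref)


def cer(reference: str, hypothesis: str) -> float:
    """Per-utterance character error rate (whitespace included)."""
    ref = reference.lower()
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hypothesis.lower()) / len(ref)


def corpus_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """Aggregate WER: total errors over total reference words."""
    errors = 0
    total = 0
    for reference, hypothesis in pairs:
        ref = _words(reference)
        errors += edit_distance(ref, _words(hypothesis))
        total += len(ref)
    if total == 0:
        raise ValueError("no reference words in corpus")
    return errors / total
```

Note that the aggregate score pools errors across utterances rather than averaging per-utterance WER, so long utterances weigh more than short ones. WER can also exceed 1.0 when the hypothesis contains many insertions.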

Acceptance Criteria

  • [x] Model harness successfully loads and evaluates STT models
  • [x] WER calculation matches reference implementation (tested against known examples)
  • [x] Test suite covers at least 3 standard STT datasets
  • [x] Documentation includes examples and explanation of metrics
  • [x] CI integration ensures WER doesn't regress on benchmark datasets
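One way the regression criterion could be checked in CI is to compare freshly computed per-dataset WER against stored baseline values and fail when any dataset gets worse by more than a small tolerance. This is only a sketch under stated assumptions: `find_regressions`, the dictionary layout, the dataset keys, and the 0.005 tolerance are all hypothetical and not part of mlx-audio.

```python
from typing import Dict, List


def find_regressions(baseline: Dict[str, float],
                     current: Dict[str, float],
                     tolerance: float = 0.005) -> List[str]:
    """Return one message per dataset whose WER regressed.

    `baseline` and `current` map a dataset name to its aggregate WER.
    `tolerance` is an assumed absolute slack to absorb run-to-run noise.
    """
    problems = []
    for dataset, base_wer in baseline.items():
        if dataset not in current:
            # A benchmark that silently stops running is also a failure.
            problems.append(f"{dataset}: missing from current results")
        elif current[dataset] > base_wer + tolerance:
            problems.append(
                f"{dataset}: WER {current[dataset]:.4f} exceeds "
                f"baseline {base_wer:.4f} (tolerance {tolerance})"
            )
    return problems


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    baseline = {"librispeech": 0.050, "common_voice": 0.120}
    current = {"librispeech": 0.052, "common_voice": 0.140}
    problems = find_regressions(baseline, current)
    for message in problems:
        print(message)
    # A non-zero exit code makes the CI job fail.
    raise SystemExit(1 if problems else 0)
```

Only worse-than-baseline results are flagged, so improvements pass and the baseline can then be updated deliberately.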

Blaizzy · May 09 '25 18:05