mlx-audio
Add model harness with WER score and tests (STT)
Description
We need to implement a model harness for evaluating Speech-to-Text (STT) models that calculates Word Error Rate (WER) as the primary performance metric, along with comprehensive tests.
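As a point of reference for the metric itself, WER is commonly defined as the word-level edit distance (substitutions + deletions + insertions) divided by the number of words in the reference transcript. The sketch below is one minimal way to compute it, not the harness's actual implementation. The function name `word_error_rate` and the normalization (lowercasing and whitespace splitting only) are assumptions made for illustration; a real harness would likely also need punctuation and number normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration. Normalization here is only
    lowercasing and whitespace tokenization (an assumption).
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)
```

Note that WER can exceed 1.0 when the hypothesis contains many inserted words, so reports should not assume the value is bounded by 100%.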
Requirements
- Implement a model harness that can load and evaluate any STT model in our system
- Calculate WER (Word Error Rate) as the primary metric
- Support additional metrics where appropriate (CER, BLEU, etc.)
- Provide test utilities to generate synthetic audio for testing edge cases
- Create benchmark test suite with standard datasets (LibriSpeech, Common Voice, etc.)
- Support different audio formats and sampling rates
- Generate comprehensive reports with per-utterance and aggregate scores
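To make the reporting requirement concrete, here is a rough sketch of a harness loop that produces per-utterance and aggregate scores. Everything named here (`UtteranceResult`, `evaluate`, and the `transcribe` callable that maps an audio path to text) is hypothetical and not part of any existing mlx-audio API. The aggregate is computed at corpus level (total errors over total reference words), which is a common convention and differs from averaging the per-utterance WERs.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


def _edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
        prev = curr
    return prev[-1]


@dataclass
class UtteranceResult:
    """Per-utterance record (hypothetical report schema)."""
    audio_path: str
    reference: str
    hypothesis: str
    errors: int
    ref_words: int

    @property
    def wer(self) -> float:
        return self.errors / self.ref_words


def evaluate(
    transcribe: Callable[[str], str],
    samples: Iterable[Tuple[str, str]],
) -> Tuple[List[UtteranceResult], float]:
    """Run `transcribe` on each (audio_path, reference) pair.

    Returns the per-utterance results and the corpus-level WER.
    `transcribe` is an assumed interface: any callable taking an
    audio path and returning the transcript as a string.
    """
    results: List[UtteranceResult] = []
    for audio_path, reference in samples:
        hypothesis = transcribe(audio_path)
        ref = reference.lower().split()
        hyp = hypothesis.lower().split()
        if not ref:
            raise ValueError(f"empty reference for {audio_path}")
        results.append(
            UtteranceResult(audio_path, reference, hypothesis,
                            _edit_distance(ref, hyp), len(ref))
        )
    if not results:
        raise ValueError("no samples to evaluate")

    total_errors = sum(r.errors for r in results)
    total_words = sum(r.ref_words for r in results)
    return results, total_errors / total_words
```

Taking the model as a plain callable keeps the harness independent of how a given STT model is loaded, so any model in the system could be wrapped in a small adapter function. CER could be added along the same lines by running the same distance over characters instead of words.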
Acceptance Criteria
- [x] Model harness successfully loads and evaluates STT models
- [x] WER calculation matches reference implementation (tested against known examples)
- [x] Test suite covers at least 3 standard STT datasets
- [x] Documentation includes examples and explanation of metrics
- [x] CI integration ensures WER doesn't regress on benchmark datasets
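For the CI criterion, one possible shape for the regression gate is a comparison of freshly measured WER values against stored baselines with a small tolerance. This is only a sketch: the function name `assert_no_wer_regression`, the dictionary layout, and the default tolerance are all assumptions, and the dataset keys in the usage example are placeholders rather than measured results.

```python
from typing import Dict


def assert_no_wer_regression(
    current: Dict[str, float],
    baseline: Dict[str, float],
    tolerance: float = 0.005,
) -> None:
    """Raise AssertionError if any dataset's WER exceeds its baseline.

    `current` and `baseline` map a dataset name to a WER value.
    Datasets without a stored baseline are skipped (an assumption;
    a stricter gate might fail on them instead).
    """
    regressions = {
        name: (baseline[name], wer)
        for name, wer in current.items()
        if name in baseline and wer > baseline[name] + tolerance
    }
    if regressions:
        details = ", ".join(
            f"{name}: baseline {old:.4f} -> current {new:.4f}"
            for name, (old, new) in sorted(regressions.items())
        )
        raise AssertionError(f"WER regressed beyond tolerance: {details}")
```

In a test suite this could be called from a single test function after the benchmark run, for example `assert_no_wer_regression({"dataset-a": 0.051}, {"dataset-a": 0.050})`, which passes because the difference is within the default tolerance.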