feat: add SenseVoice ASR model and improved model ranking

Open tmm22 opened this issue 1 month ago • 1 comment

Summary

This PR adds support for Alibaba's SenseVoice ASR model and implements an improved model ranking system.

New Features

SenseVoice ASR Model Integration

SenseVoice - Alibaba's ultra-fast multilingual ASR model
- Supports Chinese, Cantonese, English, Japanese, Korean
- Up to 15x faster than Whisper, with 96% accuracy
- 234 MB model size using ONNX Runtime inference
- Uses LFR (Low Frame Rate) feature stacking for efficiency
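
As an illustration of the LFR step, here is a minimal sketch. It assumes the scheme common in FunASR-style frontends: `m` consecutive feature frames are concatenated into one vector and the window advances by `n` frames, which shortens the sequence the model has to process. The function name `stackLowFrameRate` and the defaults `m = 7`, `n = 6` are assumptions for illustration and are not taken from this PR.

```swift
import Foundation

/// Stacks `m` consecutive feature frames into one vector, advancing by `n` frames.
/// Hypothetical helper; the window/stride defaults are assumed, not read from the PR.
func stackLowFrameRate(_ frames: [[Float]], window m: Int = 7, stride n: Int = 6) -> [[Float]] {
    guard let first = frames.first, let last = frames.last, m > 0, n > 0 else { return [] }

    // Left-pad with copies of the first frame so the first window is roughly centered.
    let padded = Array(repeating: first, count: (m - 1) / 2) + frames
    let outputCount = (frames.count + n - 1) / n  // ceil(T / n)

    var stacked: [[Float]] = []
    stacked.reserveCapacity(outputCount)
    for i in 0..<outputCount {
        var row: [Float] = []
        row.reserveCapacity(m * first.count)
        for j in 0..<m {
            let index = i * n + j
            // Past the end of the input, repeat the last frame as right padding.
            row.append(contentsOf: index < padded.count ? padded[index] : last)
        }
        stacked.append(row)
    }
    return stacked
}
```

With these assumed defaults, 12 input frames become 2 stacked frames, each 7 times as wide as an input frame.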

Dynamic Model Ranking System

- Added speed and accuracy properties to the TranscriptionModel protocol
- Models are now sorted using a geometric mean score: sqrt(accuracy × speed)
- Models excelling at BOTH speed AND accuracy rank highest
- Bonus (+0.1) applied for models with high accuracy (>=0.94) AND high speed (>=0.75)
- Sorting applied across Recommended, Local, and Cloud tabs
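
The scoring rule described above can be sketched as follows. This is an illustrative reconstruction from the description, not the code in `ModelManagementView.swift`: the `RankedModel` type and the function names are hypothetical, and both ratings are assumed to be normalized to the range 0...1.

```swift
import Foundation

// Hypothetical stand-in for the app's model type; only the two ratings matter here.
struct RankedModel {
    let name: String
    let accuracy: Double  // assumed normalized to 0...1
    let speed: Double     // assumed normalized to 0...1
}

/// Geometric mean of accuracy and speed, plus a 0.1 bonus when both are high.
func rankingScore(accuracy: Double, speed: Double) -> Double {
    let base = (accuracy * speed).squareRoot()
    let bonus = (accuracy >= 0.94 && speed >= 0.75) ? 0.1 : 0.0
    return base + bonus
}

/// Sorts models from highest to lowest score.
func sortedByRank(_ models: [RankedModel]) -> [RankedModel] {
    models.sorted {
        rankingScore(accuracy: $0.accuracy, speed: $0.speed)
            > rankingScore(accuracy: $1.accuracy, speed: $1.speed)
    }
}
```

The geometric mean penalizes lopsided models: a model rated 0.99 for accuracy but 0.30 for speed scores about 0.54, below a balanced 0.95/0.80 model at about 0.87 before the bonus.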

Files Added

| File | Description |
| --- | --- |
| `SenseVoiceTranscriptionService.swift` | ONNX-based inference with LFR feature stacking |
| `SenseVoiceTokenizer.swift` | Token decoding for SenseVoice output format |
| `SenseVoiceModelCardView.swift` | UI card with speed/accuracy ratings |
| `WhisperState+SenseVoice.swift` | Model download, delete, and management |
| `FastConformerFeatureExtractor.swift` | Audio feature extraction for ONNX models |

Files Modified

- `TranscriptionModel.swift` - Added SenseVoiceModel struct and protocol extensions
- `PredefinedModels.swift` - Added SenseVoice model definition
- `ModelManagementView.swift` - Added ranking algorithm and SenseVoice actions
- `ModelCardRowView.swift` - Added SenseVoice card rendering
- `WhisperState.swift` - Added service and routing
- `project.pbxproj` - Added onnxruntime-swift-package-manager dependency

Dependencies

This PR adds the onnxruntime-swift-package-manager package from Microsoft for ONNX Runtime inference.

Testing

- [x] Build succeeds
- [x] Model downloads successfully
- [x] Transcription works correctly
- [x] Model ranking sorts models as expected

Screenshots

The SenseVoice model appears in the Local models tab with speed/accuracy ratings displayed.

Summary by cubic

Adds SenseVoice multilingual ASR via ONNX Runtime and a new ranking that prioritizes models that are both fast and accurate. This brings much faster Asian-language transcription and better model recommendations across tabs.

New Features
- SenseVoice ASR integration (zh/yue/en/ja/ko), ~234 MB, up to 15x faster than Whisper.
- LFR feature stacking, custom tokenizer, and greedy decoding with a fast feature extractor.
- SenseVoice model card with download/delete/show-in-Finder and progress UI.
- Ranking updates: added speed/accuracy to TranscriptionModel; geometric-mean score with a small bonus; applied to Recommended, Local, and Cloud.

Bug Fixes
- Added SenseVoice routing in AudioFileTranscriptionService and Manager to prevent cloud fallback.
- More robust downloads and decoding: HTTP status validation and CTC collapse for cleaner text.
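
For readers unfamiliar with the CTC collapse mentioned above, here is a minimal sketch of greedy CTC decoding: take the highest-scoring token per frame, merge consecutive repeats, then drop blanks. The function name and the blank token ID of 0 are assumptions for illustration; the PR does not state which ID the SenseVoice vocabulary uses.

```swift
import Foundation

/// Greedy CTC decoding over per-frame scores (one inner array per frame).
/// `blankID` is an assumed value, not taken from the PR.
func greedyCTCDecode(_ logits: [[Float]], blankID: Int = 0) -> [Int] {
    var tokens: [Int] = []
    var previous: Int? = nil
    for frame in logits {
        // Argmax over the vocabulary for this frame; skip empty frames.
        guard let best = frame.indices.max(by: { frame[$0] < frame[$1] }) else { continue }
        // Emit only when the token changes and is not the blank symbol.
        if best != previous && best != blankID {
            tokens.append(best)
        }
        previous = best
    }
    return tokens
}
```

A blank between two identical tokens keeps both, so the per-frame path 1, 1, blank, 1, 2, 2 decodes to 1, 1, 2. The resulting IDs would then be mapped to text by the tokenizer.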

Written for commit 7cb1b5478408792a5ebe65afc61deb265fd0983e. Summary will update automatically on new commits.

Nov 28 '25 00:11 tmm22