
feat: add SenseVoice ASR model and improved model ranking

Open · tmm22 opened this issue 1 month ago · 1 comment

Summary

This PR adds support for Alibaba's SenseVoice ASR model and implements an improved model ranking system.

New Features

SenseVoice ASR Model Integration

  • SenseVoice - Alibaba's ultra-fast multilingual ASR model
    • Supports Chinese, Cantonese, English, Japanese, Korean
    • Up to 15x faster than Whisper, with a reported accuracy of 96%
    • 234 MB model size using ONNX Runtime inference
    • Uses LFR (Low Frame Rate) feature stacking for efficiency
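LFR stacking can be illustrated with a short sketch. This is not the code from SenseVoiceTranscriptionService.swift. The function name `stackLFR`, the parameters `m` (frames per stack) and `n` (stride), and the edge-padding scheme (repeat the first and last frame) are assumptions made for illustration, since the PR does not state the window, the stride, or the padding it uses.

```swift
/// Stacks `m` consecutive frames into one wide frame and advances by `n`
/// frames per output, so the sequence shrinks by roughly a factor of `n`.
/// Assumption: edges are padded by repeating the first and last frame.
func stackLFR(_ frames: [[Float]], m: Int, n: Int) -> [[Float]] {
    guard !frames.isEmpty, m > 0, n > 0 else { return [] }
    let count = frames.count
    let leftPad = (m - 1) / 2                 // frames of context before the first frame
    let outputCount = (count + n - 1) / n     // ceil(count / n)
    var stacked: [[Float]] = []
    stacked.reserveCapacity(outputCount)
    for i in 0..<outputCount {
        var wide: [Float] = []
        wide.reserveCapacity(m * frames[0].count)
        for k in 0..<m {
            // Clamp the source index so out-of-range positions reuse an edge frame.
            let index = min(max(i * n + k - leftPad, 0), count - 1)
            wide.append(contentsOf: frames[index])
        }
        stacked.append(wide)
    }
    return stacked
}

// Five one-dimensional frames, window of 3, stride of 2.
let lfrDemo = stackLFR([[1], [2], [3], [4], [5]], m: 3, n: 2)
// lfrDemo == [[1, 1, 2], [2, 3, 4], [4, 5, 5]]
```

The efficiency gain comes from the shorter sequence: the encoder sees about `count / n` wide frames instead of `count` narrow ones.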

Dynamic Model Ranking System

  • Added speed and accuracy properties to the TranscriptionModel protocol
  • Models are now sorted using a geometric mean score: sqrt(accuracy × speed)
  • Models excelling at BOTH speed AND accuracy rank highest
  • Bonus (+0.1) applied for models with high accuracy (>=0.94) AND high speed (>=0.75)
  • Sorting applied across Recommended, Local, and Cloud tabs
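The scoring rule above can be sketched as follows. `RankedModel`, `rankingScore`, and `ranked` are hypothetical names standing in for the PR's TranscriptionModel protocol and the code in ModelManagementView.swift. The ratings are assumed to lie between 0 and 1, and the sample values are illustrative only.

```swift
// Hypothetical stand-in for the PR's TranscriptionModel protocol; only the
// two ratings described above are modeled, both assumed to lie in 0...1.
struct RankedModel {
    let name: String
    let accuracy: Double
    let speed: Double
}

/// Geometric mean of accuracy and speed, plus the +0.1 bonus for models
/// with accuracy >= 0.94 and speed >= 0.75.
func rankingScore(accuracy: Double, speed: Double) -> Double {
    let base = (accuracy * speed).squareRoot()
    let bonus = (accuracy >= 0.94 && speed >= 0.75) ? 0.1 : 0.0
    return base + bonus
}

/// Sorts models from highest to lowest score.
func ranked(_ models: [RankedModel]) -> [RankedModel] {
    models.sorted {
        rankingScore(accuracy: $0.accuracy, speed: $0.speed)
            > rankingScore(accuracy: $1.accuracy, speed: $1.speed)
    }
}

// Illustrative ratings only; these are not the values shipped in the PR.
let sample = [
    RankedModel(name: "accurate-but-slow", accuracy: 0.98, speed: 0.30),
    RankedModel(name: "balanced", accuracy: 0.95, speed: 0.80),
    RankedModel(name: "fast-but-rough", accuracy: 0.70, speed: 0.95),
]
let order = ranked(sample).map(\.name)
// order == ["balanced", "fast-but-rough", "accurate-but-slow"]
```

A geometric mean penalizes lopsided models more than an arithmetic mean would: a 0.98/0.30 model scores about 0.54 here, against 0.64 for a plain average. With the bonus applied, a score can exceed 1.0, so it is only meaningful for ordering.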

Files Added

| File | Description |
| --- | --- |
| SenseVoiceTranscriptionService.swift | ONNX-based inference with LFR feature stacking |
| SenseVoiceTokenizer.swift | Token decoding for SenseVoice output format |
| SenseVoiceModelCardView.swift | UI card with speed/accuracy ratings |
| WhisperState+SenseVoice.swift | Model download, delete, and management |
| FastConformerFeatureExtractor.swift | Audio feature extraction for ONNX models |

Files Modified

  • TranscriptionModel.swift - Added SenseVoiceModel struct and protocol extensions
  • PredefinedModels.swift - Added SenseVoice model definition
  • ModelManagementView.swift - Added ranking algorithm and SenseVoice actions
  • ModelCardRowView.swift - Added SenseVoice card rendering
  • WhisperState.swift - Added service and routing
  • project.pbxproj - Added onnxruntime-swift-package-manager dependency

Dependencies

This PR adds the onnxruntime-swift-package-manager package from Microsoft for ONNX Runtime inference.

Testing

  • [x] Build succeeds
  • [x] Model downloads successfully
  • [x] Transcription works correctly
  • [x] Model ranking sorts models as expected

Screenshots

The SenseVoice model appears in the Local models tab with speed/accuracy ratings displayed.


Summary by cubic

Adds SenseVoice multilingual ASR via ONNX Runtime and a new ranking that prioritizes models that are both fast and accurate. This brings much faster Asian-language transcription and better model recommendations across tabs.

  • New Features

    • SenseVoice ASR integration (zh/yue/en/ja/ko), ~234 MB, up to 15x faster than Whisper.
    • LFR feature stacking, custom tokenizer, and greedy decoding with a fast feature extractor.
    • SenseVoice model card with download/delete/show-in-Finder and progress UI.
    • Ranking updates: added speed/accuracy to TranscriptionModel; geometric-mean score with a small bonus; applied to Recommended, Local, and Cloud.
  • Bug Fixes

    • Added SenseVoice routing in AudioFileTranscriptionService and Manager to prevent cloud fallback.
    • More robust downloads and decoding: HTTP status validation and CTC collapse for cleaner text.
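Greedy decoding with CTC collapse, as mentioned in the summary, can be sketched as below. This is a generic illustration and not the PR's decoder. The function name `ctcGreedyDecode`, the `[[Float]]` input layout (one score vector per frame), and the blank token ID of 0 are assumptions. Mapping token IDs to text is left to the tokenizer.

```swift
/// Greedy CTC decoding: take the best-scoring token per frame, merge
/// consecutive repeats, then drop blanks.
/// Assumption: `blankID` defaults to 0; the real model may use another ID.
func ctcGreedyDecode(_ logits: [[Float]], blankID: Int = 0) -> [Int] {
    var tokens: [Int] = []
    var previous: Int? = nil
    for frame in logits {
        // Index of the highest score in this frame; empty frames are skipped.
        guard let best = frame.indices.max(by: { frame[$0] < frame[$1] }) else { continue }
        // Emit only when the token changes and is not the blank symbol.
        if best != previous && best != blankID {
            tokens.append(best)
        }
        previous = best
    }
    return tokens
}

// Per-frame best tokens are 0, 1, 1, 0, 1, 2, 2 (0 is the blank).
let ctcFrames: [[Float]] = [
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.2, 0.7],
    [0.0, 0.1, 0.9],
]
let decoded = ctcGreedyDecode(ctcFrames)
// decoded == [1, 1, 2]
```

The blank between the two runs of token 1 is what lets a genuinely repeated token survive the collapse. Without it, both runs would merge into one.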

Written for commit 7cb1b5478408792a5ebe65afc61deb265fd0983e. Summary will update automatically on new commits.

tmm22 · Nov 28 '25 00:11