# feat: Implement OpenAI-style local API server for audio transcription
## Before Submitting This PR
Please confirm you have done the following:
- [X] I have searched existing issues and pull requests (including closed ones) to ensure this isn't a duplicate
- [X] I have read CONTRIBUTING.md
If this is a feature or change that was previously closed/rejected:
- [ ] I have explained in the description below why this should be reconsidered
- [ ] I have gathered community feedback (link to discussion below)
## Human Written Description
I implemented a local STT API that follows the OpenAI Whisper API format. Currently, the Whisper model is only accessible from within Handy; however, many users want to leverage this functionality for external tasks such as subtitle transcription without loading multiple model instances. This change exposes the speech-to-text capability as a standardized service, letting users do more with limited system memory. An example request is sketched below.
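For illustration, here is a minimal request sketch in the OpenAI transcription style. The host and port (`127.0.0.1:8080`) are placeholders for whatever address the local server actually binds to, and the `model` value is only an assumption following the OpenAI convention:

```bash
# Sketch of an OpenAI-style transcription request against the local server.
# Host/port and model name are placeholders, not confirmed defaults.
curl -s http://127.0.0.1:8080/v1/audio/transcriptions \
  -F file=@/path/to/audio.mp3 \
  -F model=whisper-1
```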
## Related Issues/Discussions
Fixes: none
Discussion: https://github.com/cjpais/Handy/discussions/241
## Community Feedback
https://github.com/cjpais/Handy/discussions/241
## Testing
Environment:
- Tested on: macOS 26.2 (Apple Silicon M1 Pro)
- Status: Functional on macOS. Help testing on Windows and Linux is needed to ensure consistent behavior.
Test Cases:
- Features: Tested by calling the API with `curl`, plus a demo converting MP3 to SRT (sketched after this list).
- On-demand Loading: Verified via `curl` that calling the `/v1/audio/transcriptions` endpoint correctly triggers the model loading process in the background.
- Waiting Mechanism: Confirmed the API response waits until the model is fully loaded before processing the transcription, preventing "Model not loaded" errors.
- Verified Limitations: Tested various audio formats and confirmed only MP3 currently works reliably; documented this behavior and added a "welcome PRs" note in `LOCAL_API.md` to guide future contributors.
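For reference, a sketch of the MP3-to-SRT demo mentioned above, assuming the endpoint mirrors the OpenAI API's `response_format` parameter; the host/port are again placeholders:

```bash
# Sketch of the MP3 -> SRT demo: request SRT output and save it to a file.
# Assumes the endpoint supports the OpenAI-style response_format parameter.
curl -s http://127.0.0.1:8080/v1/audio/transcriptions \
  -F file=@talk.mp3 \
  -F model=whisper-1 \
  -F response_format=srt \
  -o talk.srt
```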