Recognition: add support for OpenAI's cloud Whisper API
OpenAI provides a paid cloud service that can transcribe speech using the largest Whisper model (large-v2):
https://api.openai.com/v1/audio/transcriptions
And translate speech using the same model:
https://api.openai.com/v1/audio/translations
Beyond the very basics, the raw REST API doesn't seem to be well documented. The official Python library supports many undocumented features, so it can serve as a reference.
Based on the official OpenAI Python library, and a post on the issue tracker, it may be possible to get timing information for fragments (that is, groups of a few words, not word-level) using a response format of `verbose_json`. I can then run DTW alignment individually on each segment to approximate word-level timestamps.
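To illustrate the segment-to-word step, here is a minimal sketch. The segment shape (`start`, `end`, `text`) is assumed from the Python library's verbose output, and the even-split timing is only a placeholder for where the per-segment DTW alignment would go:

```typescript
// Assumed shape of a segment entry in the verbose_json response.
interface WhisperSegment {
	start: number // segment start time, in seconds
	end: number // segment end time, in seconds
	text: string
}

interface WordTimestamp {
	word: string
	start: number
	end: number
}

// Placeholder for per-segment alignment: spread the segment's duration
// evenly over its words. In the real implementation, this step would be
// replaced by DTW alignment of the segment's audio slice against its text.
function approximateWordTimestamps(segments: WhisperSegment[]): WordTimestamp[] {
	const result: WordTimestamp[] = []

	for (const segment of segments) {
		const words = segment.text.trim().split(/\s+/).filter(word => word.length > 0)

		if (words.length === 0) {
			continue
		}

		const step = (segment.end - segment.start) / words.length

		words.forEach((word, index) => {
			result.push({
				word,
				start: segment.start + index * step,
				end: segment.start + (index + 1) * step,
			})
		})
	}

	return result
}
```

Since each segment carries its own start and end time, the alignment work stays local to the segment, which should keep the approximation error bounded by the segment's length.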