Recognition: add support for OpenAI's cloud Whisper API
OpenAI provides a paid cloud service that can transcribe speech using the largest Whisper model (large-v2):
https://api.openai.com/v1/audio/transcriptions
And translate speech using the same model:
https://api.openai.com/v1/audio/translations
Beyond the very basics, the raw REST API doesn't seem to be well documented. The official Python library supports many undocumented features, so it can serve as a reference.
Based on the official OpenAI Python library, and a post on the issue tracker, it may be possible to get timing information for fragments (that is, groups of a few words, not word-level) using a response format of `verbose_json`. I can then run DTW alignment individually on each segment to approximate word-level timestamps.
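To illustrate the segment-to-word step, here is a minimal sketch. The segment shape (`start`, `end`, `text`) is assumed from the Python library's verbose output, and the even-split timing is only a placeholder for where the per-segment DTW alignment would go:

```typescript
// Assumed shape of a segment entry in the verbose_json response.
interface WhisperSegment {
	start: number // segment start time, in seconds
	end: number // segment end time, in seconds
	text: string
}

interface WordTimestamp {
	word: string
	start: number
	end: number
}

// Placeholder for per-segment alignment: spread the segment's duration
// evenly over its words. In the real implementation, this step would be
// replaced by DTW alignment of the segment's audio slice against its text.
function approximateWordTimestamps(segments: WhisperSegment[]): WordTimestamp[] {
	const result: WordTimestamp[] = []

	for (const segment of segments) {
		const words = segment.text.trim().split(/\s+/).filter(word => word.length > 0)

		if (words.length === 0) {
			continue
		}

		const step = (segment.end - segment.start) / words.length

		words.forEach((word, index) => {
			result.push({
				word,
				start: segment.start + index * step,
				end: segment.start + (index + 1) * step,
			})
		})
	}

	return result
}
```

Since each segment carries its own start and end time, the alignment work stays local to the segment, which should keep the approximation error bounded by the segment's length.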