
Proof-of-concept implementation for OpenAI-compatible API format

Open · ayancey opened this issue 6 months ago · 1 comment

Quick and dirty implementation of what it would look like to support OpenAI's API format. This is an attempt to satisfy #227.

Example OpenAI output from /v1/audio/transcriptions:

{
    "task": "transcribe",
    "language": "english",
    "duration": 9.90999984741211,
    "text": "The dog jumped over the big fence and then it ran over to the farm.",
    "segments": [
        {
            "id": 0,
            "seek": 0,
            "start": 0.0,
            "end": 11.0,
            "text": " The dog jumped over the big fence and then it ran over to the farm.",
            "tokens": [
                50364,
                440,
                3000,
                13864,
                670,
                264,
                955,
                15422,
                293,
                550,
                309,
                5872,
                670,
                281,
                264,
                5421,
                13,
                50914
            ],
            "temperature": 0.0,
            "avg_logprob": -0.3397972285747528,
            "compression_ratio": 1.0634920597076416,
            "no_speech_prob": 0.02906951494514942
        }
    ]
}
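As a hedged sketch (not the project's actual code), the various response formats could all be derived from one verbose result shaped like the example above; the `to_srt_time` helper and `shape_response` name are illustrative assumptions:

```python
def to_srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, e.g. 11.0 -> '00:00:11,000'."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def shape_response(result: dict, response_format: str):
    """Reduce a verbose transcription result to the requested format."""
    if response_format == "verbose_json":
        return result                     # keep segments and metadata as-is
    if response_format == "json":
        return {"text": result["text"]}   # segments are thrown away here
    if response_format == "text":
        return result["text"]
    if response_format == "srt":
        return "\n".join(
            f"{i}\n{to_srt_time(seg['start'])} --> {to_srt_time(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
            for i, seg in enumerate(result["segments"], start=1)
        )
    raise ValueError(f"unsupported response_format: {response_format}")
```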

Some notes:

  • OpenAI's implementation accepts multipart form data, not a JSON request body.
  • The response formats they offer are json, text, srt, verbose_json, and vtt. json has only a single "text" key, whereas verbose_json includes the additional metadata shown above. verbose_json is used together with the timestamp_granularities[] array to provide segment- or word-level timestamps. Since we always get segments, we have to throw them away when the json format is requested.
  • We need a way to get the duration of the file to match OpenAI's output. I can divide the size of the numpy array by the sample rate to get the number of seconds, but I also have to divide by 2? Not sure if it's stereo; that wouldn't make much sense.
  • The abstraction between the endpoint route methods and the core.py methods for whisper/faster-whisper needs to change, mostly so the JSON keys can be modified before they're turned into a StringIO stream.
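On the duration question: the factor of two is most likely the sample width rather than stereo. ffmpeg-decoded s16le PCM is two bytes per sample, so dividing a raw byte count by the sample rate over-counts by 2; once the audio is a float32 sample array (as whisper's load_audio returns), array length divided by sample rate gives seconds directly. A sketch under those assumptions (the function names are hypothetical):

```python
SAMPLE_RATE = 16_000  # the sample rate whisper expects

def duration_from_pcm_bytes(pcm: bytes, sample_rate: int = SAMPLE_RATE) -> float:
    """Duration of mono 16-bit (s16le) PCM: each sample is 2 bytes."""
    return len(pcm) / (2 * sample_rate)

def duration_from_samples(num_samples: int, sample_rate: int = SAMPLE_RATE) -> float:
    """Duration of a decoded sample array (e.g. float32): one entry per sample."""
    return num_samples / sample_rate
```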

ayancey · Aug 07 '24