whisper-asr-webservice
Proof-of-concept implementation of an OpenAI-compatible API format
A quick-and-dirty implementation showing what it would look like to support OpenAI's API format. This is an attempt to satisfy #227.
Example OpenAI output from `/v1/audio/transcriptions`:

```json
{
  "task": "transcribe",
  "language": "english",
  "duration": 9.90999984741211,
  "text": "The dog jumped over the big fence and then it ran over to the farm.",
  "segments": [
    {
      "id": 0,
      "seek": 0,
      "start": 0.0,
      "end": 11.0,
      "text": " The dog jumped over the big fence and then it ran over to the farm.",
      "tokens": [
        50364, 440, 3000, 13864, 670, 264, 955, 15422, 293, 550,
        309, 5872, 670, 281, 264, 5421, 13, 50914
      ],
      "temperature": 0.0,
      "avg_logprob": -0.3397972285747528,
      "compression_ratio": 1.0634920597076416,
      "no_speech_prob": 0.02906951494514942
    }
  ]
}
```
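For contrast with the output above, the request side uses multipart form data rather than a JSON body (see the first note below). A minimal client sketch, assuming the service runs on whisper-asr-webservice's default port 9000 and that this endpoint accepts the same form fields OpenAI documents (`file`, `model`, `response_format`):

```python
import requests

# Hypothetical client call against this proof of concept. Mirrors an
# OpenAI /v1/audio/transcriptions request: fields go in as form data.
with open("sample.mp3", "rb") as f:
    response = requests.post(
        "http://localhost:9000/v1/audio/transcriptions",
        files={"file": f},
        data={"model": "whisper-1", "response_format": "verbose_json"},
    )

print(response.json()["text"])
```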
Some notes:

- OpenAI's implementation uses form data, not JSON input.
- The formats they offer are `json`, `text`, `srt`, `verbose_json`, or `vtt`. `json` has only one key, `"text"`, whereas `verbose_json` includes other basic info. `verbose_json` is used along with the `timestamp_granularities[]` array to provide segment- or word-level timestamps. Since we always get segments, we have to throw those away when the `json` format is used (see the sketch after this list).
- We need a way to get the duration of the file to match OpenAI's output. I can divide the size of the numpy array by the sample rate to get the number of seconds, but I also have to divide it by 2? Not sure if it's stereo; that wouldn't make much sense. (See the duration sketch below.)
- The abstraction between the endpoint route method and the `core.py` methods for whisper/faster-whisper needs to be changed, mostly to allow modifying the JSON keys before they're turned into a StringIO stream.
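To illustrate the format handling and the key-modification step from the notes above, here is a hedged sketch of shaping the result before it is written to a StringIO stream. The function name and structure are illustrative, not the PR's actual code; it assumes `result` carries the full verbose payload (task, language, duration, text, segments) produced by whisper/faster-whisper:

```python
import json
from io import StringIO

def format_transcription(result: dict, response_format: str) -> StringIO:
    """Shape a transcription result to match OpenAI's JSON response formats."""
    if response_format == "json":
        # OpenAI's plain `json` format exposes only the text, so the
        # segments we always compute are discarded here.
        payload = {"text": result["text"]}
    elif response_format == "verbose_json":
        payload = result
    else:
        raise ValueError(f"unsupported format: {response_format}")
    return StringIO(json.dumps(payload))
```

On the duration question: whisper's `load_audio` decodes everything to a mono float32 array at 16 kHz, so from the sample array the duration is just the sample count divided by the sample rate, with no extra factor. The divide-by-two appears only if you measure the raw 16-bit PCM byte buffer, where each sample occupies 2 bytes; it is bytes-per-sample, not stereo. A sketch of both cases:

```python
import numpy as np

SAMPLE_RATE = 16000  # whisper decodes all input to 16 kHz mono

def duration_from_samples(audio: np.ndarray) -> float:
    """Duration from a decoded mono sample array (whisper.load_audio output)."""
    return audio.shape[0] / SAMPLE_RATE

def duration_from_pcm_bytes(raw: bytes) -> float:
    """Duration from raw 16-bit PCM bytes: 2 bytes per sample, hence the /2."""
    return len(raw) / 2 / SAMPLE_RATE
```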
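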