openai-whisper-microservice
This is an OpenAI Whisper automatic speech recognition microservice
OpenAI Whisper API
IF YOU FIND THIS REPOSITORY HELPFUL, PLEASE CONSIDER STARRING IT.
This microservice provides the same API as OpenAI's official Whisper API.
The authors claim that Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multitasking model that can perform multilingual speech recognition, speech translation, and language identification.
A Transformer sequence-to-sequence model is trained on various speech processing tasks, including multilingual speech recognition, speech translation, spoken language identification, and voice activity detection. These tasks are jointly represented as a sequence of tokens to be predicted by the decoder, allowing a single model to replace many stages of a traditional speech-processing pipeline. The multitask training format uses a set of special tokens that serve as task specifiers or classification targets.
Available models and languages
There are five model sizes, four with English-only versions, offering speed and accuracy tradeoffs. Below are the names of the available models and their approximate memory requirements and relative speed.
| Size | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
|---|---|---|---|---|---|
| tiny | 39 M | tiny.en | tiny | ~1 GB | ~32x |
| base | 74 M | base.en | base | ~1 GB | ~16x |
| small | 244 M | small.en | small | ~2 GB | ~6x |
| medium | 769 M | medium.en | medium | ~5 GB | ~2x |
| large | 1550 M | N/A | large | ~10 GB | 1x |
The .en models for English-only applications tend to perform better, especially the tiny.en and base.en models. The authors observed that the difference becomes less significant for the small.en and medium.en models.
Whisper's performance varies widely depending on the language. The figure below shows a WER (Word Error Rate) breakdown by language on the Fleurs dataset using the large-v2 model. More WER and BLEU scores for the other models and datasets can be found in Appendix D of the paper. Lower is better.
Original repository
https://github.com/openai/whisper
Run
```bash
git clone https://github.com/matusstas/openai-whisper-microservice.git
cd openai-whisper-microservice
docker-compose up -d --build
```
Docker Hub
- repository: https://hub.docker.com/r/matusstas/openai-whisper-microservice
- latest: `docker pull matusstas/openai-whisper-microservice:latest`
- 1.0.0: `docker pull matusstas/openai-whisper-microservice:1.0.0`
Swagger API documentation
After the container starts successfully, the Swagger documentation is available at http://localhost:80/docs
Endpoints
The API currently provides 12 endpoints, which are divided into 4 groups according to their use.

Model
GET /models-available
Description
Return a list of all available Whisper ASR models.
Response 200
[
"tiny.en",
"tiny",
"base.en",
"base",
"small.en",
"small",
"medium.en",
"medium",
"large-v1",
"large-v2",
"large"
]
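For example, the list can be fetched from Python (a minimal sketch; it assumes the service is reachable at http://localhost:80 as configured above and that the third-party requests package is installed, which is not part of this repository):

```python
# Sketch: list all available Whisper ASR models.
# Assumes the service listens on http://localhost:80 and `requests` is installed.
import requests

BASE_URL = "http://localhost:80"

available = requests.get(f"{BASE_URL}/models-available").json()
print(available)  # ["tiny.en", "tiny", "base.en", ...]
```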
GET /models-downloading
Description
Return a list of all Whisper ASR models that are currently being downloaded.
Response 200
[
"tiny.en"
]
GET /models-downloaded
Description
Return a list of all downloaded Whisper ASR models.
Response 200
[
"tiny.en",
"tiny"
]
POST /model
Description
Download a Whisper ASR model using a background task.
Parameters
- model_name (string, path, required): Name of the model
Response 201
{
"detail": "Model is being downloaded"
}
Response 400
{
"detail": "Invalid model"
}
Response 400
{
"detail": "Model not downloaded yet"
}
Response 409
{
"detail": "Model already exist"
}
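A hedged sketch of triggering a download and waiting for it to finish; it assumes the model name is passed as a path parameter (i.e. POST /model/{model_name}) and that the service runs on http://localhost:80:

```python
# Sketch: start a background download, then poll /models-downloaded until the model appears.
import time

import requests

BASE_URL = "http://localhost:80"
MODEL = "tiny.en"

resp = requests.post(f"{BASE_URL}/model/{MODEL}")
print(resp.status_code, resp.json())  # 201 {"detail": "Model is being downloaded"}

while MODEL not in requests.get(f"{BASE_URL}/models-downloaded").json():
    time.sleep(5)  # the model is still downloading
print(f"{MODEL} is ready")
```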
GET /model/{model_name}
Description
Return a Whisper ASR model.
Parameters
- model_name (string, path, required): Name of the model
Response 200
JSON RESPONSE IS TOO LONG TO DISPLAY
Response 400
{
"detail": "Model not downloaded yet"
}
Response 404
{
"detail": "Model not found"
}
DELETE /model/{model_name}
Description
Delete a downloaded Whisper ASR model.
Parameters
- model_name (string, path, required): Name of the model
Response 200
{
"detail": "Model was deleted"
}
Response 400
{
"detail": "Model not downloaded yet"
}
Response 404
{
"detail": "Model not found"
}
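Deleting a model works the same way (sketch, same base URL assumption as above):

```python
# Sketch: delete a previously downloaded model.
import requests

resp = requests.delete("http://localhost:80/model/tiny.en")
print(resp.status_code, resp.json())  # 200 {"detail": "Model was deleted"}
```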
POST /model/{model_name}/language
Description
Return all detected languages, sorted by score.
Parameters
- model_name (string, path, required): Name of the model
Request body
- file (string, $binary, required): Chosen audio file
Response 200
{
"en": 0.38421738147735596,
"cy": 0.2614089846611023,
"zh": 0.10288530588150024,
"nn": 0.04161091521382332,
"ko": 0.03617018833756447,
...
"uz": 9.862478833611021e-9
}
Response 400
{
"detail": "Model not multilingual"
}
Response 400
{
"detail": "Model not downloaded yet"
}
Response 404
{
"detail": "Model not found"
}
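A hedged sketch of calling language detection from Python; speech.mp3 is a placeholder file name, the model must be multilingual, and the base URL is assumed to be http://localhost:80:

```python
# Sketch: upload an audio file and get per-language scores from a multilingual model.
import requests

BASE_URL = "http://localhost:80"

with open("speech.mp3", "rb") as audio:
    resp = requests.post(f"{BASE_URL}/model/tiny/language", files={"file": audio})

scores = resp.json()       # {"en": 0.38, "cy": 0.26, ...}, sorted by score
print(next(iter(scores)))  # most likely language code, e.g. "en"
```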
POST /model/{model_name}/transcript
Description
Transcribe audio with a Whisper ASR model.
Parameters
- model_name (string, path, required): Name of the model
Request body
- task (string, required): Task: [transcribe, translate]
- language_code (string, required): Language code: [af, am, ar, as, az, ..., zh]
- media_type (string, required): Media type: [application/json, text/plain]
- format (string, required): Output format: [json, srt, tsv, txt, vtt]
- file (string, $binary, required): Chosen audio file
Response 200
{
"text": " I found that nothing in life is worthwhile unless you take risks. Nothing. Nelson Mandela said, there is no passion to be found playing small and settling for a life that's less than the one you're capable of living. Now I'm sure in your experiences in school and applying to college and...",
"segments": [
{
"id": 0,
"seek": 0,
"start": 0,
"end": 7,
"text": " I found that nothing in life is worthwhile unless you take risks.",
"tokens": [
50363,
314,
1043,
326,
2147,
287,
1204,
318,
24769,
4556,
345,
1011,
7476,
13,
50713
],
"temperature": 0,
"avg_logprob": -0.19224673257747166,
"compression_ratio": 1.5508021390374331,
"no_speech_prob": 0.013612011447548866
},
{
"id": 1,
"seek": 0,
"start": 7,
"end": 9,
"text": " Nothing.",
"tokens": [
50713,
10528,
13,
50813
],
"temperature": 0,
"avg_logprob": -0.19224673257747166,
"compression_ratio": 1.5508021390374331,
"no_speech_prob": 0.013612011447548866
},
{
"id": 2,
"seek": 0,
"start": 9,
"end": 15,
"text": " Nelson Mandela said, there is no passion to be found playing small",
"tokens": [
50813,
12996,
40233,
531,
11,
612,
318,
645,
7506,
284,
307,
1043,
2712,
1402,
51113
],
"temperature": 0,
"avg_logprob": -0.19224673257747166,
"compression_ratio": 1.5508021390374331,
"no_speech_prob": 0.013612011447548866
},
{
"id": 3,
"seek": 0,
"start": 15,
"end": 20,
"text": " and settling for a life that's less than the one you're capable of living.",
"tokens": [
51113,
290,
25446,
329,
257,
1204,
326,
338,
1342,
621,
262,
530,
345,
821,
6007,
286,
2877,
13,
51363
],
"temperature": 0,
"avg_logprob": -0.19224673257747166,
"compression_ratio": 1.5508021390374331,
"no_speech_prob": 0.013612011447548866
},
{
"id": 4,
"seek": 0,
"start": 20,
"end": 24,
"text": " Now I'm sure in your experiences in school and applying to college and...",
"tokens": [
51363,
2735,
314,
1101,
1654,
287,
534,
6461,
287,
1524,
290,
11524,
284,
4152,
290,
986,
51563
],
"temperature": 0,
"avg_logprob": -0.19224673257747166,
"compression_ratio": 1.5508021390374331,
"no_speech_prob": 0.013612011447548866
}
],
"language": "en"
}
Response 400
{
"detail": "Model not downloaded yet"
}
Response 404
{
"detail": "Model not found"
}
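A hedged sketch of a transcription request from Python. The form field names mirror the request body above, but the exact multipart layout may differ from the generated Swagger schema; speech.mp3 and the base URL are placeholders:

```python
# Sketch: transcribe an audio file and receive the result as plain text.
import requests

BASE_URL = "http://localhost:80"

with open("speech.mp3", "rb") as audio:
    resp = requests.post(
        f"{BASE_URL}/model/tiny/transcript",
        data={
            "task": "transcribe",        # or "translate"
            "language_code": "en",
            "media_type": "text/plain",  # or "application/json"
            "format": "txt",             # json, srt, tsv, txt or vtt
        },
        files={"file": audio},
    )

print(resp.text)
```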
Language
GET /languages
Description
Return all available languages.
Response 200
{
"af": "afrikaans",
"am": "amharic",
"ar": "arabic",
"as": "assamese",
"az": "azerbaijani",
...
"zh": "chinese"
}
GET /language/{language_code}
Description
Return the English name of a language.
Response 200
"italian"
Response 404
{
"detail": "Language not found"
}
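Both language endpoints can be combined like this (sketch, same base URL assumption as above):

```python
# Sketch: list all supported languages and resolve a single language code.
import requests

BASE_URL = "http://localhost:80"

languages = requests.get(f"{BASE_URL}/languages").json()
print(len(languages))                                  # number of supported languages

print(requests.get(f"{BASE_URL}/language/it").json())  # "italian"
```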
Miscellaneous
GET /cuda
Description
Check whether a GPU with CUDA support is available on the current system. Returns a boolean value.
Response 200
true
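For example (sketch, assuming the service runs on http://localhost:80):

```python
# Sketch: ask the service whether CUDA is available.
import requests

print(requests.get("http://localhost:80/cuda").json())  # True or False
```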
Root
GET /
Description
Return a message from the container to check that it is running.
Response 200
{
"detail": "Whisper API is running"
}
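A minimal liveness check from Python (sketch, assuming the root endpoint is GET / and the service runs on http://localhost:80):

```python
# Sketch: check that the container is up and responding.
import requests

print(requests.get("http://localhost:80/").json())  # {"detail": "Whisper API is running"}
```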
Data Persistence with Docker Volumes
In order not to lose important data, we added volumes; they are defined in the docker-compose.yml file.
Volume: models
All downloaded Whisper ASR models are stored in the /root/.cache/whisper folder. These are the models we work with.
Volume: db
Because the service also manages individual models (download and delete), I needed a simple database to keep track of them. I use TinyDB, a document-oriented database written in pure Python with no external dependencies; its data file, models.json, is stored in the db folder.
For example, the data in the file might look like this:
{
"_default":{
"1":{
"name":"tiny",
"downloaded":false
},
"2":{
"name":"tiny.en",
"downloaded":true
}
}
}
- downloaded: true = Whisper ASR model was downloaded
- downloaded: false = Whisper ASR model is still downloading
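As an illustration, the file can be read with TinyDB like this (a sketch based on the sample document above, not code taken from the repository):

```python
# Sketch: read db/models.json with TinyDB and list the models that finished downloading.
from tinydb import TinyDB, Query

db = TinyDB("db/models.json")
Model = Query()

downloaded = db.search(Model.downloaded == True)
print([entry["name"] for entry in downloaded])  # e.g. ["tiny.en"]
```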
Star History
Todo
- GPU support (I'm working on it right now)