Docker image for running whisper-ctranslate2
Hello, I just wanted to start a discussion about running whisper-ctranslate2 in Docker. Referencing faster-whisper issue #109, I came up with the following Dockerfile, which works:
# Use Ubuntu as base
FROM ubuntu:20.04
# Alternatively, use a base image with CUDA and cuDNN support
# FROM nvidia/cuda:11.7.1-cudnn8-runtime-ubuntu20.04
# Install necessary dependencies
RUN apt-get update && apt-get install -y python3-pip
# Set the working directory
WORKDIR /app
# Copy the app code and requirements file
COPY . /app
# Install dependencies
RUN pip3 install --no-cache-dir -r requirements.txt
# Install whisper-ctranslate2
RUN pip3 install -U whisper-ctranslate2
# Set the entry point
ENTRYPOINT ["whisper-ctranslate2"]
Build with: docker build -t asr .
Run with: docker run --rm -v /path/to/folder:/app --gpus '"device=0,1"' asr myfile.mp3 --compute_type int8
My observations are:
- If the entrypoint is set, the container will not show transcribed lines as it runs; the results are only printed afterwards.
- However, the progress is shown if you run the Docker container (without the entrypoint) in interactive mode, as in the example below.
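For example (untested here, adjust the paths to your setup), you can override the entrypoint to get a shell and then run the tool manually to watch the progress; presumably this works because a TTY is attached, so the output is no longer block-buffered:
docker run --rm -it --entrypoint /bin/bash -v /path/to/folder:/app --gpus '"device=0,1"' asr
# then, inside the container:
whisper-ctranslate2 myfile.mp3 --compute_type int8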
Correct me if I'm wrong, but is that repository literally just a wrapper around faster_whisper? I'm confused why it's called insanely-fast-whisper.
I'm no expert but here's what it looks like to me. Compared to openai/whisper (or faster_whisper), insanely-fast-whisper:
- Uses Whisper models which are in the 🤗 transformers format
- Supports batching
- Supports 🤗 bettertransformer
- Supports the new 🤗 distil-whisper models (which are much faster)
Using one or all of these features leads to faster transcriptions.
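To make that concrete, here is a minimal sketch of what those features look like through the 🤗 transformers ASR pipeline. The checkpoint name, chunk length, and batch size below are placeholders I picked for illustration, not values taken from insanely-fast-whisper itself:
# Rough sketch (not tested here) of combining the features listed above.
import torch
from transformers import pipeline

# Load a Whisper/distil-whisper checkpoint in 🤗 transformers format
pipe = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-large-v2",
    torch_dtype=torch.float16,
    device="cuda:0",
)

# Swap in BetterTransformer kernels (requires the optimum package)
pipe.model = pipe.model.to_bettertransformer()

# Batched, chunked long-form transcription
result = pipe("myfile.mp3", chunk_length_s=30, batch_size=8)
print(result["text"])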
Hm, so to answer OP's question: no, it shouldn't be too much work. In fact, point 3 is one line of code, and points 1 and 3 use the same code to load Hugging Face models. I believe batching is already handled to some extent when using this webservice, depending on your application. Adding a batching option also shouldn't be hard, though. I can take a crack at it.
Did anything come of this? The benchmarks posted by insanely-fast-whisper are hugely impressive versus just faster-whisper (1min18s for 150 minutes of audio with insanely-fast-whisper versus 8min15s for faster-whisper).
Sorry, got busy. I'm sure speech research has progressed a lot since this issue was opened, and I've got a little time on my hands now. Are there newer/better/faster models already supported or requested?