whisper-diarization

Docker build

Open meonkeys opened this issue 1 year ago • 4 comments

Just thought it would be handy to have a Docker image for this tool. I've been unable to get it working so far but I'll keep trying. If anyone else has it running in Docker, please share.

meonkeys, Nov 05 '23 17:11

I got an image built. It's not clean enough for a pull request but I'll share what I've got anyway. Maybe someone else can pick this up and contribute it (assuming the maintainers want it).

I'm just creating a Dockerfile in a working copy (local clone) of this repository (HEAD at 2bdffc6b6e6e0d9ee8632dabf5009e995b31028d) and building with Docker. Here's the Dockerfile:

# FIXME: Makes a huge image.
# TODO: Optimize with a multi-stage build, perhaps also using venv.

# Pin to 3.10-bookworm to get Python 3.10
# because https://github.com/MahmoudAshraf97/whisper-diarization/issues/90
FROM python:3.10-bookworm

ARG WD_USER=joe
ARG WD_UID=1000
ARG WD_GROUP=joe
ARG WD_GID=1000

# We rarely see a full upgrade in a Dockerfile. Why?
# && apt-get --assume-yes dist-upgrade \
RUN apt-get update \
  && apt-get --assume-yes --no-install-recommends install \
  cython3 \
  ffmpeg \
  unzip \
  wget \
  && rm -rf /var/lib/apt/lists/*

WORKDIR /usr/src/app

COPY . .

RUN addgroup --gid $WD_GID $WD_GROUP \
  && adduser --uid $WD_UID --gid $WD_GID --shell /bin/bash --no-create-home $WD_USER \
  && chown -R $WD_USER:$WD_GROUP /usr/src/app

USER $WD_USER:$WD_GROUP

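# Create the virtualenv (python -m venv makes the directory itself)
# and install dependencies without keeping pip's download cache.
RUN python -m venv venv \
  && . venv/bin/activate \
  && pip install --no-cache-dir Cython \
  && pip install --no-cache-dir --requirement requirements.txt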

Build with the following command (the trailing dot is the build context, i.e. the working copy):

docker build --tag whisper-diarization .

The rest assumes a Bash shell on Linux or something close to / compatible with that.
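Since the Dockerfile declares WD_UID and WD_GID as build arguments, the image can also be built so that the container user matches whoever runs the build, which keeps files in bind-mounted directories owned by that host user. A minimal sketch: the function name wd_build and the DOCKER override variable are made up for illustration, and it assumes you are not building as root (UID 0 already exists in the base image).

```bash
# Hypothetical helper, not part of the repository.
# DOCKER can be overridden (e.g. DOCKER=echo) to preview the command.
wd_build() {
  "${DOCKER:-docker}" build \
    --build-arg "WD_UID=$(id -u)" \
    --build-arg "WD_GID=$(id -g)" \
    --tag whisper-diarization .
}
```

Run wd_build from the root of the working copy. The user name inside the image stays joe unless WD_USER and WD_GROUP are overridden too.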

As host user joe (UID 1000, GID 1000, the defaults in the Dockerfile), run with, for example:

BASE="$HOME/whisper-diarization"
mkdir -p "$BASE/data"
mkdir -p "$BASE/HOME_CACHE"
mkdir -p "$BASE/HOME_CONFIG"
APP=/usr/src/app
# Put the recording where the container will see it as /data/recording.mp3
mv /tmp/recording.mp3 "$BASE/data/"
docker run --rm -it \
  -v "$BASE/data:/data" \
  -v "$BASE/HOME_CONFIG:$APP/.config" \
  -v "$BASE/HOME_CACHE:$APP/.cache" \
  --user joe:joe \
  whisper-diarization \
  bash

Now you're in the container at a non-root shell prompt, presumably. Run:

export HOME=/usr/src/app
source venv/bin/activate
python diarize_parallel.py -a /data/recording.mp3
exit

Now, inspect and manually clean up $BASE/data/recording.txt on the host.
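The interactive steps above can also be collapsed into one non-interactive command. This is only a sketch, assuming the image and host directory layout described in this comment. The function name wd_diarize and the DOCKER override variable are made up for illustration.

```bash
# Hypothetical helper: run diarization on one file under $BASE/data.
# Usage: wd_diarize recording.mp3
wd_diarize() {
  local audio="$1"
  local base="${BASE:-$HOME/whisper-diarization}"
  local app=/usr/src/app
  # -e HOME=... replaces the manual 'export HOME=/usr/src/app'.
  # The venv is still activated inside the container, so anything
  # that looks up python on PATH gets the venv interpreter.
  "${DOCKER:-docker}" run --rm \
    -v "$base/data:/data" \
    -v "$base/HOME_CONFIG:$app/.config" \
    -v "$base/HOME_CACHE:$app/.cache" \
    --user joe:joe \
    -e "HOME=$app" \
    whisper-diarization \
    bash -c 'source venv/bin/activate && python diarize_parallel.py -a "$1"' _ "/data/$audio"
}
```

Setting DOCKER=echo prints the assembled command instead of running it, which is handy for checking the mounts before a long run.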

meonkeys, Nov 07 '23 05:11

Don't forget the --gpus all for docker run (if you want to use your GPU).
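One quick way to check that a container can actually see the GPU before starting a long transcription is sketched below. It assumes the NVIDIA Container Toolkit is installed on the host, which is what makes --gpus work and normally makes nvidia-smi available inside the container. The function name wd_gpu_check and the DOCKER override variable are illustrative only.

```bash
# Hypothetical helper: list the GPUs visible inside the container.
wd_gpu_check() {
  "${DOCKER:-docker}" run --rm --gpus all whisper-diarization nvidia-smi
}
```

If this prints the usual GPU table, adding the same --gpus all flag to the docker run command used for diarization should expose the GPU there too.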

cvette, Nov 09 '23 20:11

Just released "transcription stream" on GitHub today, which includes a docker image that runs diarize.py. Takes me about 15 minutes to build, but works great and is fast/automated. Would love to get your thoughts: https://github.com/transcriptionstream/transcriptionstream

transcriptionstream, Nov 13 '23 22:11

It took me 30 minutes to build and the image is 7.5 GB, but it works. Thanks for sharing :)

occult, Apr 25 '24 23:04