
PRELOAD_MODELS doesn't work with the latest Docker image tag (but it works when built locally)

Open leoguillaume opened this issue 1 year ago • 10 comments

With a local deployment, the PRELOAD_MODELS config variable works perfectly:

PRELOAD_MODELS='["Systran/faster-whisper-medium.en", "Systran/faster-whisper-small.en"]' MAX_MODELS=2 uvicorn main:app --port 8080 --log-level debug --reload

[screenshot]

But with Docker Compose it does not: [screenshot]

The Docker Compose file:

services:
  faster-whisper-server-cuda:
    image: fedirz/faster-whisper-server:latest-cuda
    volumes:
      - /data/models/test:/root/.cache/huggingface
    restart: unless-stopped
    ports:
      - 8000:8000
    environment:
      - LOG_LEVEL=debug
      - ENABLE_UI=False
      - MAX_MODELS=2
      - PRELOAD_MODELS='["Systran/faster-whisper-medium.en", "Systran/faster-whisper-small.en"]'
    develop:
      watch:
        - path: faster_whisper_server
          action: rebuild
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

I tried different types of quotes:

  • PRELOAD_MODELS='["Systran/faster-whisper-medium.en", "Systran/faster-whisper-small.en"]'
  • PRELOAD_MODELS=["Systran/faster-whisper-medium.en", "Systran/faster-whisper-small.en"]
  • 'PRELOAD_MODELS="["Systran/faster-whisper-medium.en", "Systran/faster-whisper-small.en"]'
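As an aside, and not necessarily the root cause here (the unquoted variant fails too): in Compose's list-style `environment:` there is no shell to strip quotes, so any surrounding quotes become part of the value. Assuming the server JSON-decodes the variable (pydantic-settings does this for list-typed fields), a minimal sketch of why the quoted variants can't parse while the unquoted one can:

```python
import json

# What the container would receive for each variant; Compose passes the
# YAML scalar through as-is, quotes included (no shell stripping).
quoted = "'[\"Systran/faster-whisper-medium.en\"]'"   # variants 1 and 3
unquoted = '["Systran/faster-whisper-medium.en"]'     # variant 2

print(json.loads(unquoted))   # ['Systran/faster-whisper-medium.en']

try:
    json.loads(quoted)        # the literal single quotes make this invalid JSON
except json.JSONDecodeError as exc:
    print("rejected:", exc)
```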

The models are not downloaded to my volume or anywhere else. Any ideas? Thanks in advance.

leoguillaume avatar Sep 10 '24 13:09 leoguillaume

You're mistaken, you're not inputting the model name; Systran is just the name of the repo of the person who makes faster-whisper.

I have this in my docker compose: - PRELOAD_MODELS=["large-v3"]

thiswillbeyourgithub avatar Sep 10 '24 13:09 thiswillbeyourgithub

It's probably worth adding an example in the YAML file.

thiswillbeyourgithub avatar Sep 10 '24 13:09 thiswillbeyourgithub

  • Just the model name doesn't work either; in fact it can't work, since Hugging Face needs the whole model ID to download the model.
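A minimal sketch of that point, assuming the entries in PRELOAD_MODELS are resolved directly as Hugging Face repo IDs (rather than through faster-whisper's size-name aliases):

```python
from huggingface_hub import snapshot_download

# A full "namespace/name" repo ID resolves on the Hub:
snapshot_download("Systran/faster-whisper-large-v3")

# A bare size alias is not a Hub repo ID, so the download fails with a
# repository-not-found error (the "classic huggingface" error mentioned below):
snapshot_download("large-v3")  # raises RepositoryNotFoundError
```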

I've just rebuilt the image directly from the repository and it works perfectly; there must be a difference between the main branch and the latest-cuda tag.

For example, with ["large-v3"] and the freshly built local image: [screenshot] The error is a classic Hugging Face one, since large-v3 is not a known model ID on HF.

With the same image but with ["Systran/faster-whisper-large-v3", "Systran/faster-distil-whisper-large-v3"]:

[screenshot]

That works :) Can you push an image with the latest code version, maybe?

leoguillaume avatar Sep 10 '24 13:09 leoguillaume

I'm not the owner of this repo so I'll leave that up to them :)

thiswillbeyourgithub avatar Sep 10 '24 13:09 thiswillbeyourgithub

I'm experiencing the same issue. Have you been able to find a solution for it?

willy-r avatar Sep 11 '24 19:09 willy-r

    environment:
      - PRELOAD_MODELS=["Systran/faster-whisper-medium"]

works for me

gsoul avatar Sep 12 '24 11:09 gsoul


    environment:

      - PRELOAD_MODELS=["Systran/faster-whisper-medium"]

works for me

Which image tag do you use?

leoguillaume avatar Sep 12 '24 16:09 leoguillaume

services:
  faster-whisper-server-cuda:
    image: fedirz/faster-whisper-server:latest-cuda
    build:
      dockerfile: Dockerfile.cuda
      context: .
      platforms:
        - linux/amd64
        - linux/arm64
    restart: unless-stopped
    ports:
      - 8000:8000
    environment:
      - PRELOAD_MODELS=["Systran/faster-whisper-medium"]
    volumes:
      - hugging_face_cache:/root/.cache/huggingface
    develop:
      watch:
        - path: faster_whisper_server
          action: rebuild
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['1']
              capabilities: ["gpu"]

volumes:
  hugging_face_cache:

gsoul avatar Sep 12 '24 18:09 gsoul

Well, I'm definitely encountering this issue now. It happened when I switched to large-v3, but it might have nothing to do with that, since reusing my previous config does not seem to preload either.

So it seems to have broken recently.

Here's my compose content, with comments added.

      faster-whisper-server-cuda:
        image: fedirz/faster-whisper-server:latest-cuda
        build:
          dockerfile: Dockerfile.cuda
          context: .
          platforms:
            - linux/amd64
        volumes:
          - /home/root/.cache/huggingface:/root/.cache/huggingface
        restart: unless-stopped
        ports:
          - 8001:8001
        environment:
          - UVICORN_PORT=8001
          - ENABLE_UI=false
          - MIN_DURATION=1
          # default TTL is 300 (5min), -1 to disable, 0 to unload directly, 43200=12h
          - WHISPER__TTL=43200
          - WHISPER__INFERENCE_DEVICE=cuda
          - WHISPER__COMPUTE_TYPE=int8
    
          - WHISPER__MODEL=deepdml/faster-whisper-large-v3-turbo-ct2  # works (finds the right model)
          - PRELOAD_MODELS=["deepdml/faster-whisper-large-v3-turbo-ct2"]  # doesn't work (no preloading)
          # - PRELOAD_MODELS=["faster-whisper-large-v3-turbo-ct2"]  # doesn't work either
          # Used to work but not anymore
          # - WHISPER__MODEL=large-v3
          # - PRELOAD_MODELS=["large-v3"]
        develop:
          watch:
            - path: faster_whisper_server
              action: rebuild
        deploy:
          resources:
            reservations:
              devices:
                - capabilities: ["gpu"]
        network_mode: host
        pull_policy: always

thiswillbeyourgithub avatar Oct 05 '24 16:10 thiswillbeyourgithub

(Very sorry for bothering you @fedirz but because this issue was closed in the past I'm afraid you would miss it when catching up so I'm humbly notifying you and asking to reopen this issue just in case, but of course do what you want and keep it closed if that's how you work :))

thiswillbeyourgithub avatar Oct 09 '24 16:10 thiswillbeyourgithub

          - WHISPER__INFERENCE_DEVICE=cuda

Same for me, preloading models doesn't work. It's not that big of a deal, but it would still make transcribing faster...

theodufort avatar Nov 09 '24 13:11 theodufort

Same error for me.

defaultsecurity avatar Jan 03 '25 21:01 defaultsecurity

Hi everyone, I still have this issue, so I did a workaround:


  whisper_watcher:
    image: docker:dind
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - ./jfk.mp3:/jfk.mp3
    command: |
      sh -c '
        docker events \
          --filter "type=container" \
          --filter "container=speaches" \
          --filter "event=start" \
          --filter "event=create" \
          --filter "event=restart" | \
      while read event; do
        delay=5
        apk add --no-cache curl
        echo "Waiting $$delay seconds before loading jfk.mp3 to speaches"
        sleep $$delay
        echo "Done waiting, sending jfk.mp3"

        attempt=1
        max_attempts=5
        while [ $$attempt -le $$max_attempts ]; do
          if curl -X POST "http://localhost:8001/v1/audio/transcriptions" \
            -H "Authorization: Bearer YOURKEY" \
            -H "Content-Type: multipart/form-data" \
            -F "model=deepdml/faster-whisper-large-v3-turbo-ct2" \
            -F "response_format=text" \
            -F "file=@/jfk.mp3"; then
            echo "Success on attempt $$attempt"
            break
          else
            echo "Attempt $$attempt failed. Retrying in $$delay seconds..."
            [ $$attempt -lt $$max_attempts ] && sleep $$delay
            attempt=$$((attempt + 1))
          fi
        done

        echo "Done with transcript"
      done
      '
    restart: unless-stopped
    network_mode: host

This basically calls a transcription on jfk.mp3 when the speaches container starts/restarts/gets created.

Seems to be working fine for me.
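For anyone who'd rather do the warm-up from Python than from a dind sidecar, a minimal sketch against the OpenAI-compatible endpoint (assuming the same port, API key, and model as above, and a local jfk.mp3):

```python
from pathlib import Path

from openai import OpenAI  # pip install openai

# Point the OpenAI client at the local speaches / faster-whisper-server instance.
client = OpenAI(base_url="http://localhost:8001/v1", api_key="YOURKEY")

# One dummy transcription forces the model into (GPU) memory, which is all
# the "preload" workaround above really does.
with Path("jfk.mp3").open("rb") as audio:
    text = client.audio.transcriptions.create(
        model="deepdml/faster-whisper-large-v3-turbo-ct2",
        file=audio,
        response_format="text",
    )
print(text)
```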

thiswillbeyourgithub avatar Feb 05 '25 15:02 thiswillbeyourgithub

I don't see it documented in https://speaches.ai/configuration/#speaches.config.WhisperConfig so I don't think this functionality is exposed.

These are the available variables we can use:

class WhisperConfig(BaseModel):
    """See https://github.com/SYSTRAN/faster-whisper/blob/master/faster_whisper/transcribe.py#L599."""

    model: str = Field(default="Systran/faster-whisper-small")
    """
    Default HuggingFace model to use for transcription. Note, the model must support being ran using CTranslate2.
    This model will be used if no model is specified in the request.

    Models created by authors of `faster-whisper` can be found at https://huggingface.co/Systran
    You can find other supported models at https://huggingface.co/models?p=2&sort=trending&search=ctranslate2 and https://huggingface.co/models?sort=trending&search=ct2
    """
    inference_device: Device = Field(default=Device.AUTO)
    device_index: int | list[int] = 0
    compute_type: Quantization = Field(default=Quantization.DEFAULT)
    cpu_threads: int = 0
    num_workers: int = 1
    ttl: int = Field(default=300, ge=-1)
    """
    Time in seconds until the model is unloaded if it is not being used.
    -1: Never unload the model.
    0: Unload the model immediately after usage.
    """
    use_batched_mode: bool = False
    """
    Whether to use batch mode(introduced in 1.1.0 `faster-whisper` release) for inference. This will likely become the default in the future and the configuration option will be removed.
    """  # noqa: E501

They just have to be prefixed with WHISPER__
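For reference, a minimal sketch of how that prefix maps onto the nested config (assuming speaches loads its settings via pydantic-settings with a `__` nested delimiter; the top-level class name here is illustrative):

```python
from pydantic import BaseModel, Field
from pydantic_settings import BaseSettings, SettingsConfigDict

class WhisperConfig(BaseModel):
    model: str = Field(default="Systran/faster-whisper-small")
    ttl: int = Field(default=300, ge=-1)

class Config(BaseSettings):
    # "__" splits WHISPER__MODEL into config.whisper.model, etc.
    model_config = SettingsConfigDict(env_nested_delimiter="__")
    whisper: WhisperConfig = WhisperConfig()

# With e.g. WHISPER__MODEL=deepdml/faster-whisper-large-v3-turbo-ct2 and
# WHISPER__TTL=-1 in the environment, Config() picks both up automatically
# (ttl=-1 keeps the model loaded once it has been used, per the docstring above).
print(Config().whisper)
```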

mateuszdrab avatar Feb 13 '25 11:02 mateuszdrab

Thanks. It appears you are right. It's too bad this feature got removed without a deprecation notice or errors, though. Afaic, this should be closed OR the feature re-added.

thiswillbeyourgithub avatar Feb 13 '25 22:02 thiswillbeyourgithub

Unfortunately this is correct; it was deprecated in this commit: https://github.com/speaches-ai/speaches/commit/4aa5cf9c49b0becfb5282f0e8e0bb4d48c00c9e5

Why is it not recommended to pre-load? Could it be because it increases potential power usage by the GPU? There are solutions to keep GPU power under control even with models loaded; for example, one could use https://github.com/sasha0552/nvidia-pstated. Imho, it should have been left in; there are use cases where it made sense, like mine, where I wanted the model preloaded for faster initial inference.

Ollama supports model pre-loading, I don't see a reason why Speaches can't :(

I hope we can get this reverted

I mentioned this in https://github.com/speaches-ai/speaches/issues/314

mateuszdrab avatar Feb 13 '25 23:02 mateuszdrab

I am in favour. Although my workaround is fine for my use case.

thiswillbeyourgithub avatar Feb 14 '25 21:02 thiswillbeyourgithub

I am in favour. Although my workaround is fine for my use case.

A workaround with a web request coming in to preload the model just feels like a dirty way of doing it.

mateuszdrab avatar Feb 14 '25 21:02 mateuszdrab

Ollama supports model pre-loading,

I don't believe that Ollama supports this. See https://github.com/ollama/ollama/issues/6295 (Please correct me if I'm wrong)

Workaround with a web request coming in to preload the model just feels like a dirty way of doing it.

This approach is recommended in the Ollama issue referenced above.

If you are using Docker Compose, you may want to look into lifecycle hooks. I haven't used this myself, but you should be able to make a curl request a couple of seconds after the container starts to load the model.

I hope we can get this reverted

It's unlikely, but I'll keep the feedback from this issue in mind for the future.

fedirz avatar Feb 19 '25 04:02 fedirz

Thanks for the suggestion to use lifecycle hooks, here's the updated workaround for speaches:


    volumes:
      - ./jfk.mp3:/jfk.mp3  # for the post_start command


    post_start:
      - command: |
          bash -c '
           delay=5
           echo "Waiting $$delay seconds before loading jfk.mp3 to speaches"
           sleep $$delay
           echo "Done waiting, sending jfk.mp3"

           attempt=1
           max_attempts=5
           while [ $$attempt -le $$max_attempts ]; do
             if curl -X POST "http://localhost:8001/v1/audio/transcriptions" \
               -H "Authorization: Bearer $$API_KEY" \
               -H "Content-Type: multipart/form-data" \
               -F "model=$$WHISPER__MODEL" \
               -F "response_format=text" \
               -F "file=@/jfk.mp3"; then
               echo "Success on attempt $$attempt\n"
               break
             else
               echo "Attempt $$attempt failed. Retrying in $$delay seconds..."
               [ $$attempt -lt $$max_attempts ] && sleep $$delay
               attempt=$$((attempt + 1))
             fi
           done

           echo "Done with transcript"
          '
        user: root


thiswillbeyourgithub avatar Feb 20 '25 12:02 thiswillbeyourgithub