faster-whisper-server
PRELOAD_MODELS doesn't work on the latest Docker image tag (but when built locally, it works)
With a local deployment, the PRELOAD_MODELS config variable works perfectly:
PRELOAD_MODELS='["Systran/faster-whisper-medium.en", "Systran/faster-whisper-small.en"]' MAX_MODELS=2 uvicorn main:app --port 8080 --log-level debug --reload
But in a Docker Compose setup it does not. Here is the compose file:
services:
  faster-whisper-server-cuda:
    image: fedirz/faster-whisper-server:latest-cuda
    volumes:
      - /data/models/test:/root/.cache/huggingface
    restart: unless-stopped
    ports:
      - 8000:8000
    environment:
      - LOG_LEVEL=debug
      - ENABLE_UI=False
      - MAX_MODELS=2
      - PRELOAD_MODELS='["Systran/faster-whisper-medium.en", "Systran/faster-whisper-small.en"]'
    develop:
      watch:
        - path: faster_whisper_server
          action: rebuild
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
I tried different types of quotes:
- PRELOAD_MODELS='["Systran/faster-whisper-medium.en", "Systran/faster-whisper-small.en"]'
- PRELOAD_MODELS=["Systran/faster-whisper-medium.en", "Systran/faster-whisper-small.en"]
- 'PRELOAD_MODELS="["Systran/faster-whisper-medium.en", "Systran/faster-whisper-small.en"]'
The models are not downloaded to my volume or anywhere else. Any ideas? Thanks in advance.
You're mistaken, you're not inputting the model name; Systran is just the name of the repo of the guy who makes faster-whisper.
I have this in my docker compose:
- PRELOAD_MODELS=["large-v3"]
It's probably worth adding an example in the yaml file
Just the model name doesn't work either. It can't work, since Hugging Face needs the full model ID to download the model.
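For illustration, here is how the two forms would look in a compose environment block (a sketch; the repo ID is one of the Systran ones discussed later in this thread):
environment:
  # Full Hugging Face repo ID (org/name), resolvable on the Hub
  - PRELOAD_MODELS=["Systran/faster-whisper-large-v3"]
  # Bare model name, not a full repo ID, so a Hub download of it fails
  # - PRELOAD_MODELS=["large-v3"]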
I've just rebuilt the image directly from the repository and it works perfectly; there must be a difference between the main branch and the latest-cuda tag.
For example, with ["large-v3"] and the freshly built local image:
The error is a classic Hugging Face one, since large-v3 is not a known model ID on HF.
With the same image but with ["Systran/faster-whisper-large-v3", "Systran/faster-distil-whisper-large-v3"]:
That works :) Can you push an image with the latest code version, maybe?
I'm not the owner of this repo so I'll leave that up to them :)
I'm experiencing the same issue. Have you been able to find a solution for it?
environment:
  - PRELOAD_MODELS=["Systran/faster-whisper-medium"]
works for me
Which image tag do you use?
services:
  faster-whisper-server-cuda:
    image: fedirz/faster-whisper-server:latest-cuda
    build:
      dockerfile: Dockerfile.cuda
      context: .
      platforms:
        - linux/amd64
        - linux/arm64
    restart: unless-stopped
    ports:
      - 8000:8000
    environment:
      - PRELOAD_MODELS=["Systran/faster-whisper-medium"]
    volumes:
      - hugging_face_cache:/root/.cache/huggingface
    develop:
      watch:
        - path: faster_whisper_server
          action: rebuild
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['1']
              capabilities: ["gpu"]
volumes:
  hugging_face_cache:
Well, I'm definitely encountering this issue now. It happened when I switched to large-v3, but it might have nothing to do with that, since reusing my previous config does not seem to preload either.
So it seems to have broken recently.
Here's my compose content, where I added comments.
faster-whisper-server-cuda:
  image: fedirz/faster-whisper-server:latest-cuda
  build:
    dockerfile: Dockerfile.cuda
    context: .
    platforms:
      - linux/amd64
  volumes:
    - /home/root/.cache/huggingface:/root/.cache/huggingface
  restart: unless-stopped
  ports:
    - 8001:8001
  environment:
    - UVICORN_PORT=8001
    - ENABLE_UI=false
    - MIN_DURATION=1
    # default TTL is 300 (5min), -1 to disable, 0 to unload directly, 43200=12h
    - WHISPER__TTL=43200
    - WHISPER__INFERENCE_DEVICE=cuda
    - WHISPER__COMPUTE_TYPE=int8
    - WHISPER__MODEL=deepdml/faster-whisper-large-v3-turbo-ct2 # works (finds the right model)
    - PRELOAD_MODELS=["deepdml/faster-whisper-large-v3-turbo-ct2"] # doesn't work (no preloading)
    # - PRELOAD_MODELS=["faster-whisper-large-v3-turbo-ct2"] # doesn't work either
    # Used to work but not anymore
    # - WHISPER__MODEL=large-v3
    # - PRELOAD_MODELS=["large-v3"]
  develop:
    watch:
      - path: faster_whisper_server
        action: rebuild
  deploy:
    resources:
      reservations:
        devices:
          - capabilities: ["gpu"]
  network_mode: host
  pull_policy: always
(Very sorry for bothering you @fedirz, but because this issue was closed in the past I'm afraid you might miss it when catching up, so I'm humbly notifying you and asking to reopen it just in case. Of course, do what you want and keep it closed if that's how you work :))
Same for me, preloading models doesn't work. It's not that big of a deal, but it would still make transcribing faster...
Same error for me.
Hi everyone, I still have this issue, so I did a workaround:
whisper_watcher:
  image: docker:dind
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock
    - ./jfk.mp3:/jfk.mp3
  command: |
    sh -c '
    docker events \
      --filter "type=container" \
      --filter "container=speaches" \
      --filter "event=start" \
      --filter "event=create" \
      --filter "event=restart" | \
    while read event; do
      delay=5
      apk add --no-cache curl
      echo "Waiting $$delay seconds before loading jfk.mp3 to speaches"
      sleep $$delay
      echo "Done waiting, sending jfk.mp3"
      attempt=1
      max_attempts=5
      while [ $$attempt -le $$max_attempts ]; do
        if curl -X POST "http://localhost:8001/v1/audio/transcriptions" \
          -H "Authorization: Bearer YOURKEY" \
          -H "Content-Type: multipart/form-data" \
          -F "model=deepdml/faster-whisper-large-v3-turbo-ct2" \
          -F "response_format=text" \
          -F "file=@/jfk.mp3"; then
          echo "Success on attempt $$attempt"
          break
        else
          echo "Attempt $$attempt failed. Retrying in $$delay seconds..."
          [ $$attempt -lt $$max_attempts ] && sleep $$delay
          attempt=$$((attempt + 1))
        fi
      done
      echo "Done with transcript"
    done
    '
  restart: unless-stopped
  network_mode: host
This basically calls a transcription on jfk.mp3 when the speaches container starts/restarts/gets created.
It seems to be working fine for me.
I don't see it documented in https://speaches.ai/configuration/#speaches.config.WhisperConfig so I don't think this functionality is exposed.
These are the available variables we can use:
class WhisperConfig(BaseModel):
    """See https://github.com/SYSTRAN/faster-whisper/blob/master/faster_whisper/transcribe.py#L599."""

    model: str = Field(default="Systran/faster-whisper-small")
    """
    Default HuggingFace model to use for transcription. Note, the model must support being ran using CTranslate2.
    This model will be used if no model is specified in the request.
    Models created by authors of `faster-whisper` can be found at https://huggingface.co/Systran
    You can find other supported models at https://huggingface.co/models?p=2&sort=trending&search=ctranslate2 and https://huggingface.co/models?sort=trending&search=ct2
    """
    inference_device: Device = Field(default=Device.AUTO)
    device_index: int | list[int] = 0
    compute_type: Quantization = Field(default=Quantization.DEFAULT)
    cpu_threads: int = 0
    num_workers: int = 1
    ttl: int = Field(default=300, ge=-1)
    """
    Time in seconds until the model is unloaded if it is not being used.
    -1: Never unload the model.
    0: Unload the model immediately after usage.
    """
    use_batched_mode: bool = False
    """
    Whether to use batch mode(introduced in 1.1.0 `faster-whisper` release) for inference. This will likely become the default in the future and the configuration option will be removed.
    """  # noqa: E501
They just have to be prefixed with WHISPER__.
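For example, a compose environment block setting some of those fields through env vars (the values here are just illustrative):
environment:
  # WHISPER__<FIELD> maps onto WhisperConfig.<field>
  - WHISPER__MODEL=Systran/faster-whisper-small
  - WHISPER__INFERENCE_DEVICE=cuda
  - WHISPER__COMPUTE_TYPE=int8
  - WHISPER__TTL=-1  # -1 means the model is never unloaded once loaded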
Thanks. It appears you are right. It's too bad this feature got removed without a deprecation notice or errors, though. AFAIC this should be closed OR the feature re-added.
Unfortunately this is correct; it's been deprecated in this commit: https://github.com/speaches-ai/speaches/commit/4aa5cf9c49b0becfb5282f0e8e0bb4d48c00c9e5
Why is it not recommended to pre-load? Could it be because it increases potential power usage by the GPU? There are solutions to keep GPU power under control even with models loaded; for example, one could use https://github.com/sasha0552/nvidia-pstated. IMHO, it should have been left in; there are use cases where it made sense, like mine, where I wanted the model preloaded for faster initial inference.
Ollama supports model pre-loading; I don't see a reason why Speaches can't :(
I hope we can get this reverted
I mentioned this in https://github.com/speaches-ai/speaches/issues/314
I am in favour. Although my workaround is fine for my use case.
Workaround with a web request coming in to preload the model just feels like a dirty way of doing it.
Ollama supports model pre-loading,
I don't believe that Ollama supports this. See https://github.com/ollama/ollama/issues/6295 (Please correct me if I'm wrong)
Workaround with a web request coming in to preload the model just feels like a dirty way of doing it.
This approach is recommended in the Ollama issue referenced above.
If you are using Docker Compose, you may want to look into lifecycle hooks. I haven't used this, but you should be able to make a curl request a couple of seconds after a container starts to load the model.
I hope we can get this reverted
It's unlikely, but I'll keep the feedback from this issue in mind for the future.
Thanks for the suggestion to use lifecycle hooks, here's the updated workaround for speaches:
volumes:
  - ./jfk.mp3:/jfk.mp3 # for the post_start command
post_start:
  - command: |
      bash -c '
      delay=5
      echo "Waiting $$delay seconds before loading jfk.mp3 to speaches"
      sleep $$delay
      echo "Done waiting, sending jfk.mp3"
      attempt=1
      max_attempts=5
      while [ $$attempt -le $$max_attempts ]; do
        if curl -X POST "http://localhost:8001/v1/audio/transcriptions" \
          -H "Authorization: Bearer $$API_KEY" \
          -H "Content-Type: multipart/form-data" \
          -F "model=$$WHISPER__MODEL" \
          -F "response_format=text" \
          -F "file=@/jfk.mp3"; then
          echo "Success on attempt $$attempt"
          break
        else
          echo "Attempt $$attempt failed. Retrying in $$delay seconds..."
          [ $$attempt -lt $$max_attempts ] && sleep $$delay
          attempt=$$((attempt + 1))
        fi
      done
      echo "Done with transcript"
      '
    user: root