Very cool project. I was looking to build something like this myself and then found this repo. There are some errors if you try to run it from the instructions; I think I managed to find and fix some of them, but I may have missed something, and I'm curious what the state of this is or what the reason for the errors might be. Here is what I have done so far.
git clone https://github.com/collabora/WhisperLive.git
cd WhisperLive
docker build . -f docker/Dockerfile.tensorrt -t whisperlive-tensorrt
docker run -p 9090:9090 --runtime=nvidia --gpus all --entrypoint /bin/bash -it whisperlive-tensorrt
python3 run_server.py --port 9090 --backend tensorrt --trt_model_path "/app/TensorRT-LLM-examples/whisper/whisper_large-v3_float16" --trt_multilingual --max_clients 1 --max_connection_time 600
Now, to test it I used uv. I first had to install portaudio, because otherwise I got a build error:
sudo apt install portaudio19-dev
My pyproject.toml:
[project]
name = "code"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.10"
dependencies = [
    "whisper-live>=0.7.1",
]
I wrote this test script:
from whisper_live.client import TranscriptionClient

# This is the client that will connect to your running server
client = TranscriptionClient(
    "localhost",                       # Hostname of your server
    9090,                              # Port your server is running on
    lang="en",                         # Language of the audio
    translate=False,                   # Transcribe only, no translation
    model="whisper_large-v3_float16",  # Sent to the server, but the server uses the model it was started with
    use_vad=False,                     # Use voice activity detection
)

print("Client initialized, sending audio...")

# This calls the server with your audio file.
# Make sure "test.mp3" exists in the same directory.
try:
    client("test.mp3")
except FileNotFoundError:
    print("Error: The file 'test.mp3' was not found.")
    print("Please make sure the audio file is in the same directory as this script.")
I also tried changing the model to large-v3, but the same thing happened. When I run this test, I get the following on the server (very hard to catch, as it spams the last lines infinitely and then crashes):
python3 run_server.py --port 9090 --backend tensorrt --trt_model_path "/app/TensorRT-LLM-examples/whisper/whisper_large-v3_float16" --trt_multilingual --max_clients 1 --max_connection_time 600
/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:61: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
INFO:root:Custom model option was provided. Switching to single model mode.
INFO:websockets.server:connection open
INFO:root:New client connected
<frozen importlib._bootstrap_external>:1184: FutureWarning: The cuda.cuda module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.driver module instead.
<frozen importlib._bootstrap_external>:1184: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
[TensorRT-LLM] TensorRT-LLM version: 0.18.2
[TensorRT-LLM][INFO] Engine version 0.18.2 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Setting encoder max input length and hidden size for accepting visual features.
[TensorRT-LLM][INFO] Engine version 0.18.2 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.18.2 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Setting encoder max input length and hidden size for accepting visual features.
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Engine version 0.18.2 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Setting encoder max input length and hidden size for accepting visual features.
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Engine version 0.18.2 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 4
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 4
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 3000
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (3000) * 32
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 2999 = maxSequenceLen - 1 since chunked context is enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: 3000 = maxSequenceLen.
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 1228 MiB
[TensorRT-LLM][INFO] Inspecting the engine to identify potential runtime issues...
[TensorRT-LLM][INFO] The profiling verbosity of the engine does not allow this analysis to proceed. Re-build the engine with 'detailed' profiling verbosity to get more diagnostics.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 205.08 MiB for execution context memory.
[TensorRT-LLM][INFO] [MS] Running engine with multi stream info
[TensorRT-LLM][INFO] [MS] Number of aux streams is 1
[TensorRT-LLM][INFO] [MS] Number of total worker streams is 2
[TensorRT-LLM][INFO] [MS] The main stream provided by execute/enqueue calls is the first worker stream
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 1218 (MiB)
[TensorRT-LLM][INFO] TRTEncoderModel mMaxInputLen: reset to 3000 from build config.
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][WARNING] Fix optionalParams : KV cache reuse disabled because model was not built with paged context FMHA support
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 4
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 4
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 225
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (225) * 32
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 900
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 224 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] The logger passed into createInferRuntime differs from one already provided for an existing builder, runtime, or refitter. Uses of the global logger, returned by nvinfer1::getLogger(), will return the existing value.
[TensorRT-LLM][INFO] Loaded engine size: 2065 MiB
[TensorRT-LLM][INFO] Inspecting the engine to identify potential runtime issues...
[TensorRT-LLM][INFO] The profiling verbosity of the engine does not allow this analysis to proceed. Re-build the engine with 'detailed' profiling verbosity to get more diagnostics.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 120.01 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 3274 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 29.73 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 2.18 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 47.71 GiB, available: 38.55 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 3554
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 17.35 GiB for max tokens in paged KV cache (113728).
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 17.35 GiB for max tokens in paged KV cache (113728).
[TensorRT-LLM][INFO] This is an Encoder-Decoder model, set 0.5 cross KV cache fraction based on the config.
[TensorRT-LLM][INFO] Number of blocks in self KV cache primary pool: 1777, in cross KV cache primary pool: 1777
[TensorRT-LLM][INFO] Number of blocks in self KV cache secondary pool: 0, in cross KV cache secondary pool: 0
INFO:root:[INFO:] Warming up TensorRT engine..
^CTraceback (most recent call last):
  File "/app/run_server.py", line 57, in <module>
    server.run(
  File "/app/whisper_live/server.py", line 441, in run
    server.serve_forever()
  File "/usr/local/lib/python3.10/dist-packages/websockets/sync/server.py", line 275, in serve_forever
    poller.select()
  File "/usr/lib/python3.10/selectors.py", line 469, in select
    fd_event_list = self._selector.poll(timeout, max_ev)
KeyboardInterrupt
[TensorRT-LLM][WARNING] Default padding attention mask will be used as not all requests have cross attention mask.
[TensorRT-LLM][WARNING] CrossAttentionMask is not provided for the request. Default padding attention mask will be created.
[TensorRT-LLM][WARNING] CrossAttentionMask is not provided for the generation request. Full valid attentionMask will be used by default.
^CException ignored in: <module 'threading' from '/usr/lib/python3.10/threading.py'>
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1567, in _shutdown
    lock.acquire()
KeyboardInterrupt:
[TensorRT-LLM][WARNING] CrossAttentionMask is not provided for the generation request. Full valid attentionMask will be used by default.
[TensorRT-LLM][WARNING] CrossAttentionMask is not provided for the generation request. Full valid attentionMask will be used by default.
[TensorRT-LLM][WARNING] CrossAttentionMask is not provided for the generation request. Full valid attentionMask will be used by default.
...
To fix it, I eventually found out that I can pass --trt_py_session:
python3 run_server.py \
--port 9090 \
--backend tensorrt \
--trt_model_path "/app/TensorRT-LLM-examples/whisper/whisper_large-v3_float16" \
--trt_multilingual \
--max_clients 1 \
--max_connection_time 600 \
--trt_py_session
/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:61: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
import pynvml # type: ignore[import]
INFO:root:Custom model option was provided. Switching to single model mode.
INFO:websockets.server:connection open
INFO:root:New client connected
<frozen importlib._bootstrap_external>:1184: FutureWarning: The cuda.cuda module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.driver module instead.
<frozen importlib._bootstrap_external>:1184: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
[TensorRT-LLM] TensorRT-LLM version: 0.18.2
INFO:root:[INFO:] Warming up TensorRT engine..
/usr/local/lib/python3.10/dist-packages/torch/nested/__init__.py:228: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. We recommend specifying layout=torch.jagged when constructing a nested tensor, as this layout receives active development, has better operator coverage, and works with torch.compile. (Triggered internally at /pytorch/aten/src/ATen/NestedTensorImpl.cpp:178.)
return _nested.nested_tensor(
INFO:root:Running TensorRT backend.
INFO:root:[WhisperTensorRT:] Processing audio with duration: 0.512
INFO:root:[WhisperTensorRT:] Processing audio with duration: 0.512
INFO:root:[WhisperTensorRT:] Processing audio with duration: 0.768
INFO:root:[WhisperTensorRT:] Processing audio with duration: 0.768
...
INFO:root:[WhisperTensorRT:] Processing audio with duration: 2.41
INFO:root:Cleaning up.
ERROR:root:[ERROR]: Sending data to client: sent 1000 (OK); then received 1000 (OK)
INFO:root:Exiting speech to text thread
My test.py output:
For instance, if I am recalling an incident very
vividly I go back to the instant of its occurrence. I become
absent-minded, as you say. I jump back for a moment.
[ERROR] WebSocket Error: fin=1 opcode=8 data=b'\x03\xe8'
[INFO]: Websocket connection closed: None: None
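For what it's worth, that last WebSocket error seems to be the normal shutdown handshake rather than a crash: opcode 8 is a close frame, and the two payload bytes decode to close code 1000 ("normal closure"), which lines up with the server's "sent 1000 (OK); then received 1000 (OK)" line. Quick check in plain Python (nothing WhisperLive-specific):

import struct

# Close-frame payload from the client log; per RFC 6455 the first two bytes
# are the close code in network byte order.
(code,) = struct.unpack("!H", b"\x03\xe8")
print(code)  # -> 1000, i.e. normal closure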
I would be curious to learn how this works, as there are a lot of errors, the C++ version doesn't work, and there are a ton of "package is deprecated" warnings throughout the whole process.