[Bug]: When transcribing Chinese audio in batches with Whisper, some items are not recognized at all, or the transcribed content is completely wrong.
Your current environment
When running the script below to transcribe Chinese audio, there are always some items in a batch that are not recognized at all, or whose transcribed content is completely wrong.
```python
import time

import librosa  # Make sure to install librosa if you haven't already
import torch.distributed as dist
from vllm import LLM, SamplingParams


# Load audio from a local file, keeping its original sample rate.
def load_audio(file_path):
    audio, sample_rate = librosa.load(file_path, sr=None)
    return audio, sample_rate


# Create a Whisper encoder/decoder model instance.
llm = LLM(
    model="/tmp/pycharm123/model/whisper-large-v3",
    max_model_len=448,
    max_num_seqs=400,
    limit_mm_per_prompt={"audio": 1},
    kv_cache_dtype="fp8",
)

# Load your local audio files.
mary_had_lamb_audio, mary_had_lamb_sr = load_audio("./zhibo.wav")
winning_call_audio, winning_call_sr = load_audio("./test_16k_15s.wav")
print(mary_had_lamb_audio)

prompts = [
    {
        "prompt": "<|startoftranscript|>中文",
        "multi_modal_data": {
            "audio": (mary_had_lamb_audio, mary_had_lamb_sr),
        },
    }
]

# Create a sampling params object.
sampling_params = SamplingParams(
    temperature=0,
    top_p=1.0,
    max_tokens=200,
)

start = time.time()

# Generate output tokens from the prompts. The output is a list of
# RequestOutput objects that contain the prompt, generated text,
# and other information.
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    encoder_prompt = output.encoder_prompt
    generated_text = output.outputs[0].text
    print(f"Encoder prompt: {encoder_prompt}, \n"
          f"Decoder prompt: {prompt}, \n"
          f"Generated text: {generated_text}\n")

duration = time.time() - start
print("Duration:", duration)
print("RPS:", len(prompts) / duration)

dist.destroy_process_group()
```
🐛 Describe the bug
When transcribing Chinese audio with this method, some items in a batch are not recognized at all, or the transcribed content is completely wrong. The full reproduction script is included under "Your current environment" above.
Before submitting a new issue...
- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
Please use ``` for code blocks so they are formatted correctly, making the issue more readable
```python
mary_had_lamb_audio, mary_had_lamb_sr = load_audio("./zhibo.wav")
winning_call_audio, winning_call_sr = load_audio("./test_16k_15s.wav")
print(mary_had_lamb_audio)

prompts = [
    {
        "prompt": "<|startoftranscript|>中文",
        "multi_modal_data": {
            "audio": (mary_had_lamb_audio, mary_had_lamb_sr),
        },
    }
]
```
Is the prompt OK?
{ "prompt": "<|startoftranscript|>", "multi_modal_data": { "audio": (mary_had_lamb_audio, mary_had_lamb_sr), } remove "中文"
If I need to transcribe and recognize Chinese audio content, what should I do? Additionally, how can I obtain timestamp information?
Timestamps are currently not supported. In your code, just remove "中文" from the prompt; the output should then be in Chinese, without punctuation.
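For reference, a minimal sketch of the corrected request following the suggestion above. The model path and audio file are the placeholders from the original script; other settings are reduced to the essentials:

```python
# Minimal sketch of the suggested fix: drop "中文" from the decoder prompt
# and let Whisper detect the language from the audio itself.
# Model path and audio file are placeholders taken from the script above.
import librosa
from vllm import LLM, SamplingParams

llm = LLM(
    model="/tmp/pycharm123/model/whisper-large-v3",
    max_model_len=448,
    limit_mm_per_prompt={"audio": 1},
)

# Load the audio with its original sample rate.
audio, sr = librosa.load("./zhibo.wav", sr=None)

prompts = [
    {
        # "<|startoftranscript|>" only -- no "中文" suffix.
        "prompt": "<|startoftranscript|>",
        "multi_modal_data": {"audio": (audio, sr)},
    }
]

outputs = llm.generate(prompts, SamplingParams(temperature=0, max_tokens=200))
print(outputs[0].outputs[0].text)
```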