[Bug]: When transcribing Chinese audio in batches with Whisper, some items are not recognized at all, or the transcribed content is completely wrong.
Your current environment
When running the script below to transcribe Chinese audio, there are always some items in a batch that are not recognized at all, or whose transcribed content is completely wrong.
```python
import time

import librosa  # Make sure to install librosa if you haven't already
import torch.distributed as dist
from vllm import LLM, SamplingParams


# Load audio from a local file, keeping its original sample rate.
def load_audio(file_path):
    audio, sample_rate = librosa.load(file_path, sr=None)
    return audio, sample_rate


# Create a Whisper encoder/decoder model instance.
llm = LLM(
    model="/tmp/pycharm123/model/whisper-large-v3",
    max_model_len=448,
    max_num_seqs=400,
    limit_mm_per_prompt={"audio": 1},
    kv_cache_dtype="fp8",
)

# Load your local audio files.
mary_had_lamb_audio, mary_had_lamb_sr = load_audio("./zhibo.wav")
winning_call_audio, winning_call_sr = load_audio("./test_16k_15s.wav")
print(mary_had_lamb_audio)

prompts = [
    {
        "prompt": "<|startoftranscript|>中文",
        "multi_modal_data": {
            "audio": (mary_had_lamb_audio, mary_had_lamb_sr),
        },
    }
]

# Create a sampling params object.
sampling_params = SamplingParams(
    temperature=0,
    top_p=1.0,
    max_tokens=200,
)

start = time.time()

# Generate output tokens from the prompts. The output is a list of
# RequestOutput objects that contain the prompt, generated text,
# and other information.
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    encoder_prompt = output.encoder_prompt
    generated_text = output.outputs[0].text
    print(f"Encoder prompt: {encoder_prompt}, \n"
          f"Decoder prompt: {prompt}, \n"
          f"Generated text: {generated_text}\n")

duration = time.time() - start
print("Duration:", duration)
print("RPS:", len(prompts) / duration)

dist.destroy_process_group()
```
🐛 Describe the bug
When transcribing Chinese audio with this method, some items in a batch are not recognized at all, or the transcribed content is completely wrong. The full reproduction script is included under "Your current environment" above.
Before submitting a new issue...
- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
Please use ``` for code blocks so they are formatted correctly, making the issue more readable
```python
mary_had_lamb_audio, mary_had_lamb_sr = load_audio("./zhibo.wav")
winning_call_audio, winning_call_sr = load_audio("./test_16k_15s.wav")
print(mary_had_lamb_audio)

prompts = [
    {
        "prompt": "<|startoftranscript|>中文",
        "multi_modal_data": {
            "audio": (mary_had_lamb_audio, mary_had_lamb_sr),
        },
    }
]
```
Is the prompt OK?
{ "prompt": "<|startoftranscript|>", "multi_modal_data": { "audio": (mary_had_lamb_audio, mary_had_lamb_sr), } remove "中文"
If I need to transcribe and recognize Chinese audio content, what should I do? Additionally, how can I obtain timestamp information?
Timestamps are currently not supported. In your code, just remove "中文" from the prompt; the output should then be in Chinese, without punctuation.
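For reference, a minimal sketch of the corrected request following the suggestion above. The model path and audio file are the placeholders from the original script; other settings are reduced to the essentials:

```python
# Minimal sketch of the suggested fix: drop "中文" from the decoder prompt
# and let Whisper detect the language from the audio itself.
# Model path and audio file are placeholders taken from the script above.
import librosa
from vllm import LLM, SamplingParams

llm = LLM(
    model="/tmp/pycharm123/model/whisper-large-v3",
    max_model_len=448,
    limit_mm_per_prompt={"audio": 1},
)

# Load the audio with its original sample rate.
audio, sr = librosa.load("./zhibo.wav", sr=None)

prompts = [
    {
        # "<|startoftranscript|>" only -- no "中文" suffix.
        "prompt": "<|startoftranscript|>",
        "multi_modal_data": {"audio": (audio, sr)},
    }
]

outputs = llm.generate(prompts, SamplingParams(temperature=0, max_tokens=200))
print(outputs[0].outputs[0].text)
```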