💡 [REQUEST] - Simultaneous multimodal inputs
起始日期 | Start Date
No response
实现PR | Implementation PR
Can it already generate outputs if audio and video are provided at the same time? I have tried it, and it always returns the result for the visual prompt, ignoring the audio part completely.
This is useful if you want to get the summary of a video with spoken content, which is very common. Enabling this ensures that one forward pass is sufficient to obtain both the audio and visual summary together. Is the model capable of/trained for this as well?
相关Issues | Reference Issues
No response
摘要 | Summary
Can it already generate outputs if audio and video are provided at the same time? I have tried it, and it always returns the result for the visual prompt, ignoring the audio part completely.
基本示例 | Basic Example
For example, with vLLM this would look like:
audio_placeholder = "(<audio>./</audio>)" * 1
video_placeholder = "(<video>./</video>)" * 1
multimodal_prompt = "Use transcription and overall acoustic and visual information to write a concise summary of the input containing spoken content."
msgs = [{'role': 'user', 'content': f'{audio_placeholder}{video_placeholder}\n{multimodal_prompt}'}]
prompt = tokenizer.apply_chat_template(
msgs,
tokenize=False,
add_generation_prompt=True
)
input_data = {
"prompt": prompt,
"multi_modal_data": {
"video": video_part,
"audio":(audio_part, 16000),
}
}
res = llm.generate(input_data, sampling_params=sampling_params)
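For reference, the generated text can then be read from the returned list in the same way as in the later examples in this thread (a minimal sketch, assuming res is the list returned by llm.generate):
print(res[0].outputs[0].text)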
缺陷 | Drawbacks
I could not see any drawbacks of the proposed method :)
未解决问题 | Unresolved questions
No response
@Cuiunbo @bokesyo and others
Yes absolutely you can, but you should consider using omni mode. You can refer to this code:
https://github.com/OpenBMB/MiniCPM-o?tab=readme-ov-file#multimodal-live-streaming
import math
import numpy as np
from PIL import Image
from moviepy.editor import VideoFileClip
import tempfile
import librosa
import soundfile as sf
import torch
from transformers import AutoModel, AutoTokenizer
def get_video_chunk_content(video_path, flatten=True):
video = VideoFileClip(video_path)
print('video_duration:', video.duration)
with tempfile.NamedTemporaryFile(suffix=".wav", delete=True) as temp_audio_file:
temp_audio_file_path = temp_audio_file.name
video.audio.write_audiofile(temp_audio_file_path, codec="pcm_s16le", fps=16000)
audio_np, sr = librosa.load(temp_audio_file_path, sr=16000, mono=True)
num_units = math.ceil(video.duration)
# 1 frame + 1s audio chunk
contents= []
for i in range(num_units):
frame = video.get_frame(i+1)
image = Image.fromarray((frame).astype(np.uint8))
audio = audio_np[sr*i:sr*(i+1)]
if flatten:
contents.extend(["<unit>", image, audio])
else:
contents.append(["<unit>", image, audio])
return contents
model = AutoModel.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True,
attn_implementation='sdpa', torch_dtype=torch.bfloat16)
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)
model.init_tts()
# If you are using an older version of PyTorch, you might encounter this issue "weight_norm_fwd_first_dim_kernel" not implemented for 'BFloat16', Please convert the TTS to float32 type.
# model.tts.float()
# https://huggingface.co/openbmb/MiniCPM-o-2_6/blob/main/assets/Skiing.mp4
video_path="assets/Skiing.mp4"
sys_msg = model.get_sys_prompt(mode='omni', language='en')
# if use voice clone prompt, please set ref_audio
# ref_audio_path = '/path/to/ref_audio'
# ref_audio, _ = librosa.load(ref_audio_path, sr=16000, mono=True)
# sys_msg = model.get_sys_prompt(ref_audio=ref_audio, mode='omni', language='en')
contents = get_video_chunk_content(video_path)
msg = {"role":"user", "content": contents}
msgs = [sys_msg, msg]
# please set generate_audio=True and output_audio_path to save the tts result
generate_audio = True
output_audio_path = 'output.wav'
res = model.chat(
msgs=msgs,
tokenizer=tokenizer,
sampling=True,
temperature=0.5,
max_new_tokens=4096,
omni_input=True, # please set omni_input=True when omni inference
use_tts_template=True,
generate_audio=generate_audio,
output_audio_path=output_audio_path,
max_slice_nums=1,
use_image_id=False,
return_dict=True
)
print(res)
In MiniCPM-o 2.6, video and audio input follow a pattern called time-division multiplexing (TDM): the video and audio streams are first divided into 1-second segments, and each 1-second video segment is placed together with its corresponding 1-second audio segment to form a 1-second unit. Multiple 1-second units are then concatenated to form the final input, instead of feeding an N-second video followed by an N-second audio.
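As an illustration (a sketch only; the variable names are placeholders), the flattened contents list built by get_video_chunk_content above for a 3-second clip is interleaved per second rather than grouped by modality:
contents = [
    "<unit>", frame_0, audio_0s_to_1s,  # unit for second 0: 1 frame + 1 s of audio
    "<unit>", frame_1, audio_1s_to_2s,  # unit for second 1
    "<unit>", frame_2, audio_2s_to_3s,  # unit for second 2
]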
Hope this answers your question!
Thank you very much!
I am looking for its support in vLLM; where can one get the equivalent of model.get_sys_prompt(mode='omni', language='en')? The LLM() class does not support it.
@HwwwwwwwH Is there a doc for the OmniStreaming mode for vLLM?
Also, in omni mode, some instructions can be challenging to follow (our training data includes only user inputs that simultaneously contain audio, audio and text, audio and vision, or vision and text). Therefore, you should consider converting your text instructions into speech, may this help ~
Here's an example for vLLM with different modality inputs in offline inference:
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer
from PIL import Image
import librosa  # used by process_audio below
from decord import VideoReader, cpu  # used by process_video below
model_name = "openbmb/MiniCPM-o-2_6"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
llm = LLM(
model=model_name,
max_model_len=4096,
max_num_seqs=2,
trust_remote_code=True,
disable_mm_preprocessor_cache=True,
limit_mm_per_prompt={"image": 18} # decrease it if OOM
)
stop_tokens = ['<|im_end|>', '<|endoftext|>']
stop_token_ids = [tokenizer.convert_tokens_to_ids(i) for i in stop_tokens]
sampling_params = SamplingParams(top_p=0.8,
top_k=100,
temperature=0.7,
max_tokens=512,
stop_token_ids=stop_token_ids)
image_pattern = "(<image>./</image>)"
video_pattern = "(<video>./</video>)"
audio_pattern = "(<audio>./</audio>)"
# the num of patterns should be the same as num of multimodal inputs
# single_image
def process_image(image_path):
image = Image.open(image_path).convert("RGB")
messages = [{
'role': 'user',
'content': f'{image_pattern}\nPlease describe this image in detail.'
}]
prompt = tokenizer.apply_chat_template(messages,
tokenize=False,
add_generation_prompt=True)
outputs = llm.generate({
"prompt": prompt,
"multi_modal_data": {
"image": image
# [image, image] for multiple images
}
}, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
# single_video
def process_video(video_path, num_frames=18):
def uniform_sample(l, n):
gap = len(l) / n
idxs = [int(i * gap + gap / 2) for i in range(n)]
return [l[i] for i in idxs]
vr = VideoReader(video_path, ctx=cpu(0))
frame_idx = [i for i in range(0, len(vr))]
if len(frame_idx) > num_frames:
frame_idx = uniform_sample(frame_idx, num_frames)
frames = vr.get_batch(frame_idx).asnumpy()
frames = [Image.fromarray(v.astype('uint8')) for v in frames]
messages = [{
'role': 'user',
'content': f'{video_pattern}\nPlease describe this video in detail.'
}]
prompt = tokenizer.apply_chat_template(messages,
tokenize=False,
add_generation_prompt=True)
outputs = llm.generate({
"prompt": prompt,
"multi_modal_data": {
"video": frames
# [frames, frames] for multiple videos
}
}, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
def process_audio(audio_path):
    audio_input, _ = librosa.load(audio_path, sr=16000, mono=True)  # load at 16 kHz, consistent with the other examples in this thread
messages = [{
'role': 'user',
'content': f"{audio_pattern}\nPlease repeat each user's speech, including voice style and speech content."
}]
prompt = tokenizer.apply_chat_template(messages,
tokenize=False,
add_generation_prompt=True,
chat_template=audio_chat_template)
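    # audio_chat_template is the modified chat template taken from modeling_minicpmo.py (see the notes after this example)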
outputs = llm.generate([
{
"prompt": prompt,
"multi_modal_data": {
"audio": audio_input
# [audio_input, audio_input] for multiple audios
}
}
], sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
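A hedged usage sketch of the three helpers above (the file paths are placeholders, not files shipped with the repo):
process_image("example.jpg")
process_video("example.mp4", num_frames=18)
process_audio("example.wav")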
And there are a few things worth noticing:
- vLLM has not supported streaming inputs yet.
- If you're using `MiniCPM-o-2_6` for audio or omni inputs, you need to change `chat_template`; you can find it in `modeling_minicpmo.py`. Because it's too long, I chose not to paste it here (see the sketch after this list).
- If you're using omni mode with interleaved inputs, you need to split the inputs by yourself.
- `video_pattern` and `audio_pattern` in the example mean the placeholder for one single input.
- There might be slight differences from HF outputs, because it's difficult to make it completely compatible with vLLM.
- For using the vLLM server, you can still refer to the above modifications.
- I'll keep focusing on your problem; feel free to ask anything when you meet any problem with `MiniCPM-o-2_6` and vLLM.
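As mentioned in the notes, here is a minimal sketch of how the template override might be wired up (an assumption; the actual template string has to be copied from modeling_minicpmo.py in the model repo and is not reproduced here):
audio_chat_template = "..."  # paste the audio/omni chat template from modeling_minicpmo.py here
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    chat_template=audio_chat_template,  # overrides the tokenizer's default text-only template
)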
In omni mode, some instructions can be challenging to follow; you should consider converting your text instructions into speech, may this help ~
Okay, that is a nice idea. I have a question though: how can I provide this audio prompt? Can I just add it next to the audio_video_placeholder?
audio_video_placeholder = "[(<video>./</video>)(<audio>./</audio>)]" * 1
msgs = [{'role': 'user', 'content': f'{audio_video_placeholder}\n{librosa.load("audio_prompt.wav", sr=16000, mono=True)[0]}'}]
I tried omni mode with transformers, as vLLM does not support streaming multimodal input yet.
Here is the snippet:
video_path="path-to-video"
sys_msg = model.get_sys_prompt(mode='omni', language='en')
ref_audio_path = 'path-to-sys-prompt'#Audio saying "what is the spoken content of the video?"
ref_audio, _ = librosa.load(ref_audio_path, sr=16000, mono=True)
sys_msg = model.get_sys_prompt(ref_audio=ref_audio, mode='omni', language='en')
contents = get_video_chunk_content(video_path)# same as what you have provided
msg = {"role":"user", "content": contents}
msgs = [sys_msg, msg]
generate_audio = False #I just need text output
res = model.chat(
msgs=msgs,
tokenizer=tokenizer,
sampling=True,
temperature=0.5,
max_new_tokens=8192,
omni_input=True,
generate_audio=generate_audio,
max_slice_nums=1,
use_image_id=False,
return_dict=True
)
print(res)
What I am getting is a pure visual summary, even though both image and audio are passed via contents in a TDM fashion.
Changes to the reference prompt/removing it did not make a difference. The video is pretty straightforward and works well in audio-only mode.
Any suggestions @Cuiunbo @bokesyo ?
How can I provide this audio prompt? Can I just add it next to the audio_video_placeholder?
Emmm, If you use vllm, inference might look like this:
# frames = [[frame], [frame], [frame], ...] 1 frame for 1 second
# audios = [audio_frame, audio_frame, audio_frame, ...]
messages = [{
    'role': 'user',
    # one audio placeholder + one video placeholder per 1-second unit
    'content': (audio_pattern + video_pattern) * len(frames) + "\nPlease describe this video."
}]
prompt = tokenizer.apply_chat_template(messages,
tokenize=False,
add_generation_prompt=True,
chat_template=audio_chat_template)
outputs = llm.generate([
{
"prompt": prompt,
"multi_modal_data": {
"video": frames,
# "video": [[frame], [frame], [frame]]
# because 1 video consists of multiple frames,
        # here it's as if we split 1 video into multiple videos, each consisting of 1 frame.
"audio": audios
}
}
], sampling_params=sampling_params)
hey @bokesyo, I've a question wrt your previous comment.
note: we're not using vLLM, we're using transformers.
- do we have to use `<unit>` even when streaming for speech-to-speech? currently we're dividing incoming audio samples into 1-second chunks (16,000 samples) and calling `streaming_prefill` iteratively, but we are not using the `<unit>` thing in the content list.
- do we need to use `omni` mode even if we care about just text/speech to text/speech? no need of video yet.
here's our current approach btw:
SAMPLE_RATE = 16_000
for chunk_start in range(0, len(audio_samples), SAMPLE_RATE):
chunk = audio_samples[chunk_start : chunk_start + SAMPLE_RATE]
if chunk.size < SAMPLE_RATE:
chunk = np.pad(
chunk, (0, SAMPLE_RATE - chunk.size), mode="constant"
)
msgs = [{"role": data["role"], "content": [chunk]}]
if self.is_interrupted():
logger.info("prefill interrupted")
return
logger.debug("birajlog prefilling audio chunk")
self.model.streaming_prefill(
session_id=self.session_id,
msgs=msgs,
tokenizer=self._tokenizer,
)
Emmm, If you use vllm, inference might look like this: (the interleaved audio/video example above)
Thank you for your suggestion. This is indeed much faster than directly inputting video. By the way, I would like to ask when voice output can be implemented on vllm.
Did you get it working? With vLLM, I have audio and image lists (1 frame and 1 second of audio per unit), with the prompt as mentioned above.
Then I am getting an initialization error when calling LLM() with limit_mm_per_prompt = {"image": 100, "audio": 100}; resetting to the default did not help either.
[rank0]: AssertionError: The processed dummy data has a total of {'image': 66440, 'video': 66, 'audio': 81000} placeholder tokens, which is not the expected {'image': 66800, 'video': 66, 'audio': 81000} tokens.
I have not tried the case where image and audio are input at the same time, but if it involves segmented video and audio as simultaneous input, you can refer to the following code:
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer
import time
import numpy as np
from PIL import Image
import math
import tempfile
import librosa
from moviepy.editor import VideoFileClip
def get_video_chunk_content(video_path):
video = VideoFileClip(video_path)
print('video_duration:', video.duration)
with tempfile.NamedTemporaryFile(suffix=".wav", delete=True) as temp_audio_file:
temp_audio_file_path = temp_audio_file.name
video.audio.write_audiofile(temp_audio_file_path, codec="pcm_s16le", fps=16000)
audio_np, sr = librosa.load(temp_audio_file_path, sr=16000, mono=True)
num_units = math.ceil(video.duration)
# 1 frame + 1s audio chunk
frames = []
audios = []
for i in range(num_units):
frame = video.get_frame(i+1)
image = Image.fromarray((frame).astype(np.uint8))
audio = audio_np[sr*i:sr*(i+1)]
        frames.append([image])  # append the PIL image (one 1-frame "video" per second)
audios.append(audio)
return frames, audios
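Putting it together with the interleaved vLLM prompt shown earlier in this thread (a sketch; the video path is a placeholder, and audio_pattern, video_pattern, tokenizer, llm, audio_chat_template, and sampling_params are assumed to be defined as in the earlier example):
frames, audios = get_video_chunk_content("video.mp4")
messages = [{
    'role': 'user',
    'content': (audio_pattern + video_pattern) * len(frames) + "\nPlease describe this video."
}]
prompt = tokenizer.apply_chat_template(messages,
                                        tokenize=False,
                                        add_generation_prompt=True,
                                        chat_template=audio_chat_template)
outputs = llm.generate([{
    "prompt": prompt,
    "multi_modal_data": {
        "video": frames,   # one 1-frame "video" per second
        "audio": audios,   # one 1-second audio chunk per second
    }
}], sampling_params=sampling_params)
print(outputs[0].outputs[0].text)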
Thanks. It runs but does not properly analyze anything from the audio part.
Sorry for the late reply! I'll check it for you tomorrow.
Thanks, I will wait for it!
I also tried with audio prompts instead of text, as per @Cuiunbo's suggestion, but the model (vLLM/transformers) failed to summarize the audio part or use it for analysis. And with audio prompts the output defaults to Mandarin and only contains visual information. @bokesyo Was the model trained with audio-visual pairs for creating summaries?
hey @bokesyo, i've a question wrt your previous comment. note: we're not using vLLM, we're using transformers.
- do we have to use `<unit>` even when streaming for speech-to-speech?
- do we need to use `omni` mode even if we care about just text/speech to text/speech?
- No, we don't need to use `<unit>` in speech-to-speech.
- No, we don't need to use `omni` mode.
Hope it helps!
I think the common case is that the video the user provides includes video/audio/text (subtitles) information, and the question (prompt) is provided separately; the question (prompt) should not be in the video. It would be better if omni mode could support a prompt alongside the multiple 1-second units.
Try this code. I modified get_video_chunk_content to get_video_audio_chunk_content, which now accepts video and audio inputs separately.
I asked the model what animal was in the video, and it answered correctly.
from moviepy.editor import AudioFileClip  # in addition to the imports earlier in the thread

def get_video_audio_chunk_content(video_path, audio_path, flatten=True):
    # load the video
    video = VideoFileClip(video_path)
    print('video_duration:', video.duration)
    # load the audio
    audio_clip = AudioFileClip(audio_path)
    # convert the audio to a numpy array
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=True) as temp_audio_file:
        temp_audio_file_path = temp_audio_file.name
        audio_clip.write_audiofile(temp_audio_file_path, codec="pcm_s16le", fps=16000)
        audio_np, sr = librosa.load(temp_audio_file_path, sr=16000, mono=True)
    # num_units = math.ceil(video.duration)
    num_units = int(max(video.duration, audio_clip.duration))  # take the longer of audio/video; the shorter part is padded with zeros
    # 1 frame + 1s audio chunk
    contents = []
    for i in range(num_units):
        frame = video.get_frame(i+1)  # the frame at second i+1
        image = Image.fromarray(frame.astype(np.uint8))  # convert to a PIL image
        audio = audio_np[sr*i:sr*(i+1)]  # the corresponding 1-second audio slice
        if flatten:
            contents.extend(["<unit>", image, audio])
        else:
            contents.append(["<unit>", image, audio])
    return contents
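A possible way to call it with the transformers omni path from earlier in this thread (a sketch; the file paths are placeholders, and the spoken question is supplied as the separate audio file):
contents = get_video_audio_chunk_content("video.mp4", "spoken_question.wav")
sys_msg = model.get_sys_prompt(mode='omni', language='en')
msgs = [sys_msg, {"role": "user", "content": contents}]
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    temperature=0.5,
    max_new_tokens=4096,
    omni_input=True,       # required for omni inference
    generate_audio=False,  # text-only output
    max_slice_nums=1,
    use_image_id=False,
    return_dict=True,
)
print(res)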
Your question "what animal was in the video" is in audio format, that is, the audio file is your question? Can I specify the question (prompt) in text format?
It is very likely that the visual features themselves reveal that it is a particular animal. What if only an audio cue is available to identify the animal, e.g., a dog barking in the background? What I see with multimodal input, both in vLLM and transformers, is that it runs but does not properly analyze the audio part. If someone has tested it, @bokesyo @HwwwwwwwH, please let me know.
Can I specify the question (prompt) in text format?
Yes you can, but as I mentioned, the audio part is not analysed properly/not analysed at all in my experiments.
Your question "what animal was in the video" is in audio format, that is, the audio file is your question?
Yes, I provide a video file and an audio file. Audio file: "What animal was in the video?" Video file: a dog playing.
model output: there is a dog
@caijimin
get_video_audio_chunk_content(video_path, audio_path, flatten=True): I only have a single video file and don't know what to pass as the second argument. Also, please give an example of inference through the vLLM server, for example (hypothetical):
contents = get_video_audio_chunk_content("t.mp4")
messages = [{
    'role': 'user',
    'content': f"{contents}\n Please describe this video."
}]
chat_response = client.chat.completions.create(
    model="model",
    messages=messages,
    extra_body={
        "stop_token_ids": [151645, 151643]
    }
)
print("Chat response content:", chat_response.choices[0].message.content)
I switched to this instead:
def get_video_chunk_content(video_path, flatten=True):
    video = VideoFileClip(video_path)
    print('video_duration:', video.duration)
with tempfile.NamedTemporaryFile(suffix=".wav", delete=True) as temp_audio_file:
temp_audio_file_path = temp_audio_file.name
video.audio.write_audiofile(temp_audio_file_path, codec="pcm_s16le", fps=16000)
audio_np, sr = librosa.load(temp_audio_file_path, sr=16000, mono=True)
num_units = math.ceil(5)
# 1 frame + 1s audio chunk
contents = []
for i in range(num_units):
frame = video.get_frame(i + 1)
image = Image.fromarray((frame).astype(np.uint8))
audio = audio_np[sr * i:sr * (i + 1)]
if flatten:
contents.extend(["<unit>", image, audio])
else:
contents.append(["<unit>", image, audio])
return contents
It throws this error: TypeError: Object of type Image is not JSON serializable
If I instead use:
data = {
    "model": "model",
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant.",
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Please describe this video"},
                {
                    "type": "video_url",
                    "video_url": {
                        "url": f"data:video/mp4;base64,{video_base64}",
                    },
                },
            ],
        },
    ],
    "max_tokens": 200,
    "temperature": 0,
    "stop_token_ids": [151645, 151643]
}
vLLM just hangs with no response.
This issue has been without new discussion for quite some time, so I'm closing it now. If you have any questions, please feel free to open a new issue to discuss them.