
💡 [REQUEST] - Simultaneous multimodal inputs

Open Jiltseb opened this issue 10 months ago • 25 comments

Start Date

No response

Implementation PR

Can it already generate outputs if audio and video are provided at the same time? I have tried it, and it always returns the result for the visual prompt, ignoring the audio part completely.

This is useful if you want a summary of a video with spoken content, which is very common. Enabling this would make a single forward pass sufficient to get both the audio and the visual summary together. Is the model capable of/trained for this as well?

Reference Issues

No response

Summary

Can it already generate outputs if audio and video are provided at the same time? I have tried it, and it always returns the result for the visual prompt, ignoring the audio part completely.

Basic Example

For example, with vLLM this would look like this:

    audio_placeholder = "(<audio>./</audio>)" * 1
    video_placeholder = "(<video>./</video>)" * 1
    multimodal_prompt = "Use transcription and overall acoustic and visual information to write a concise summary of the input containing spoken content."
    msgs = [{'role': 'user', 'content': f'{audio_placeholder}{video_placeholder}\n{multimodal_prompt}'}]

    prompt = tokenizer.apply_chat_template(
        msgs,
        tokenize=False,
        add_generation_prompt=True
    )

    input_data = {
        "prompt": prompt,
        "multi_modal_data": {
            "video": video_part,           # list of video frames (loaded elsewhere)
            "audio": (audio_part, 16000),  # 16 kHz audio array (loaded elsewhere)
        }
    }
    res = llm.generate(input_data, sampling_params=sampling_params)

Drawbacks

I could not see any drawbacks of the proposed method :)

Unresolved Questions

No response

Jiltseb avatar Feb 20 '25 13:02 Jiltseb

@Cuiunbo @bokesyo and others

Jiltseb avatar Feb 20 '25 13:02 Jiltseb

Yes absolutely you can, but you should consider using omni mode. You can refer to this code:

https://github.com/OpenBMB/MiniCPM-o?tab=readme-ov-file#multimodal-live-streaming

import math
import numpy as np
from PIL import Image
from moviepy.editor import VideoFileClip
import tempfile
import librosa
import soundfile as sf
import torch
from transformers import AutoModel, AutoTokenizer

def get_video_chunk_content(video_path, flatten=True):
    video = VideoFileClip(video_path)
    print('video_duration:', video.duration)
    
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=True) as temp_audio_file:
        temp_audio_file_path = temp_audio_file.name
        video.audio.write_audiofile(temp_audio_file_path, codec="pcm_s16le", fps=16000)
        audio_np, sr = librosa.load(temp_audio_file_path, sr=16000, mono=True)
    num_units = math.ceil(video.duration)
    
    # 1 frame + 1s audio chunk
    contents= []
    for i in range(num_units):
        frame = video.get_frame(i+1)
        image = Image.fromarray((frame).astype(np.uint8))
        audio = audio_np[sr*i:sr*(i+1)]
        if flatten:
            contents.extend(["<unit>", image, audio])
        else:
            contents.append(["<unit>", image, audio])
    
    return contents


model = AutoModel.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16)
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)

model.init_tts()

# If you are using an older version of PyTorch, you might hit the error: "weight_norm_fwd_first_dim_kernel" not implemented for 'BFloat16'. In that case, convert the TTS module to float32:
# model.tts.float()

# https://huggingface.co/openbmb/MiniCPM-o-2_6/blob/main/assets/Skiing.mp4
video_path="assets/Skiing.mp4"
sys_msg = model.get_sys_prompt(mode='omni', language='en')
# if use voice clone prompt, please set ref_audio
# ref_audio_path = '/path/to/ref_audio'
# ref_audio, _ = librosa.load(ref_audio_path, sr=16000, mono=True)
# sys_msg = model.get_sys_prompt(ref_audio=ref_audio, mode='omni', language='en')

contents = get_video_chunk_content(video_path)
msg = {"role":"user", "content": contents}
msgs = [sys_msg, msg]

# please set generate_audio=True and output_audio_path to save the tts result
generate_audio = True
output_audio_path = 'output.wav'

res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    temperature=0.5,
    max_new_tokens=4096,
    omni_input=True, # please set omni_input=True when omni inference
    use_tts_template=True,
    generate_audio=generate_audio,
    output_audio_path=output_audio_path,
    max_slice_nums=1,
    use_image_id=False,
    return_dict=True
)
print(res)

In MiniCPM-o 2.6, video and audio input follow a pattern called time-division multiplexing (TDM): instead of feeding N seconds of video followed by N seconds of audio, the input is first divided into 1-second video and 1-second audio chunks. Each 1-second video chunk is placed together with its matching 1-second audio chunk to form a 1-second unit, and these units are concatenated to form the final input.
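
To make the TDM layout concrete, here is a self-contained sketch of the flattened content list that get_video_chunk_content(flatten=True) produces; the blank frames and silent audio below are placeholders, not real inputs:

# Illustration only: the flattened TDM layout ("<unit>", frame, 1 s of audio per second).
import numpy as np
from PIL import Image

num_units, sr = 3, 16000
dummy_frames = [Image.new("RGB", (448, 448)) for _ in range(num_units)]  # placeholder frames
dummy_audio = np.zeros(num_units * sr, dtype=np.float32)                 # placeholder (silent) audio

contents = []
for i in range(num_units):
    contents.extend(["<unit>", dummy_frames[i], dummy_audio[sr * i:sr * (i + 1)]])

# contents == ["<unit>", frame_0, audio_0, "<unit>", frame_1, audio_1, "<unit>", frame_2, audio_2]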

Hope this answers your question!

bokesyo avatar Feb 20 '25 13:02 bokesyo

Thank you very much! I am looking for support for this in vLLM. Where can one get the equivalent of model.get_sys_prompt(mode='omni', language='en')? The LLM() class does not support it.

Jiltseb avatar Feb 20 '25 13:02 Jiltseb

@HwwwwwwwH Is there a doc for the OmniStreaming mode for vLLM? Also, in omni mode some instructions can be challenging to follow (our training data only includes user inputs that simultaneously contain audio, audio and text, audio and vision, or vision and text). Therefore, you should consider converting your text instructions into speech; maybe this helps ~

Cuiunbo avatar Feb 20 '25 15:02 Cuiunbo

Here's an example for vLLM with different modalities inputs in offline inferencing:

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer
from PIL import Image
from decord import VideoReader, cpu  # needed by process_video below
import librosa  # needed by process_audio below

model_name = "openbmb/MiniCPM-o-2_6"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
llm = LLM(
    model=model_name, 
    max_model_len=4096,
    max_num_seqs=2,
    trust_remote_code=True,
    disable_mm_preprocessor_cache=True,
    limit_mm_per_prompt={"image": 18} # decrease it if OOM
)

stop_tokens = ['<|im_end|>', '<|endoftext|>']
stop_token_ids = [tokenizer.convert_tokens_to_ids(i) for i in stop_tokens]
sampling_params = SamplingParams(top_p=0.8,
                                 top_k=100,
                                 temperature=0.7, 
                                 max_tokens=512, 
                                 stop_token_ids=stop_token_ids)

image_pattern = "(<image>./</image>)"
video_pattern = "(<video>./</video>)"
audio_pattern = "(<audio>./</audio>)"
# the num of patterns should be the same as num of multimodal inputs 

# single_image
def process_image(image_path):
    image = Image.open(image_path).convert("RGB")
    messages = [{
        'role': 'user',
        'content': f'{image_pattern}\nPlease describe this image in detail.'
    }]
    prompt = tokenizer.apply_chat_template(messages,
                                            tokenize=False,
                                            add_generation_prompt=True)
    outputs = llm.generate({
        "prompt": prompt,
        "multi_modal_data": {
            "image": image
            # [image, image] for multiple images
        }
    }, sampling_params=sampling_params)
    print(outputs[0].outputs[0].text)


# single_video
def process_video(video_path, num_frames=18):
    def uniform_sample(l, n):
        gap = len(l) / n
        idxs = [int(i * gap + gap / 2) for i in range(n)]
        return [l[i] for i in idxs]
    vr = VideoReader(video_path, ctx=cpu(0))
    frame_idx = [i for i in range(0, len(vr))]
    if len(frame_idx) > num_frames:
        frame_idx = uniform_sample(frame_idx, num_frames)
    frames = vr.get_batch(frame_idx).asnumpy()
    frames = [Image.fromarray(v.astype('uint8')) for v in frames]
    messages = [{
        'role': 'user',
        'content': f'{video_pattern}\nPlease describe this video in detail.'
    }]
    prompt = tokenizer.apply_chat_template(messages,
                                            tokenize=False,
                                            add_generation_prompt=True)
    outputs = llm.generate({
        "prompt": prompt,
        "multi_modal_data": {
            "video": frames
            # [frames, frames] for multiple videos
        }
    }, sampling_params=sampling_params)
    print(outputs[0].outputs[0].text) 


def process_audio(audio_path):
    audio_input, _ = librosa.load(audio_path, sr=16000, mono=True)  # MiniCPM-o expects 16 kHz mono audio
    messages = [{
        'role': 'user',
        'content': f"{audio_pattern}\nPlease repeat each user's speech, including voice style and speech content."
    }]
    # audio_chat_template: the audio/omni chat template string copied from modeling_minicpmo.py (see the notes below)
    prompt = tokenizer.apply_chat_template(messages,
                                            tokenize=False,
                                            add_generation_prompt=True,
                                            chat_template=audio_chat_template)
    outputs = llm.generate([
    {
        "prompt": prompt,
        "multi_modal_data": {
            "audio": audio_input
            # [audio_input, audio_input] for multiple audios
        }
    }
    ], sampling_params=sampling_params)
    print(outputs[0].outputs[0].text)

And there are a few things worth noting:

  • vLLM does not support streaming inputs yet.
  • If you're using MiniCPM-o-2_6 for audio or omni inputs, you need to pass a different chat_template; you can find it in modeling_minicpmo.py. It is too long to paste here (see the hedged sketch after this list for one way to fetch that file).
  • If you're using omni mode with interleaved inputs, you need to split the inputs yourself. video_pattern and audio_pattern in the example are the placeholders for a single input each. There may be slight differences from the HF outputs, because it's difficult to make it completely compatible with vLLM.
  • For the vLLM server, you can still refer to the above modifications.
  • I'll keep following this issue; feel free to ask if you run into any problem with MiniCPM-o-2_6 and vLLM.
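
As a hedged sketch only (the repo and file name come from this thread; using huggingface_hub to fetch it is my assumption, not an official instruction), one way to get hold of modeling_minicpmo.py so the chat template string can be copied out:

# Hedged sketch: download modeling_minicpmo.py locally and copy the chat template
# string out of it; that string is then passed as `chat_template=` to apply_chat_template.
from huggingface_hub import hf_hub_download

template_file = hf_hub_download("openbmb/MiniCPM-o-2_6", "modeling_minicpmo.py")
print("Copy the chat template string from:", template_file)

# audio_chat_template = "..."  # paste the template string found in that file here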

HwwwwwwwH avatar Feb 20 '25 16:02 HwwwwwwwH

@HwwwwwwwH Is there a doc for the OmniStreaming mode for vLLM? Also, in omni mode some instructions can be challenging to follow (our training data only includes user inputs that simultaneously contain audio, audio and text, audio and vision, or vision and text). Therefore, you should consider converting your text instructions into speech; maybe this helps ~

Okay, that is a nice idea. I have a question though: how can I provide this audio prompt? Can I just add it next to the audio_video_placeholder?

    audio_video_placeholder = "[(<video>./</video>)(<audio>./</audio>)]" * 1
    msgs = [{'role': 'user', 'content': f'{audio_video_placeholder}\n{librosa.load('audio_prompt.wav', sr=16000, mono=True)[0]}'}]

Jiltseb avatar Feb 20 '25 17:02 Jiltseb

I tried Omni mode with transformers as vLLM does not have support for streaming multimodal yet.

Here is the snippet:

video_path="path-to-video"
sys_msg = model.get_sys_prompt(mode='omni', language='en')

ref_audio_path = 'path-to-sys-prompt'  # audio saying "what is the spoken content of the video?"
ref_audio, _ = librosa.load(ref_audio_path, sr=16000, mono=True)
sys_msg = model.get_sys_prompt(ref_audio=ref_audio, mode='omni', language='en')
contents = get_video_chunk_content(video_path)  # same as what you have provided
msg = {"role":"user", "content": contents}
msgs = [sys_msg, msg]

generate_audio = False #I just need text output
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    temperature=0.5,
    max_new_tokens=8192,
    omni_input=True, 
    generate_audio=generate_audio,
    max_slice_nums=1,
    use_image_id=False,
    return_dict=True
)
print(res)

What I am getting is a purely visual summary, even though both image and audio are passed via contents in a TDM fashion. Changing or removing the reference prompt did not make a difference. The video is pretty straightforward and works well in audio-only mode.

Any suggestions @Cuiunbo @bokesyo ?

Jiltseb avatar Feb 20 '25 20:02 Jiltseb

@HwwwwwwwH Is there a doc for the OmniStreaming mode for vLLM? Also, in omni mode some instructions can be challenging to follow (our training data only includes user inputs that simultaneously contain audio, audio and text, audio and vision, or vision and text). Therefore, you should consider converting your text instructions into speech; maybe this helps ~

Okay, that is a nice idea. I have a question though: how can I provide this audio prompt? Can I just add it next to the audio_video_placeholder?

    audio_video_placeholder = "[(<video>./</video>)(<audio>./</audio>)]" * 1
    msgs = [{'role': 'user', 'content': f'{audio_video_placeholder}\n{librosa.load('audio_prompt.wav', sr=16000, mono=True)[0]}'}]

Emmm, if you use vLLM, inference might look like this:

    # frames = [[frame], [frame], [frame], ...] 1 frame for 1 second
    # audios = [audio_frame, audio_frame, audio_frame, ...]
    messages = [{
        'role': 'user',
        'content': f"{[audio_pattern + video_pattern for i in range(len(frames))] }\n Please describe this video."
    }]
    prompt = tokenizer.apply_chat_template(messages,
                                            tokenize=False,
                                            add_generation_prompt=True,
                                            chat_template=audio_chat_template) 
    outputs = llm.generate([
    {
        "prompt": prompt,
        "multi_modal_data": {
            "video": frames, 
            # "video": [[frame], [frame], [frame]] 
            # because 1 video consists of multiple frames, 
            # here it's as if we split 1 video into multiple videos, each consisting of 1 frame.
            "audio": audios
        }
    }
    ], sampling_params=sampling_params)                                              
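
One caveat, flagged as my own reading rather than the author's: the f-string above embeds the Python list repr (brackets, quotes, and commas included) into the prompt text. If plain repeated placeholders are intended, a string join may be closer, as in this sketch (audio_pattern, video_pattern, and frames are assumed from the example above):

    # Hedged alternative: build the interleaved placeholders as a plain string
    # instead of embedding the list repr produced by the f-string above.
    placeholders = "".join(audio_pattern + video_pattern for _ in range(len(frames)))
    messages = [{
        'role': 'user',
        'content': f"{placeholders}\nPlease describe this video."
    }]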

HwwwwwwwH avatar Feb 21 '25 00:02 HwwwwwwwH

Hey @bokesyo, I've a question w.r.t. your previous comment.

Note: we're not using vLLM; we're using transformers.

  1. Do we have to use <unit> even when streaming for speech-to-speech? Currently we're dividing incoming audio samples into 1-second chunks (16,000 samples) and calling streaming_prefill iteratively, but we are not using the <unit> marker in the content list.

  2. Do we need to use omni mode even if we only care about text/speech to text/speech? We don't need video yet.

Here's our current approach, btw:

SAMPLE_RATE = 16_000
for chunk_start in range(0, len(audio_samples), SAMPLE_RATE):
    chunk = audio_samples[chunk_start : chunk_start + SAMPLE_RATE]
    if chunk.size < SAMPLE_RATE:
        chunk = np.pad(
            chunk, (0, SAMPLE_RATE - chunk.size), mode="constant"
        )

    msgs = [{"role": data["role"], "content": [chunk]}]
    if self.is_interrupted():
        logger.info("prefill interrupted")
        return

    logger.debug("birajlog prefilling audio chunk")
    self.model.streaming_prefill(
        session_id=self.session_id,
        msgs=msgs,
        tokenizer=self._tokenizer,
    )

biraj-outspeed avatar Feb 21 '25 04:02 biraj-outspeed

(Quoted: Cuiunbo's note on converting text instructions to speech, Jiltseb's follow-up question, and HwwwwwwwH's interleaved vLLM example above.)

Thank you for your suggestion. This is indeed much faster than directly inputting video. By the way, I would like to ask when voice output will be supported in vLLM.

WangVertex avatar Feb 21 '25 06:02 WangVertex

(Quoted: the same thread as above, ending with WangVertex's note that interleaved input is faster and his question about voice output in vLLM.)

Did you get it working? With vLLM, I have audio and image lists (1 frame and 1 second of audio per unit), and the prompt as mentioned above. Then I get an initialization error when calling LLM() with limit_mm_per_prompt = {"image": 100, "audio": 100}; resetting it to the default did not help either.

[rank0]: AssertionError: The processed dummy data has a total of {'image': 66440, 'video': 66, 'audio': 81000} placeholder tokens, which is not the expected {'image': 66800, 'video': 66, 'audio': 81000} tokens.

Jiltseb avatar Feb 21 '25 10:02 Jiltseb

(Quoted: the thread above, ending with Jiltseb's limit_mm_per_prompt question and the AssertionError about placeholder token counts.)

I have not tried the case where image and audio are input at the same time, but if it involves segmented video and audio being input together, you can refer to the following code:

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer
import time
import numpy as np
from PIL import Image
import math
import tempfile
import librosa
from moviepy.editor import VideoFileClip

def get_video_chunk_content(video_path):
    video = VideoFileClip(video_path)
    print('video_duration:', video.duration)
    
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=True) as temp_audio_file:
        temp_audio_file_path = temp_audio_file.name
        video.audio.write_audiofile(temp_audio_file_path, codec="pcm_s16le", fps=16000)
        audio_np, sr = librosa.load(temp_audio_file_path, sr=16000, mono=True)
    num_units = math.ceil(video.duration)
    
    # 1 frame + 1s audio chunk
    frames = []
    audios = []
    
    for i in range(num_units):
        frame = video.get_frame(i+1)
        image = Image.fromarray(frame.astype(np.uint8))
        audio = audio_np[sr*i:sr*(i+1)]
        frames.append([image])  # each "video" entry is a single PIL frame
        audios.append(audio)
        
    return frames, audios
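
A hedged usage sketch for feeding these per-second lists to vLLM, reusing the interleaved-placeholder idea from HwwwwwwwH's example above; llm, tokenizer, sampling_params, audio_chat_template, audio_pattern, and video_pattern are assumed to be set up as in that example, and the video path is just an illustration:

# Hedged sketch: one audio+video placeholder pair per 1-second unit,
# then the per-second frame/audio lists passed as multimodal data.
frames, audios = get_video_chunk_content("assets/Skiing.mp4")

placeholders = "".join(audio_pattern + video_pattern for _ in range(len(frames)))
messages = [{
    'role': 'user',
    'content': f"{placeholders}\nUse both the audio and the visuals to summarize this video."
}]
prompt = tokenizer.apply_chat_template(messages,
                                       tokenize=False,
                                       add_generation_prompt=True,
                                       chat_template=audio_chat_template)
outputs = llm.generate([{
    "prompt": prompt,
    "multi_modal_data": {"video": frames, "audio": audios},
}], sampling_params=sampling_params)
print(outputs[0].outputs[0].text)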

WangVertex avatar Feb 24 '25 02:02 WangVertex

Thanks. It runs but does not properly analyze anything from the audio part.

Jiltseb avatar Feb 24 '25 12:02 Jiltseb

Thanks. It runs but does not properly analyze anything from the audio part.

Sry for late! I'll check it for you tomorrow.

HwwwwwwwH avatar Feb 24 '25 12:02 HwwwwwwwH

Thanks. It runs but does not properly analyze anything from the audio part.

Sry for late! I'll check it for you tomorrow.

Thanks, I will wait for it!

I also tried with audio prompts instead of text as per @Cuiunbo's suggestion, but the model (vLLM/transformers) failed to summarize the audio part or use it for analysis. And with audio prompts the output defaults to Mandarin and only contains visual information. @bokesyo Was the model trained with audio-visual pairs for creating summaries?

Jiltseb avatar Feb 24 '25 13:02 Jiltseb

(Quoted: biraj-outspeed's two questions above, about whether <unit> is needed and whether omni mode is needed for speech-to-speech streaming.)

  1. No, we don't need to use <unit> in speech-to-speech.
  2. No, we don't need to use omni mode.

Hope it helps!

bokesyo avatar Feb 28 '25 07:02 bokesyo

I think the common case is that the user-provided video includes visual/audio/text (subtitle) information, while the question (prompt) is provided separately; the question should not have to be inside the video. It would be better if omni mode could support a prompt alongside the multiple 1-second units.

caijimin avatar Mar 04 '25 07:03 caijimin

Try this code. I modified get_video_chunk_content to get_video_audio_chunk_content, which now accepts video and audio inputs separately.

I asked the model what animal was in the video, and it answered correctly.

def get_video_audio_chunk_content(video_path, audio_path, flatten=True):
    # Load the video
    video = VideoFileClip(video_path)
    print('video_duration:', video.duration)

    # Load the audio
    audio_clip = AudioFileClip(audio_path)
    
    # Convert the audio to a numpy array
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=True) as temp_audio_file:
        temp_audio_file_path = temp_audio_file.name
        audio_clip.write_audiofile(temp_audio_file_path, codec="pcm_s16le", fps=16000)
        audio_np, sr = librosa.load(temp_audio_file_path, sr=16000, mono=True)

    # num_units = math.ceil(video.duration)
    num_units = int(max(video.duration, audio_clip.duration))    # use the longer of the audio/video durations; the shorter one is padded with zeros

    # 1 frame + 1s audio chunk
    contents = []
    for i in range(num_units):
        frame = video.get_frame(i+1)  # get the frame at second i+1
        image = Image.fromarray(frame.astype(np.uint8))  # convert to a PIL image
        audio = audio_np[sr*i:sr*(i+1)]  # the matching 1-second audio slice
        
        if flatten:
            contents.extend(["<unit>", image, audio])
        else:
            contents.append(["<unit>", image, audio])

    return contents
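
For context, a hedged usage sketch of this helper with the transformers chat API from earlier in the thread; model and tokenizer are assumed to be loaded as in bokesyo's example, the file paths are placeholders, and whether a trailing text question is reliably followed is exactly what this thread is debating:

# Hedged sketch: interleaved per-second units plus a plain-text question appended at the end.
contents = get_video_audio_chunk_content("video.mp4", "question.wav")
contents.append("What animal appears in the video, and what is said about it?")

sys_msg = model.get_sys_prompt(mode='omni', language='en')
msgs = [sys_msg, {"role": "user", "content": contents}]

res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    temperature=0.5,
    max_new_tokens=512,
    omni_input=True,
    generate_audio=False,  # text output only
    max_slice_nums=1,
    use_image_id=False,
    return_dict=True
)
print(res)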

lsy1973 avatar Mar 06 '25 03:03 lsy1973

Your question "what animal was in the video" is in audio format, i.e. the audio file itself is your question? Can I specify the question (prompt) in text format?

Try this code. I modified get_video_chunk_content to get_video_audio_chunk_content, which now accepts video and audio inputs separately.

I asked the model what animal was in the video, and it answered correctly.

caijimin avatar Mar 06 '25 03:03 caijimin

(Quoted: lsy1973's get_video_audio_chunk_content example above.)

It is very likely that the visual features themselves reveal that it is a particular animal. What if only an audio cue is available to identify the animal, e.g. a dog barking in the background? What I see with multimodal input, both in vLLM and transformers, is that it runs but does not properly analyze the audio part. If someone has tested this, @bokesyo @HwwwwwwwH, please let me know.

Jiltseb avatar Mar 11 '25 17:03 Jiltseb

Your question "what animal was in the video" is in audio format, i.e. the audio file itself is your question? Can I specify the question (prompt) in text format?

Try this code. I modified get_video_chunk_content to get_video_audio_chunk_content, which now accepts video and audio inputs separately. I asked the model what animal was in the video, and it answered correctly.

Yes, you can, but as I mentioned, the audio part is not analysed properly (or at all) in my experiments.

Jiltseb avatar Mar 11 '25 17:03 Jiltseb

Your question "what animal was in the video" is in audio format, i.e. the audio file itself is your question? Can I specify the question (prompt) in text format?

Try this code. I modified get_video_chunk_content to get_video_audio_chunk_content, which now accepts video and audio inputs separately. I asked the model what animal was in the video, and it answered correctly.

Yes, I provide a video file and an audio file. Audio file: "what animal was in the video?" Video file: a dog playing.

Model output: there is a dog

@caijimin

lsy1973 avatar May 14 '25 01:05 lsy1973

For get_video_audio_chunk_content(video_path, audio_path, flatten=True): I only have a video file and don't know what to pass as the second argument. Also, could you give an example of inference through the vLLM server, for example (hypothetical):

    contents = get_video_audio_chunk_content("t.mp4")
    messages = [{
        'role': 'user',
        'content': f"{contents}\n Please describe this video."
    }]

    chat_response = client.chat.completions.create(
        model="model",
        messages=messages,
        extra_body={"stop_token_ids": [151645, 151643]}
    )

    print("Chat response content:", chat_response.choices[0].message.content)

whk6688 avatar Aug 21 '25 08:08 whk6688

I switched to this instead:

    def get_video_chunk_content(video_path, flatten=True):
        video = VideoFileClip(video_path)
        print('video_duration:', video.duration)

        with tempfile.NamedTemporaryFile(suffix=".wav", delete=True) as temp_audio_file:
            temp_audio_file_path = temp_audio_file.name
            video.audio.write_audiofile(temp_audio_file_path, codec="pcm_s16le", fps=16000)
            audio_np, sr = librosa.load(temp_audio_file_path, sr=16000, mono=True)
        num_units = math.ceil(5)

        # 1 frame + 1s audio chunk
        contents = []
        for i in range(num_units):
            frame = video.get_frame(i + 1)
            image = Image.fromarray(frame.astype(np.uint8))
            audio = audio_np[sr * i:sr * (i + 1)]
            if flatten:
                contents.extend(["<unit>", image, audio])
            else:
                contents.append(["<unit>", image, audio])

        return contents

and I get this error: TypeError: Object of type Image is not JSON serializable
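
For what it's worth, a hedged sketch (my assumption, not an official answer): PIL Image and numpy objects cannot be sent through the OpenAI-compatible HTTP API as-is, which is what the JSON-serialization error points to. Each frame would need to be serialized, for example as a base64 image_url entry; whether the server-side chat template then handles per-second audio as well is a separate, open question:

    # Hedged sketch: serialize each PIL frame as a base64 data URL for the HTTP API.
    import base64
    import io

    def image_to_data_url(image):
        buf = io.BytesIO()
        image.save(buf, format="JPEG")
        return "data:image/jpeg;base64," + base64.b64encode(buf.getvalue()).decode("utf-8")

    # `frames` is assumed to be a list of PIL images sampled from the video (e.g. 1 per second).
    content = [{"type": "text", "text": "Please describe this video."}]
    for image in frames:
        content.append({"type": "image_url", "image_url": {"url": image_to_data_url(image)}})

    messages = [{"role": "user", "content": content}]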

whk6688 avatar Aug 21 '25 10:08 whk6688

If I use:

    data = {
        "model": "model",
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful assistant.",
            },
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "请描述这个视频"},  # "Please describe this video"
                    {
                        "type": "video_url",
                        "video_url": {
                            "url": f"data:video/mp4;base64,{video_base64}",
                        },
                    },
                ],
            },
        ],
        "max_tokens": 200,
        "temperature": 0,
        "stop_token_ids": [151645, 151643]
    }

vLLM just hangs with no response.

whk6688 avatar Aug 21 '25 10:08 whk6688

This issue has been without new discussion for quite some time, so I'm closing it now. If you have any questions, please feel free to open a new issue to discuss them.

tc-mb avatar Nov 14 '25 11:11 tc-mb