
How to fine-tune the model for streaming inference

Open Merealtea opened this issue 8 months ago • 0 comments

Thank you for your work on streaming analysis in the MiniCPM-o model. However, after using this model for live traffic video stream analysis, I found that it gives poor results: it cannot remember the previous input, and it sometimes fails to follow instructions to produce structured output. It may be that the way I conduct streaming inference is wrong, so I also attach my code in the last part.

Now I wonder whether there is a suitable way to improve the MiniCPM-o model's streaming inference ability. I checked docs like https://github.com/OpenBMB/MiniCPM-o/blob/main/docs/llamafactory_train_and_infer.md, but they only describe fine-tuning on images or complete videos, not on video streams. How can I fine-tune the model with video stream input and the corresponding ground truth? Thanks in advance.
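For context on what such training data might look like: the linked docs do not cover streaming, but one possible starting point is to express the streaming supervision as a multi-turn conversation, with one user turn per video chunk and one assistant turn per expected structured answer, so the model learns to carry state (previous_num) across turns. This is a sketch under assumptions: the field names and file layout below follow the common sharegpt-style multi-turn format, not an officially supported streaming format, and the `<video_chunk_N>` placeholders are hypothetical.

```python
import json

# Hypothetical training record: each video chunk is one user turn, and the
# expected structured count is the matching assistant turn. The placeholder
# tokens and field names here are assumptions, not an official format.
record = {
    "conversations": [
        {"from": "human", "value": "<video_chunk_0>How many pedestrians now and before?"},
        {"from": "gpt", "value": "{ 'current_num' : 1, 'previous_num' : 0 }"},
        {"from": "human", "value": "<video_chunk_1>How many pedestrians now and before?"},
        {"from": "gpt", "value": "{ 'current_num' : 1, 'previous_num' : 1 }"},
    ],
    # per-chunk media paths would go in a separate field, e.g. "videos"
}

with open("stream_sft.json", "w") as f:
    json.dump([record], f, indent=2)
```

Whether the trainer actually prefixes the kv-cache across turns the way `streaming_prefill` does at inference time would need confirmation from the maintainers.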

Here is the code for video stream analysis:

import time

import torch
from transformers import AutoModel, AutoTokenizer

video_path = "./videos/test_360.mp4"

model = AutoModel.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16)
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)

model.init_tts()
sys_msg = model.get_sys_prompt(mode='omni', language='en')
# a new conversation needs reset_session() first; it resets the kv-cache
model.reset_session()

prompt = """
    You are a helpful assistant.
    How many pedestrians are in the current frame and in the previous frame?
    Answer me in this format:
    {  'current_num' : number,
       'previous_num' : number }
    If there is no previous frame, set previous_num to 0.
"""
   

# NOTE: this overrides the default omni sys_msg built above
sys_msg = {
    "role": "system",
    "content": [prompt],
}
# get_video_chunk_content is the helper from the official MiniCPM-o-2_6 streaming example
contents = get_video_chunk_content(video_path, flatten=False)
session_id = '123'
generate_audio = False

# 1. prefill system prompt
res = model.streaming_prefill(
    session_id=session_id,
    msgs=[sys_msg], 
    tokenizer=tokenizer
)

# 2. prefill video/audio chunks
for content in contents:
    msgs = [{"role": "user", "content": content}]
    st = time.time()
    res = model.streaming_prefill(
        session_id=session_id,
        msgs=msgs,
        tokenizer=tokenizer
    )

    # 3. generate
    res = model.streaming_generate(
        session_id=session_id,
        tokenizer=tokenizer,
        temperature=1e-6,
        generate_audio=generate_audio
    )
    
    print("Prefill time:", time.time()-st)

    audios = []
    text = ""
    st = time.time()
    for r in res:
        text += r['text']
    print("text:", text)
    print("time:", time.time()-st)
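As an aside on the structured-output issue: because the prompt asks for a single-quoted dict, the replies above are not valid JSON, but they can be parsed with `ast.literal_eval` after stripping the `<|tts_eos|>` marker. This is just a minimal parsing sketch, not part of the MiniCPM-o API:

```python
import ast

def parse_counts(reply: str) -> dict:
    """Parse a single-quoted dict reply such as
    "{ 'current_num' : 1, 'previous_num' : 0 }<|tts_eos|>"."""
    cleaned = reply.replace("<|tts_eos|>", "").strip()
    # literal_eval safely handles the single-quoted, JSON-like dict
    return ast.literal_eval(cleaned)

counts = parse_counts("{  'current_num' : 1, 'previous_num' : 0 }<|tts_eos|>")
# counts == {'current_num': 1, 'previous_num': 0}
```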


# The output will be like:

# text: {  'current_num' : 0,
#     'previous_num' : 0 }<|tts_eos|>
# time: 0.684617280960083
# Prefill time: 0.09046006202697754
# text: {  'current_num' : 0,
#     'previous_num' : 0 }<|tts_eos|>
# time: 0.5194647312164307
# Prefill time: 0.08741331100463867
# text: {  'current_num' : 1,
#    'previous_num' : 0 }<|tts_eos|>
# time: 0.5168383121490479
# Prefill time: 0.08686423301696777
# text: {  'current_num' : 1,
#     'previous_num' : 0 }<|tts_eos|>
# time: 0.5230734348297119

# The ground truth we prefer is:

# text: {  'current_num' : 1,
#     'previous_num' : 0 }<|tts_eos|>
# text: {  'current_num' : 1,
#     'previous_num' : 1 }<|tts_eos|>
# text: {  'current_num' : 1,
#     'previous_num' : 1 }<|tts_eos|>
# text: {  'current_num' : 0,
#     'previous_num' : 1 }<|tts_eos|>
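To make the gap concrete, here is a small scoring sketch comparing the actual outputs above against the preferred ground truth, frame by frame (the values are taken directly from this issue):

```python
# Model outputs and preferred ground truth, copied from the examples above.
predictions = [
    {"current_num": 0, "previous_num": 0},
    {"current_num": 0, "previous_num": 0},
    {"current_num": 1, "previous_num": 0},
    {"current_num": 1, "previous_num": 0},
]
ground_truth = [
    {"current_num": 1, "previous_num": 0},
    {"current_num": 1, "previous_num": 1},
    {"current_num": 1, "previous_num": 1},
    {"current_num": 0, "previous_num": 1},
]

# Exact-match: both counts must be right for a frame to count as correct.
correct = sum(p == g for p, g in zip(predictions, ground_truth))
accuracy = correct / len(ground_truth)
print(f"exact-match accuracy: {accuracy:.2f}")  # 0 of 4 frames match
```

Note that `previous_num` is wrong in every frame, which matches the observation that the model does not remember the previous input.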

Merealtea · Apr 16 '25 05:04