How to fine-tune the model for streaming inference?
Thank you for your work on streaming analysis in the MiniCPM-o model. After using this model for live traffic video stream analysis, I found that it gives poor results: it can't remember the previous input, and it sometimes fails to follow instructions to produce structured output. It may be that the way I conduct streaming inference is wrong, so I have also attached my code in the last part.
Now I wonder whether there is a suitable way to improve the streaming inference ability of MiniCPM-o. I checked docs like https://github.com/OpenBMB/MiniCPM-o/blob/main/docs/llamafactory_train_and_infer.md, but they only describe fine-tuning on images or complete videos, not video streams. So how can I fine-tune the model with video stream input and the corresponding ground truth? Thanks in advance.
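To make the target concrete, here is a sketch of how I imagine the streaming supervision could be expressed: one user turn per video chunk, interleaved with the expected answer after each chunk. The `conversations`/`videos` field names are my assumption (a generic ShareGPT-style layout), not a documented MiniCPM-o or LLaMA-Factory streaming format:

```python
import json

def make_stream_sample(chunk_paths, answers):
    """Build one hypothetical training sample: a multi-turn conversation
    that interleaves a video-chunk placeholder with the expected
    structured answer for that chunk."""
    conversations = []
    for path, answer in zip(chunk_paths, answers):
        conversations.append({"role": "user", "content": "<video>"})  # one chunk per turn
        conversations.append({"role": "assistant", "content": json.dumps(answer)})
    return {"conversations": conversations, "videos": chunk_paths}

sample = make_stream_sample(
    ["chunk_000.mp4", "chunk_001.mp4"],
    [{"current_num": 1, "previous_num": 0},
     {"current_num": 1, "previous_num": 1}],
)
```

If this kind of interleaved sample were supported, the loss would only be computed on the assistant turns, so the model would learn to answer after every prefilled chunk.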
Here is my code for video stream analysis:

```python
import time

import torch
from transformers import AutoModel, AutoTokenizer

video_path = "./videos/test_360.mp4"
model = AutoModel.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True,
                                  attn_implementation='sdpa', torch_dtype=torch.bfloat16)
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)
model.init_tts()

# A new conversation needs reset_session() first; it resets the KV cache.
model.reset_session()
prompt = """You are a helpful assistant.
How many pedestrians are in the current frame and in the previous frame?
Answer in exactly this format:
{ 'current_num' : number,
  'previous_num' : number }
If there is no previous frame, set previous_num to 0.
"""
sys_msg = {
    "role": "system",
    "content": [prompt],
}
# get_video_chunk_content() is the helper from the MiniCPM-o streaming
# example, which splits the video into per-second image/audio chunks.
contents = get_video_chunk_content(video_path, flatten=False)
session_id = '123'
generate_audio = False
# 1. prefill system prompt
res = model.streaming_prefill(
session_id=session_id,
msgs=[sys_msg],
tokenizer=tokenizer
)
# 2. prefill each incoming video/audio chunk, then generate after it
for content in contents:
    msgs = [{"role": "user", "content": content}]

    st = time.time()
    res = model.streaming_prefill(
        session_id=session_id,
        msgs=msgs,
        tokenizer=tokenizer
    )
    print("Prefill time:", time.time() - st)

    # 3. generate an answer for the current chunk
    res = model.streaming_generate(
        session_id=session_id,
        tokenizer=tokenizer,
        temperature=1e-6,
        generate_audio=generate_audio
    )

    text = ""
    st = time.time()
    for r in res:
        text += r['text']
    print("text:", text)
    print("time:", time.time() - st)
```
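On the structured-output side, the answers the model does produce are single-quoted Python-style dicts, so `json.loads` rejects them. As a client-side workaround (my own helper, not part of the MiniCPM-o API) I parse them with `ast.literal_eval`:

```python
import ast

def parse_answer(text):
    """Parse the model's pseudo-JSON answer.

    The model emits a single-quoted Python-style dict followed by a
    <|tts_eos|> token, so json.loads fails; ast.literal_eval handles
    the dict literal safely (no arbitrary code execution)."""
    return ast.literal_eval(text.replace("<|tts_eos|>", "").strip())
```

For example, `parse_answer("{ 'current_num' : 1,\n'previous_num' : 0 }<|tts_eos|>")` returns `{'current_num': 1, 'previous_num': 0}`.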
The output looks like:

```text
text: { 'current_num' : 0,
'previous_num' : 0 }<|tts_eos|>
time: 0.684617280960083
Prefill time: 0.09046006202697754
text: { 'current_num' : 0,
'previous_num' : 0 }<|tts_eos|>
time: 0.5194647312164307
Prefill time: 0.08741331100463867
text: { 'current_num' : 1,
'previous_num' : 0 }<|tts_eos|>
time: 0.5168383121490479
Prefill time: 0.08686423301696777
text: { 'current_num' : 1,
'previous_num' : 0 }<|tts_eos|>
time: 0.5230734348297119
```

The ground truth we would prefer is:

```text
text: { 'current_num' : 1,
'previous_num' : 0 }<|tts_eos|>
text: { 'current_num' : 1,
'previous_num' : 1 }<|tts_eos|>
text: { 'current_num' : 1,
'previous_num' : 1 }<|tts_eos|>
text: { 'current_num' : 0,
'previous_num' : 1 }<|tts_eos|>
```
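Until fine-tuning on streams is possible, a client-side workaround I'm considering for the forgotten previous frame (my own sketch, not part of the MiniCPM-o API): ask the model only for the current frame's count and keep the previous count in my own code, assembling the structured answer outside the model:

```python
import ast

class PedestrianTracker:
    """Keeps the previous frame's count outside the model, so the
    structured answer no longer relies on the model's memory."""

    def __init__(self):
        self.previous_num = 0  # no previous frame yet -> 0, as in the prompt

    def update(self, model_text):
        # Parse the model's single-quoted dict, e.g. "{ 'current_num' : 1 }"
        cleaned = model_text.replace("<|tts_eos|>", "").strip()
        current = ast.literal_eval(cleaned)["current_num"]
        answer = {"current_num": current, "previous_num": self.previous_num}
        self.previous_num = current  # becomes "previous" for the next chunk
        return answer
```

With this, `tracker.update("{ 'current_num' : 1 }<|tts_eos|>")` on the first chunk returns `{'current_num': 1, 'previous_num': 0}`, and the next call sees `previous_num` as 1. This assumes the prompt is changed to ask only for the current count.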