YOLO-World inference on video

I am trying to use a video frame as the input. However, I found that the code takes an image path as the argument fed to the runner. Is it possible to pass the frame directly, instead of saving each frame as an image and then using that file as input?

Mar 25 '24 09:03 HeChengHui

I have the same need. Have you solved it?

Mar 25 '24 09:03 KingBoyAndGirl

I have the same question. Is there any solution available? Thank you.

Apr 18 '24 15:04 LLH-Harward

@LLH-Harward @KingBoyAndGirl I work around this problem with the code below. I create a virtual file path, tmp_filename, in memory, so the runner does not need to go through the disk.

import tempfile
from PIL import Image

# Read the video frame with OpenCV (a numpy array), then convert the numpy array to a PIL image
pil_image = Image.fromarray(image)
# Save the PIL image to the given path
# pil_image.save(image_path)
with tempfile.NamedTemporaryFile(delete=False, suffix='.png') as tmp_file:
    # Save the image to a temporary file
    pil_image.save(tmp_file, format='PNG')
    tmp_filename = tmp_file.name
texts = [[t.strip()] for t in text.split(',')] + [[' ']]
data_info = dict(img_id=0, img_path=tmp_filename, texts=texts)
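One caveat on the snippet above: tempfile.NamedTemporaryFile creates a real file in the system temporary directory, and with delete=False that file stays there until it is removed explicitly. The sketch below uses only the standard library to show this. The helper name write_temp_image and the placeholder bytes are hypothetical and stand in for the encoded frame.

```python
import os
import tempfile


def write_temp_image(data: bytes, suffix: str = ".png") -> str:
    """Write already-encoded image bytes to a named temporary file and return its path."""
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp_file:
        tmp_file.write(data)
        return tmp_file.name


# Placeholder bytes standing in for an encoded frame (hypothetical data).
path = write_temp_image(b"placeholder bytes standing in for an encoded frame")
try:
    print(os.path.exists(path))  # the file exists on disk at this point
    # ... hand `path` to the runner here ...
finally:
    os.remove(path)              # delete=False means cleanup is manual
print(os.path.exists(path))      # the file is gone after os.remove
```

When this runs once per frame, removing each file after inference keeps the temporary directory from filling up.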

Apr 26 '24 06:04 tomgotjack

Hi all (@HeChengHui, @KingBoyAndGirl, @LLH-Harward, @tomgotjack), the latest update supports video inference. Give it a try! See demo/video_demo.py.

Apr 28 '24 08:04 wondervictor

@wondervictor I ran the deploy/onnx_demo.py you provided. When the code reaches: for frame in track_iter_progress(video_reader): it raises the following error:

Traceback (most recent call last):
  File "E:\YOLO\YOLO-World\video_demo.py", line 148, in <module>
    main()
  File "E:\YOLO\YOLO-World\video_demo.py", line 113, in main
    for frame in track_iter_progress(video_reader):
  File "D:\miniconda3\envs\yolo\lib\site-packages\mmengine\utils\progressbar.py", line 240, in track_iter_progress
    raise TypeError(
TypeError: "tasks" must be a tuple object or a sequence object, but got <class 'mmcv.video.io.VideoReader'>

I replaced it with:

for frame in video_reader:

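The code then ran successfully, but very slowly. I fed in a 4-minute 1080P video with 5690 frames in total, and inference took 2054.7499754428864 seconds, i.e. about 34 minutes. Is there any way to improve the efficiency?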
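To put the reported numbers in per-frame terms, here is a small worked calculation. It uses only the figures quoted in the comment above; the video length is taken as exactly 4 minutes, which is an approximation.

```python
# Figures quoted in the comment above.
total_frames = 5690
video_seconds = 4 * 60  # "4-minute" video, treated as exactly 240 s (approximation)
inference_seconds = 2054.7499754428864

seconds_per_frame = inference_seconds / total_frames
frames_per_second = total_frames / inference_seconds
slowdown = inference_seconds / video_seconds

print(f"{seconds_per_frame:.3f} s per frame")          # 0.361 s per frame
print(f"{frames_per_second:.2f} frames per second")    # 2.77 frames per second
print(f"{slowdown:.1f}x slower than real time")        # 8.6x slower than real time
```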

Apr 28 '24 10:04 tomgotjack

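You can change it like this:

import sys
frames = [frame for frame in video_reader]
for frame in track_iter_progress(frames, file=sys.stdout):
    ...  # per-frame inference as before

My 10 s video took 116 s to run.

The v2-x model provided by the inference library runs very fast, but it does not support the newly released weights.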
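A minimal sketch of why the list workaround gets past the TypeError, and what it costs. The error message says the argument must be a tuple or a sequence; an object that is sized and indexable still fails an isinstance check against collections.abc.Sequence unless it subclasses or registers with it. FrameReader below is a hypothetical stand-in, not the mmcv class, and the memory estimate assumes 8-bit, 3-channel 1920x1080 frames.

```python
from collections.abc import Sequence


class FrameReader:
    """Hypothetical stand-in for a video reader: sized and indexable,
    but not registered as a collections.abc.Sequence."""

    def __init__(self, frames):
        self._frames = list(frames)

    def __len__(self):
        return len(self._frames)

    def __getitem__(self, index):
        return self._frames[index]


def frame_bytes(width, height, channels=3):
    """Size in bytes of one decoded 8-bit frame."""
    return width * height * channels


reader = FrameReader(["frame0", "frame1", "frame2"])

# An isinstance check against Sequence rejects the reader itself...
print(isinstance(reader, Sequence))   # False
# ...but accepts a list built from it, which is why the workaround runs.
frames = [frame for frame in reader]
print(isinstance(frames, Sequence))   # True

# Cost of holding every decoded frame of a 5690-frame 1080p video at once.
total = frame_bytes(1920, 1080) * 5690
print(f"{total / 1e9:.1f} GB")        # 35.4 GB
```

Because the list keeps every decoded frame in memory, it only suits short clips. The error message also mentions a tuple form; if the installed track_iter_progress accepts an (iterable, length) tuple, passing (video_reader, len(video_reader)) might avoid the list entirely, but that is an assumption to check against your mmengine version.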

Apr 29 '24 02:04 LLH-Harward

@LLH-Harward Hello, how do you run the v2-x model provided by the inference library? I am using my own fine-tuned model.

Apr 29 '24 02:04 tomgotjack

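Hello, you can follow this example, which is implemented with supervision + inference. However, I have not found how to load a fine-tuned model with inference. If there is a way, please let me know as well. https://huggingface.co/spaces/SkalskiP/YOLO-World/tree/main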

Apr 29 '24 03:04 LLH-Harward

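@LLH-Harward Thanks, I will take a look at this later. For now I have built a simple interface that can load a video or use a camera, although the resolution is only 240P. Here is the result: https://www.bilibili.com/video/BV14T421X72d/?spm_id_from=333.1365.list.card_archive.click&vd_source=0c335752a9ae5c749d91670cca8575ac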

Apr 29 '24 03:04 tomgotjack

OK. Why is the resolution currently limited to 240p?

Apr 29 '24 03:04 LLH-Harward

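Inference speed depends on the image resolution. In my tests, a 240P image takes 0.09 s to infer, while a 1080P image takes 0.33 s. I want to use a camera, which requires real-time inference, so speed matters a lot. At 240P I can reach roughly 10 frames per second. Combined with frame skipping, that gives a result that is just about watchable. At a higher resolution it stutters too much to watch. My GPU is a 2060; with a better GPU, inference would be faster and the resolution could be raised.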
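The trade-off described above can be sketched as a small calculation: given a per-frame latency, how many frames per second can be processed, and how many camera frames must be skipped to keep up. The 30 fps camera rate and the helper names are assumptions for illustration; only the two latencies come from the comment.

```python
import math


def achievable_fps(latency_s):
    """Frames per second the model can process at a given per-frame latency."""
    return 1.0 / latency_s


def frame_stride(camera_fps, latency_s):
    """Process every n-th frame so that inference keeps up with the camera."""
    return max(1, math.ceil(camera_fps * latency_s))


CAMERA_FPS = 30  # assumed camera rate, not stated in the thread

# Latencies measured in the comment above (on a 2060 GPU).
for label, latency in (("240P", 0.09), ("1080P", 0.33)):
    fps = achievable_fps(latency)
    stride = frame_stride(CAMERA_FPS, latency)
    print(f"{label}: {fps:.1f} fps, process every {stride} frame(s)")
```

Under these assumptions, 240P needs every third frame while 1080P needs every tenth, which matches the observation that higher resolutions become too choppy for live use.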

Apr 29 '24 04:04 tomgotjack

Understood, thanks a lot.

Apr 29 '24 04:04 LLH-Harward

@LLH-Harward @tomgotjack I will provide some solutions later to speed up this part.

Apr 29 '24 09:04 wondervictor

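OK, thank you very much. I also have a new question: when I run inference with the command below, the resulting output.mp4 does not detect according to my text prompt "person,book,laptop,bottle,ipad,pen,phone,bag". Instead it seems to use the categories of LVIS or Objects365 (it reports many lamp detections).

What is going wrong? Do I need to modify the relevant JSON in the config file so that YOLO-World infers according to my input text?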

python video_demo.py D:\YOLO-World-master\configs\pretrain\yolo_world_v2_x_vlpan_bn_2e-3_100e_4x8gpus_obj365v1_goldg_train_1280ft_lvis_minival.py D:\YOLO-World-master\pretrained_weights\yolo_world_v2_x_obj365v1_goldg_cc3mlite_pretrain_1280ft-14996a36.pth D:\YOLO-World-master\data\demo_cut.mp4 "person,book,laptop,bottle,ipad,pen,phone,bag" --out output.mp4 --score-thr 0.3

Apr 29 '24 09:04 LLH-Harward

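@LLH-Harward I ran into the same problem, but I was so focused on measuring speed that I forgot about it. I tested a video with two classes, person and vehicle. The detected objects were correct, but the output class names were ["person"] and ["bicycle"]. These happen to be the first two of the 80 COCO classes, so you could start looking for the cause there.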

Apr 29 '24 09:04 tomgotjack

It seems to be a problem with the visualizer. The visualizer gets dataset_meta from the checkpoint:

visualizer = VISUALIZERS.build(model.cfg.visualizer)
# the dataset_meta is loaded from the checkpoint and
# then pass to the model in init_detector
visualizer.dataset_meta = model.dataset_meta

In mmdet\visualization\local_visualizer.py I see: classes = self.dataset_meta.get('classes', None)

So the classes in the visualizer are taken directly from the pretrained checkpoint rather than from the given texts. @wondervictor @tomgotjack
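A minimal sketch of the mismatch described above: the detector's label index refers to a position in the prompt, but the name drawn on the frame is looked up in the checkpoint's class list. The prompt "person, car" and the helper names are assumptions for illustration; the prompt-splitting expression is the one from the snippet earlier in this thread, and only the first three COCO class names are listed.

```python
# First entries of the COCO class list (the full list has 80 names).
COCO_CLASSES_HEAD = ["person", "bicycle", "car"]


def parse_prompt(text):
    """Split a comma-separated prompt the way the earlier snippet in this thread does."""
    return [[t.strip()] for t in text.split(",")] + [[" "]]


def class_names(texts):
    """Flatten the prompt list into display names, dropping the blank padding entry."""
    return [t[0] for t in texts if t[0].strip()]


texts = parse_prompt("person, car")  # hypothetical two-class prompt
predicted_label = 1                  # index of "car" in the prompt

# Name looked up in the checkpoint's classes: the wrong label is drawn.
print(COCO_CLASSES_HEAD[predicted_label])    # bicycle
# Name looked up in the prompt: the intended label.
print(class_names(texts)[predicted_label])   # car
```

One possible workaround, not verified here, would be to give the visualizer a dataset_meta whose 'classes' entry is class_names(texts) instead of the checkpoint's metadata.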

Apr 29 '24 14:04 LLH-Harward

The Visualizer can be changed. Also, I plan to stop using the Visualizer in an upcoming update.

Apr 29 '24 15:04 wondervictor

Great, looking forward to your next update!

Apr 30 '24 01:04 LLH-Harward

@wondervictor Thank you for adding video support!

I have some questions about model.reparameterize(texts).

Do I have to run this command for every frame? Does the model not stay configured for that text?
If I have two classes to detect, A and B, do I have to run that command every time I switch to detecting a different class?

May 27 '24 09:05 HeChengHui

Regarding the suggestion above:

for frame in track_iter_progress(frames, file=sys.stdout):

Hello, what does file=sys.stdout refer to here?

Mar 11 '25 02:03 moonlightnoodles

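I asked DeepSeek, and this is the answer: sys.stdout is an object in the sys module of the Python standard library that represents the standard output stream (usually the screen). It is used to write text to the console.

In track_iter_progress(frames, file=sys.stdout), file=sys.stdout specifies where the progress information is written. By default, progress information is printed to the console. If you want to redirect the output somewhere else (such as a file), you can set the file parameter to another file object. For example: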

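with open('output.txt', 'w') as f:
    # the wrapper has to be iterated for any progress to be written
    for frame in track_iter_progress(frames, file=f):
        ...  # process each frame here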

This way, the progress information is written to output.txt instead of being shown on the console.
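The file parameter can be illustrated with a self-contained stand-in. iter_with_progress below is a hypothetical generator written for this sketch, not the mmengine function; it yields each item and writes a progress line to whatever file object it is given. It also shows why such a wrapper has to be driven by a for loop: calling a generator function on its own does not run any of its body.

```python
import io
import sys


def iter_with_progress(items, file=sys.stdout):
    """Hypothetical progress wrapper: yields each item, then writes a
    progress line to `file` once the caller has handled it."""
    total = len(items)
    for done, item in enumerate(items, start=1):
        yield item
        file.write(f"{done}/{total}\n")


buffer = io.StringIO()  # any object with a write() method works, e.g. an open file

# Calling the generator function alone runs nothing, so nothing is written.
iter_with_progress(["a", "b"], file=buffer)
print(repr(buffer.getvalue()))  # ''

# Iterating drives the loop, and progress goes to the chosen file object.
for item in iter_with_progress(["a", "b"], file=buffer):
    pass
print(repr(buffer.getvalue()))  # '1/2\n2/2\n'
```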

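It has been too long since I worked on this project and I no longer remember the details, so please take a look at the AI's explanation.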

Mar 11 '25 02:03 tomgotjack

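That was a quick reply (I realized this was a silly question I could have solved myself, and I was on my way to delete it). Thank you for your reply.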

Mar 11 '25 03:03 moonlightnoodles
