Human anomaly raw video dataset
Hi,
Thanks for the excellent work. I am exploring the ViT-Anomaly detector and want to see the original videos that you used to extract the human, hand, and face samples. The cropped dataset is available via git clone https://huggingface.co/datasets/Vchitect/VBench-2.0_human_anomaly, but I want to see the actual source videos and investigate the performance of YOLO-World. The paper mentions 1000 videos from COCO and 1000 generated videos.
Can you please make that available too?
Thanks in advance.
Hi, please follow the instructions at https://github.com/Vchitect/VBench/tree/master/VBench-2.0/vbench2/third_party/ViTDetector#steps-to-train to download and unzip the folder. The source videos are in the "src_video" folder.
Hi @zhengdian1, thanks! I see the source videos. Another question: I was able to train the model using the codebase. To double-check that the model is trained correctly, have you released the checkpoints for checkpoint/human/ckpt29.pth, checkpoint/face/ckpt29.pth, and checkpoint/hand/ckpt29.pth?
Yes, please run the script VBench-2.0/pretrained/anomaly_detector/download.sh.
For the other models, please see the instructions here: https://github.com/Vchitect/VBench/tree/master/VBench-2.0#pretrained-models
Thanks! That's very helpful.
Hi, another quick question @zhengdian1: how did you obtain the training images from the videos? Did you sample frames randomly from each video? What were the heuristics? And how many frames did you sample?
Please see our paper for details.
Hi @zhengdian1, I don't see this detail in the paper. The paper mentions that frames are extracted from 1000 real and 1000 generated videos.
Sorry if I missed anything.
Hi @zhengdian1, can you elaborate on the data curation process from the videos? Did you sample random frames, or is there a more systematic way of extracting the frames? Also, which samples did you obtain from the HumanRefiner [19] paper?
Hi, thanks for your question. First, we used YOLO-World to detect three categories of object patches from all video frames. To remove redundancy, we computed SSIM between patches from the same video, discarded those with similarity above 0.7, and then randomly sampled from the remaining ones. These patches, combined with those from HumanRefiner, formed our final dataset.
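To make the de-duplication step concrete, here is a minimal sketch. It assumes patches have already been cropped by YOLO-World and grouped per source video, uses scikit-image for SSIM, and resolves duplicates with a greedy keep-or-discard loop; the patch size, helper names, and greedy strategy are illustrative assumptions, not the exact implementation.

```python
# Minimal sketch of the SSIM-based de-duplication described above.
# Assumptions (not from the repo): patches are already cropped by YOLO-World,
# grouped per source video, and resized to a common size; scikit-image is used.
import random
import numpy as np
from skimage.metrics import structural_similarity as ssim
from skimage.transform import resize

SSIM_THRESHOLD = 0.7  # patches more similar than this are treated as duplicates

def dedup_patches(patches, size=(224, 224)):
    """Greedily keep patches whose SSIM with every already-kept patch is <= 0.7."""
    kept, kept_gray = [], []
    for patch in patches:  # patches: list of HxWx3 arrays from one video
        gray = np.mean(resize(patch, size), axis=-1)  # resize returns floats in [0, 1]
        if all(ssim(gray, ref, data_range=1.0) <= SSIM_THRESHOLD for ref in kept_gray):
            kept.append(patch)
            kept_gray.append(gray)
    return kept

def sample_patches(patches, k):
    """Randomly sample up to k de-duplicated patches (the exact count is not stated above)."""
    return random.sample(patches, min(k, len(patches)))
```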
For the HumanRefiner dataset specifically, we selected the following categories (see the sketch after this list):
- Head class: category_id = 1 (negative), 11 (positive)
- Hand class: category_id = 5 (negative), 15 (positive)
- Human class: all patches except category_id = 9 were treated as positive and negative samples
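A rough sketch of this selection, assuming the HumanRefiner annotations are stored as COCO-style JSON with an "annotations" list and a category_id field; the file layout and field names are illustrative assumptions, and only the category IDs come from the list above.

```python
# Rough sketch of the HumanRefiner category selection described above.
# Assumed COCO-style JSON layout ("annotations" entries with "category_id");
# only the category IDs themselves come from the reply.
import json

HEAD_IDS = {"negative": {1}, "positive": {11}}
HAND_IDS = {"negative": {5}, "positive": {15}}
HUMAN_EXCLUDED_ID = 9  # every other category contributes to the human class

def split_humanrefiner(annotation_path):
    with open(annotation_path) as f:
        anns = json.load(f)["annotations"]

    head = {k: [a for a in anns if a["category_id"] in ids] for k, ids in HEAD_IDS.items()}
    hand = {k: [a for a in anns if a["category_id"] in ids] for k, ids in HAND_IDS.items()}
    human = [a for a in anns if a["category_id"] != HUMAN_EXCLUDED_ID]
    return head, hand, human
```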