
Human anomaly raw video dataset

Open • thechargedneutron opened this issue 5 months ago • 9 comments

Hi,

Thanks for the excellent work. I am exploring the ViT-based anomaly detection and want to see the original videos you used to extract the human, hand, and face samples. The cropped dataset is available (git clone https://huggingface.co/datasets/Vchitect/VBench-2.0_human_anomaly), but I want to see the actual source videos and investigate the performance of YOLO-World. The paper mentions 1000 videos from COCO and 1000 generated videos.

Can you please make that available too?

Thanks in advance.

thechargedneutron avatar Jul 29 '25 01:07 thechargedneutron

Hi, please follow the instructions at https://github.com/Vchitect/VBench/tree/master/VBench-2.0/vbench2/third_party/ViTDetector#steps-to-train to download and unzip the folder. The source videos are in the 'src_video' folder.

zhengdian1 avatar Jul 29 '25 06:07 zhengdian1

Hi @zhengdian1, thanks! I see the source videos. Another question: I was able to train the model using the codebase. To double-check that the model is trained correctly, do you have released checkpoints for checkpoint/human/ckpt29.pth, checkpoint/face/ckpt29.pth, and checkpoint/hand/ckpt29.pth?

thechargedneutron avatar Jul 29 '25 07:07 thechargedneutron

Yes, please run the script VBench-2.0/pretrained/anomaly_detector/download.sh to download them.

For other models, please see the instructions here: https://github.com/Vchitect/VBench/tree/master/VBench-2.0#pretrained-models
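
For reference, here is a minimal sketch of how one might sanity-check that a downloaded checkpoint loads. The path follows the one mentioned above; the assumption that the file is a plain PyTorch checkpoint (possibly wrapped under a state_dict key) is mine, not the repo's documented format.

```python
# Hedged sketch: verify a downloaded anomaly-detector checkpoint loads.
# Path and key handling are assumptions, not the repo's documented layout.
import torch

ckpt = torch.load("checkpoint/human/ckpt29.pth", map_location="cpu")
# Some training scripts wrap weights under a "state_dict" key; fall back to the raw dict.
state = ckpt.get("state_dict", ckpt) if isinstance(ckpt, dict) else ckpt
print(f"{len(state)} entries, e.g. {list(state)[:3]}")
```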

zhengdian1 avatar Jul 29 '25 07:07 zhengdian1

Thanks! That's very helpful.

thechargedneutron avatar Jul 29 '25 07:07 thechargedneutron

Hi, another quick question @zhengdian1: how did you obtain the training images from the videos? Did you sample frames randomly, and if so, what were the heuristics? How many frames did you sample?

thechargedneutron avatar Jul 31 '25 05:07 thechargedneutron

Please see our paper for details.

zhengdian1 avatar Jul 31 '25 06:07 zhengdian1

Hi @zhengdian1, I don't see this detail in the paper; it only mentions that frames are extracted from 1000 real and 1000 generated videos.

[screenshot of the relevant paper excerpt]

Sorry if I missed anything.

thechargedneutron avatar Jul 31 '25 06:07 thechargedneutron

Hi @zhengdian1, can you elaborate on the data curation process from the videos? Did you sample random frames, or is there a more systematic way of extracting them? Also, which samples did you obtain from the HumanRefiner [19] paper?

thechargedneutron avatar Aug 18 '25 15:08 thechargedneutron

Hi, thanks for your question. First, we used YOLO-World to detect patches of the three object categories from all video frames. To remove redundancy, we computed SSIM between patches from the same video and discarded those with similarity above 0.7, then randomly sampled from the remaining ones. These patches, combined with those from HumanRefiner, formed our final dataset.
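
To make the de-duplication step concrete, here is a minimal sketch of SSIM-based filtering followed by random sampling, assuming patches from one video are already cropped. Only the 0.7 threshold comes from the description above; the patch size, sample count, and function name are illustrative.

```python
# Hedged sketch of the per-video patch de-duplication described above.
# Only the 0.7 SSIM threshold comes from the reply; everything else is assumed.
import random
import cv2
from skimage.metrics import structural_similarity as ssim

def dedup_and_sample(patches, ssim_thresh=0.7, n_samples=10, size=(224, 224)):
    """Keep a patch only if its SSIM to every already-kept patch from the
    same video is <= ssim_thresh, then randomly sample from what remains."""
    kept, kept_gray = [], []
    for patch in patches:  # patches: list of BGR numpy arrays from one video
        gray = cv2.cvtColor(cv2.resize(patch, size), cv2.COLOR_BGR2GRAY)
        if all(ssim(gray, g, data_range=255) <= ssim_thresh for g in kept_gray):
            kept.append(patch)
            kept_gray.append(gray)
    return random.sample(kept, min(n_samples, len(kept)))
```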

For the HumanRefiner dataset specifically, we selected the following categories (a filtering sketch follows the list):

  • Head class: category_id = 1 (negative), 11 (positive)
  • Hand class: category_id = 5 (negative), 15 (positive)
  • Human class: all patches except category_id = 9 were treated as positive and negative samples
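
For illustration, here is a sketch of how such a category split could be applied to a COCO-style HumanRefiner annotation file. The file layout, field names, and helper name are assumptions; only the category IDs come from the list above.

```python
# Hedged sketch: split COCO-style annotations along the category IDs above.
# File layout and field names are assumed; only the IDs come from the reply.
import json

CATEGORY_MAP = {"head": {"neg": 1, "pos": 11}, "hand": {"neg": 5, "pos": 15}}

def split_by_category(annotation_path):
    """Group annotation entries into per-class positive/negative pools."""
    with open(annotation_path) as f:
        anns = json.load(f)["annotations"]
    splits = {cls: {"pos": [], "neg": []} for cls in CATEGORY_MAP}
    splits["human"] = []
    for a in anns:
        cid = a["category_id"]
        for cls, ids in CATEGORY_MAP.items():
            if cid == ids["pos"]:
                splits[cls]["pos"].append(a)
            elif cid == ids["neg"]:
                splits[cls]["neg"].append(a)
        if cid != 9:  # human class: everything except category_id = 9
            splits["human"].append(a)
    return splits
```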

Jacky-hate avatar Aug 19 '25 06:08 Jacky-hate