Human anomaly raw video dataset
Hi,
Thanks for the excellent work. I am exploring the ViT-Anomaly detector and want to see the original videos that you used to extract the human, hand, and face samples. The cropped dataset is available via git clone https://huggingface.co/datasets/Vchitect/VBench-2.0_human_anomaly, but I want to see the actual source videos and investigate the performance of YOLO-World. The paper mentions 1000 videos from COCO and 1000 generated videos.
Can you please make that available too?
Thanks in advance.
Hi, please follow the instructions at https://github.com/Vchitect/VBench/tree/master/VBench-2.0/vbench2/third_party/ViTDetector#steps-to-train to download and unzip the folder. The source videos are in the "src_video" folder.
Hi @zhengdian1, thanks! I see the source videos. Another question: I was able to train the model using the codebase. To double-check that the model is trained correctly, have you released the checkpoints for checkpoint/human/ckpt29.pth, checkpoint/face/ckpt29.pth, and checkpoint/hand/ckpt29.pth?
Yes, please run the script VBench-2.0/pretrained/anomaly_detector/download.sh.
For the other models, please see the instructions here: https://github.com/Vchitect/VBench/tree/master/VBench-2.0#pretrained-models
Thanks! That's very helpful.
Hi, another quick question @zhengdian1: how did you obtain the training images from the videos? Did you sample frames randomly from each video? What were the heuristics? And how many frames did you sample?
Please see our paper for details.
Hi @zhengdian1, I don't see this detail in the paper. The paper mentions that frames are extracted from 1000 real and 1000 generated videos.
Sorry if I missed anything.
Hi @zhengdian1, can you elaborate on the data curation process from the videos? Did you sample random frames, or is there a more systematic way of extracting the frames? Also, which samples did you obtain from the HumanRefiner [19] paper?
Hi, thanks for your question. First, we used YOLO-World to detect three categories of object patches from all video frames. To remove redundancy, we computed SSIM between patches from the same video, discarded those with similarity above 0.7, and then randomly sampled from the remaining ones. These patches, combined with those from HumanRefiner, formed our final dataset.
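To make the de-duplication step concrete, here is a minimal sketch. It assumes patches have already been cropped by YOLO-World and grouped per source video, uses scikit-image for SSIM, and resolves duplicates with a greedy keep-or-discard loop; the patch size, helper names, and greedy strategy are illustrative assumptions, not the exact implementation.

```python
# Minimal sketch of the SSIM-based de-duplication described above.
# Assumptions (not from the repo): patches are already cropped by YOLO-World,
# grouped per source video, and resized to a common size; scikit-image is used.
import random
import numpy as np
from skimage.metrics import structural_similarity as ssim
from skimage.transform import resize

SSIM_THRESHOLD = 0.7  # patches more similar than this are treated as duplicates

def dedup_patches(patches, size=(224, 224)):
    """Greedily keep patches whose SSIM with every already-kept patch is <= 0.7."""
    kept, kept_gray = [], []
    for patch in patches:  # patches: list of HxWx3 arrays from one video
        gray = np.mean(resize(patch, size), axis=-1)  # resize returns floats in [0, 1]
        if all(ssim(gray, ref, data_range=1.0) <= SSIM_THRESHOLD for ref in kept_gray):
            kept.append(patch)
            kept_gray.append(gray)
    return kept

def sample_patches(patches, k):
    """Randomly sample up to k de-duplicated patches (the exact count is not stated above)."""
    return random.sample(patches, min(k, len(patches)))
```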
For the HumanRefiner dataset specifically, we selected the following categories (see the sketch after this list):
- Head class: category_id = 1 (negative), 11 (positive)
- Hand class: category_id = 5 (negative), 15 (positive)
- Human class: all patches except category_id = 9 were treated as positive and negative samples
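A rough sketch of this selection, assuming the HumanRefiner annotations are stored as COCO-style JSON with an "annotations" list and a category_id field; the file layout and field names are illustrative assumptions, and only the category IDs come from the list above.

```python
# Rough sketch of the HumanRefiner category selection described above.
# Assumed COCO-style JSON layout ("annotations" entries with "category_id");
# only the category IDs themselves come from the reply.
import json

HEAD_IDS = {"negative": {1}, "positive": {11}}
HAND_IDS = {"negative": {5}, "positive": {15}}
HUMAN_EXCLUDED_ID = 9  # every other category contributes to the human class

def split_humanrefiner(annotation_path):
    with open(annotation_path) as f:
        anns = json.load(f)["annotations"]

    head = {k: [a for a in anns if a["category_id"] in ids] for k, ids in HEAD_IDS.items()}
    hand = {k: [a for a in anns if a["category_id"] in ids] for k, ids in HAND_IDS.items()}
    human = [a for a in anns if a["category_id"] != HUMAN_EXCLUDED_ID]
    return head, hand, human
```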