Missing sparse video feature extraction module
To run inference on a custom video dataset, we need to sparsely extract video features the same way you do in order to get good results. It would be great if you could make that module available in the repo.
Hi, thanks for your interest. I have uploaded the related code (for reference only). To extract region features, you need to sample frames in the same way and then use the tool provided by BUTD.
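For readers unsure what "sample frames in the same way" means in practice: a common sparse-sampling scheme is to divide each video into a fixed number of clips and take a few evenly spaced frames from each clip. The function below is only an illustrative sketch of that idea (the clip/frame counts and the function name are my own assumptions, not taken from the repo's code):

```python
def sample_frame_indices(num_frames, num_clips=8, frames_per_clip=4):
    """Sparsely sample frames: split the video into `num_clips`
    equal-length clips and take `frames_per_clip` evenly spaced
    frames from each clip. Returns a flat, ordered list of indices.
    NOTE: the default counts here are hypothetical, not the repo's."""
    indices = []
    clip_len = num_frames / num_clips
    for c in range(num_clips):
        start = c * clip_len
        for f in range(frames_per_clip):
            # center of the f-th sub-segment inside clip c
            pos = start + (f + 0.5) * clip_len / frames_per_clip
            indices.append(min(int(pos), num_frames - 1))
    return indices
```

For a 100-frame video this yields 32 frame indices (8 clips x 4 frames), which you would then feed to the BUTD extraction tool.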
Thank you very much for providing them. It would also be helpful if you could add some documentation to the files and functions, so that we can better understand the starting point and the steps to follow in order to extract features properly.
Basically, you can follow this coarse pipeline: extract_video.py (decode mp4 files into frames) -> preprocess_feature.py (sample the frames and encode them into CNN representations) -> split_dataset_feat.py (split the features into train/val/test).
That's so helpful. Thanks for explaining it.
Which mode, 'caffe' or 'd2', did you use to extract the regional features?
Please choose resnet-101 with d2.
@doc-doc It seems that object_align.py does not provide a complete method for obtaining the bounding boxes; it directly reads region_8c10b_{}.h5. Is there complete code that detects the bounding boxes and then writes them to region_8c10b_{}.h5?