How to generate the features (such as VGGish or ResNet) used by MultimodalVideoTag? This part seems not to be implemented in the repo
Thanks for such great work! When I run the code, a problem occurs: how are the features (such as VGGish or ResNet) used by MultimodalVideoTag generated? This part seems not to be implemented in the repo, so a customized dataset cannot be run correctly. Is there any example showing how to run the repo end to end?
You can refer to the feature-extraction part of FootballAction:
https://github.com/PaddlePaddle/PaddleVideo/tree/develop/applications/FootballAction#step14--基于pp-tsm的视频特征提取
Thank you for your reply! I find that the checkpoint of the PP-TSM model may have been trained on the football dataset (maybe I am wrong :-)). Can I reuse it to extract image features from other datasets? If not, will PaddlePaddle provide pretrained weights for ResNet and VGGish? Another issue is that the dimension of the VGGish features used in the PP-TSM project does not match what MultimodalVideoTag requires.
Also, could you tell me which architecture and layer (e.g., ResNet-50 or ResNet-101, the avg-pool layer or layer4) of ResNet and VGGish you use to extract features? A sketch of my current assumption follows below.
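For reference, here is a minimal sketch of what I am currently assuming for the image side: per-frame 2048-d features taken from the global average-pool output of an ImageNet-pretrained ResNet-50 in PaddlePaddle. The layer choice and preprocessing here are my guesses, not something confirmed by the repo:

```python
# Assumption: 2048-d per-frame features from ResNet-50's avg-pool output.
import paddle
import paddle.vision.transforms as T
from paddle.vision.models import resnet50

model = resnet50(pretrained=True)
model.fc = paddle.nn.Identity()  # drop the classifier; keep the pooled 2048-d vector
model.eval()

# Standard ImageNet preprocessing (also an assumption).
preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@paddle.no_grad()
def frame_features(frames):
    """frames: list of HxWx3 uint8 RGB arrays sampled from one video."""
    batch = paddle.stack([preprocess(f) for f in frames])
    return model(batch)  # shape: [num_frames, 2048]
```

Is this roughly what the released features were extracted with, or does the official pipeline differ (e.g., a different layer or sampling scheme)?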
Could you provide a pipeline that covers extracting these basic features?
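On the audio side, my current workaround (again an assumption, not the repo's pipeline) is the community torchvggish port of the AudioSet VGGish model, which yields one 128-d embedding per ~0.96 s of audio; `example_audio.wav` is a placeholder path:

```python
# Assumption: 128-d VGGish audio embeddings via the harritaylor/torchvggish hub model.
import torch

model = torch.hub.load('harritaylor/torchvggish', 'vggish')  # downloads AudioSet weights
model.eval()

# One 128-d embedding per ~0.96 s segment; shape: [num_segments, 128].
embeddings = model.forward('example_audio.wav')
```

If MultimodalVideoTag expects a different dimension or a different VGGish layer, a pointer to the exact extraction setup would be very helpful.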