tubelet-transformer
Questions about the code for JHMDB
Thanks for the great work. I have read the JHMDB code and have some questions:

(1) The [email protected] performance is only 0.72, much lower than the reported 82.3.

(2) The provided JHMDB evaluation code appears to compute frame-mAP rather than video-mAP, because AP is calculated at the frame level rather than the tubelet level.

(3) Although the number of queries is defined as 10*clip_len, only the predictions of the queries corresponding to the intermediate frame (key_pos) are extracted as the final prediction result during both training and testing. In other words, the pipeline behaves more like video object detection: the input is a video clip, but the goal is only to predict the objects and their classes in the middle frame of that clip. I could not find the part of the code that reflects the tubelet property of the so-called tubelet transformer.

In summary, are some configurations wrong in the current code?
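To make point (3) concrete, here is a minimal sketch of the behaviour as I understand it. It is not code from the repository: the function name `select_key_frame_queries`, the frame-major ordering of the queries, and the choice of `clip_len // 2` as the default `key_pos` are all assumptions made for illustration.

```python
def select_key_frame_queries(predictions, clip_len, queries_per_frame=10, key_pos=None):
    """Keep only the predictions that belong to the key frame.

    `predictions` is a flat list of length queries_per_frame * clip_len.
    ASSUMPTION: queries are ordered frame-major, i.e. the first
    `queries_per_frame` entries belong to frame 0, the next to frame 1, etc.
    The real layout in the repository may differ.
    """
    expected = queries_per_frame * clip_len
    if len(predictions) != expected:
        raise ValueError(f"expected {expected} predictions, got {len(predictions)}")
    if key_pos is None:
        key_pos = clip_len // 2  # ASSUMPTION: the key frame is the middle frame
    start = key_pos * queries_per_frame
    return predictions[start:start + queries_per_frame]


# Toy example: each "prediction" is just a (frame_index, query_index) pair.
clip_len = 8
preds = [(frame, q) for frame in range(clip_len) for q in range(10)]
kept = select_key_frame_queries(preds, clip_len)

# Every kept prediction comes from a single frame; the others are dropped.
kept_frames = {frame for frame, _ in kept}
discarded = len(preds) - len(kept)
```

Under these assumptions, only 10 of the 10*clip_len query outputs contribute to the final result, which is why the output looks like per-frame detection rather than a tubelet spanning the clip.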