Learning-Action-Completeness-from-Points

about thumos14 label

Open menghuaa opened this issue 3 years ago • 17 comments

Hello, in THUMOS'14, CliffDiving is a subclass of Diving, and the CliffDiving action instances in the annotation file also belong to Diving. Why don't you use this prior knowledge, removing the CliffDiving instances from the Diving class during training and adding a Diving label to each predicted CliffDiving instance during post-processing? I think an action instance belonging to two categories may make the training difficult to converge.

menghuaa avatar Apr 27 '22 01:04 menghuaa
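
For reference, a minimal sketch of the post-processing step described in the question: each predicted CliffDiving instance is duplicated as a Diving prediction, so no instance needs two labels during training. The prediction format (dicts with 'label' keys) is an assumption for illustration, not this repository's actual data structure.

```python
# Hypothetical post-processing: duplicate CliffDiving predictions as Diving.
# The prediction format (dicts with 'label', 'start', 'end', 'score') is an
# assumption for illustration, not this repository's actual structure.
def expand_cliffdiving(predictions):
    expanded = list(predictions)
    for pred in predictions:
        if pred["label"] == "CliffDiving":
            diving = dict(pred)          # copy the instance
            diving["label"] = "Diving"   # every CliffDiving is also a Diving
            expanded.append(diving)
    return expanded
```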

Thanks for your suggestion!

In fact, I have noticed some papers on fully-supervised temporal action localization that use such a label engineering technique.

However, to my knowledge, existing weakly-supervised approaches do not use it.

Therefore, to keep the comparison with previous works fair, we did not adopt it, although it may bring some performance gains.

Pilhyeon avatar Apr 29 '22 09:04 Pilhyeon

Thanks for your reply. For the point annotations of THUMOS'14, SF-Net provides four annotation files. Are these four files manually annotated? And are the THUMOS'14 point annotations, uniformly sampled from the ground truth as mentioned in your paper, generated by yourselves or provided by another paper?

menghuaa avatar Apr 29 '22 11:04 menghuaa

As I have stated in the paper, we used the automatically generated point-level labels that are provided by Moltisanti et al. (CVPR'19).

The point-level labels can be found on their project page, specifically the 'train_df_ts_in_gt.csv' file.

Pilhyeon avatar May 02 '22 04:05 Pilhyeon

In the paper, you perform experiments comparing different label distributions: Manual, Uniform, and Gaussian. Where did you get the manual and uniform labels?

menghuaa avatar May 02 '22 06:05 menghuaa

The manual labels are provided by SF-Net, while the Uniform-distributed labels are generated using ground-truth intervals in the dataset construction stage before the training starts.

Pilhyeon avatar May 02 '22 07:05 Pilhyeon
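
As a rough sketch of how Uniform-distributed point labels could be drawn from ground-truth intervals in the dataset construction stage (the tuple format and seeding are illustrative assumptions, not the repository's exact code):

```python
import random

# Sketch: draw one point label per ground-truth action instance,
# uniformly at random within its [start, end] interval (in seconds).
# The (start, end, label) tuple format is an assumption for illustration.
def sample_uniform_points(gt_instances, seed=0):
    rng = random.Random(seed)  # fixed seed keeps the generated dataset reproducible
    return [(rng.uniform(start, end), label) for start, end, label in gt_instances]
```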

I see SF-Net, but it provides four single-frame text files. Are these four files manually annotated? Do you use one of the txt files?

menghuaa avatar May 02 '22 07:05 menghuaa

All four files contain manual annotations from different annotators. For selection, we followed the official SF-Net code, which randomly chooses an annotator id for each video in the dataset construction stage.

Pilhyeon avatar May 02 '22 08:05 Pilhyeon
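
A hedged sketch of that selection scheme, assuming the four annotators' labels are stored in a list indexed by annotator id (the lookup structure and function name are hypothetical):

```python
import random

# Sketch: following the SF-Net-style selection described above, pick one of
# the four annotators at random for each video when building the dataset.
# 'annotations[annotator_id][video_id]' is a hypothetical lookup structure.
def select_annotations(video_ids, annotations, num_annotators=4, seed=0):
    rng = random.Random(seed)
    return {vid: annotations[rng.randrange(num_annotators)][vid] for vid in video_ids}
```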

Thanks for your reply. Have you noticed that in the annotation file THUMOS2.txt, videos of the CliffDiving class are not also marked with the parent class Diving? In the other annotation files, CliffDiving videos still belong to the parent class Diving.

menghuaa avatar May 02 '22 09:05 menghuaa

I am not sure whether any papers reduce the CliffDiving class to the Diving class. An example of the opposite case is the W-TALC codebase, which is widely used as the baseline for many other works. You may check how others handle it by navigating their code links here.

Pilhyeon avatar May 03 '22 05:05 Pilhyeon

Hi, I find that the split_test.txt you provide lacks three videos, for example video_test_0000270. I would like to know the reason.

menghuaa avatar May 11 '22 07:05 menghuaa

I followed the implementation of STPN, where it is mentioned that the test split of THUMOS'14 is the same as SSN's.

In the SSN paper, the authors mentioned that "2 falsely annotated videos (“270”,“1496”) in the test set are excluded in evaluation" and they used only 210 testing videos for evaluation.

Pilhyeon avatar May 11 '22 08:05 Pilhyeon
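
In code form, the exclusion amounts to a simple filter over the test split; the two video ids below come from the SSN paper quoted above:

```python
# Sketch: drop the falsely annotated test videos noted in the SSN paper.
EXCLUDED = {"video_test_0000270", "video_test_0001496"}

def filter_test_split(video_ids):
    return [vid for vid in video_ids if vid not in EXCLUDED]
```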

Thank you very much.

menghuaa avatar May 11 '22 08:05 menghuaa

Hello, I'm sorry to bother you. I am a beginner and would like to ask why some fully-supervised methods, such as ActionFormer, use feature lengths that are inconsistent with the features you provide. Is it because I3D uses different sampling rates when extracting features?

daidaiershidi avatar Mar 15 '23 08:03 daidaiershidi

The feature lengths depend on the sampling rate and the total number of frames. ActionFormer adopts a smaller stride of 4 (vs. 16 for ours) with a video fps of 30 (vs. 25 for ours).

Pilhyeon avatar Mar 15 '23 11:03 Pilhyeon
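
To make the dependence concrete, a rough sketch of how the feature count follows from fps and stride (the floor arithmetic is an approximation; exact boundary handling varies across extractors):

```python
import math

# Rough sketch: the number of extracted snippet features for a video,
# given its duration, the sampling fps, and the frame stride per feature.
# Exact boundary handling (padding, window size) differs across extractors.
def num_features(duration_sec, fps, stride):
    return math.floor(duration_sec * fps / stride)

# Example: a 60-second video.
print(num_features(60, fps=25, stride=16))  # -> 93 (this repo's setup)
print(num_features(60, fps=30, stride=4))   # -> 450 (ActionFormer-style)
```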

Hello, I would like to ask: the point label is frame-level, while the video is divided into 16-frame segments. So how is the point-level classification loss applied? One is a frame and the other is a segment. Looking forward to your reply.

wj0323i avatar Feb 28 '24 06:02 wj0323i

The segment within which the labeled point (frame) falls is utilized as the positive sample for the point-level loss.

Pilhyeon avatar Feb 28 '24 09:02 Pilhyeon
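
In code form, a minimal sketch of that mapping, assuming 16 frames per segment:

```python
# Sketch: map a frame-level point label to the index of the 16-frame
# segment that contains it; that segment serves as the positive sample
# for the point-level classification loss.
FRAMES_PER_SEGMENT = 16

def point_to_segment(point_frame):
    return point_frame // FRAMES_PER_SEGMENT

# Example: a point annotated at frame 100 falls into segment 6.
print(point_to_segment(100))  # -> 6
```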

Thank you for your reply, and I wish you a happy life!

wj0323i avatar Feb 29 '24 07:02 wj0323i