
AdaTAD works worse than expected on IKEA ASM dataset

Open tongda opened this issue 1 year ago • 11 comments

Hi, I have tried to train an AdaTAD model on the IKEA ASM dataset. I followed the THUMOS config using the VideoMAE-Base model.

The final epoch output is:

2024-07-15 09:17:18 Train INFO: [Train]: [059][00050/00126]  Loss=0.5143  cls_loss=0.2856  reg_loss=0.2287  lr_backbone=3.9e-05  lr_det=3.9e-05  mem=4993MB
2024-07-15 09:22:19 Train INFO: [Train]: [059][00100/00126]  Loss=0.5090  cls_loss=0.2780  reg_loss=0.2310  lr_backbone=3.8e-05  lr_det=3.8e-05  mem=4993MB
2024-07-15 09:25:03 Train INFO: [Train]: [059][00126/00126]  Loss=0.5022  cls_loss=0.2728  reg_loss=0.2294  lr_backbone=3.8e-05  lr_det=3.8e-05  mem=4993MB

The evaluation result is:

2024-07-15 09:32:55 Train INFO: Evaluation starts...
2024-07-15 09:32:57 Train INFO: Loaded annotations from validation subset.
2024-07-15 09:32:57 Train INFO: Number of ground truth instances: 0
2024-07-15 09:32:57 Train INFO: Number of predictions: 234000
2024-07-15 09:32:57 Train INFO: Fixed threshold for tiou score: [0.3, 0.4, 0.5, 0.6, 0.7]
2024-07-15 09:32:57 Train INFO: Average-mAP:  nan (%)
2024-07-15 09:32:57 Train INFO: mAP at tIoU 0.30 is  nan%
2024-07-15 09:32:57 Train INFO: mAP at tIoU 0.40 is  nan%
2024-07-15 09:32:57 Train INFO: mAP at tIoU 0.50 is  nan%
2024-07-15 09:32:57 Train INFO: mAP at tIoU 0.60 is  nan%
2024-07-15 09:32:57 Train INFO: mAP at tIoU 0.70 is  nan%
2024-07-15 09:32:57 Train INFO: Training Over...

Using the model to run inference on a test video, I marked the actions at the bottom of the frame (first bar is GT, second bar is predicted). From the snapshot below, we can see that most of the actions are wrong. [attached screenshot]

When processing the dataset, I removed the 'NA' label; no other extra processing. Any idea how to improve this?

tongda avatar Jul 16 '24 06:07 tongda

Hi @tongda, please check your ground truth JSON file.

I saw 2024-07-15 09:32:57 Train INFO: Number of ground truth instances: 0, indicating that there are no ground-truth actions.
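If it helps, here is a minimal sanity check for the annotation file. It assumes an ActivityNet-style JSON (a "database" dict with per-video "subset" and "annotations" entries) and a hypothetical file path; adjust both to your setup.

```python
# Count ground-truth instances per subset so an empty validation/testing
# subset is caught before training. Path and JSON keys are assumptions.
import json
from collections import Counter

with open("ikea_asm_gt.json") as f:          # hypothetical annotation file
    database = json.load(f)["database"]

counts = Counter()
for video_info in database.values():
    subset = video_info.get("subset", "unknown")
    counts[subset] += len(video_info.get("annotations", []))

print(counts)  # expect non-zero counts for the subset used in evaluation
```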

sming256 avatar Jul 16 '24 15:07 sming256

Yes, you are right. I fixed it and reran test.py. Here is the result:

2024-07-16 23:40:32 Test INFO: Loaded annotations from testing subset.
2024-07-16 23:40:32 Test INFO: Number of ground truth instances: 1855
2024-07-16 23:40:32 Test INFO: Number of predictions: 234000
2024-07-16 23:40:32 Test INFO: Fixed threshold for tiou score: [0.3, 0.4, 0.5, 0.6, 0.7]
2024-07-16 23:40:32 Test INFO: Average-mAP: 39.07 (%)
2024-07-16 23:40:32 Test INFO: mAP at tIoU 0.30 is 50.29%
2024-07-16 23:40:32 Test INFO: mAP at tIoU 0.40 is 46.92%
2024-07-16 23:40:32 Test INFO: mAP at tIoU 0.50 is 40.58%
2024-07-16 23:40:32 Test INFO: mAP at tIoU 0.60 is 34.26%
2024-07-16 23:40:32 Test INFO: mAP at tIoU 0.70 is 23.32%
2024-07-16 23:40:32 Test INFO: Testing Over...

This makes sense now. However, I hope to make it better. Any suggestions for improving it?

tongda avatar Jul 16 '24 15:07 tongda

Typically, you can optimize the following 4 hyper-parameters for better performance in end-to-end training.

  • the number of feature pyramid levels.
  • the weight of regression loss.
  • the number of training epochs.
  • the learning rate for the adapter.
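For reference, here is a rough sketch of where these four knobs typically sit in an OpenTAD-style Python config. The field names and values below are assumptions based on the THUMOS AdaTAD config pattern, not copied from the repo, so check the keys in your own config file before editing:

```python
# Illustrative only -- field names and values are assumptions, not exact OpenTAD keys.

model = dict(
    projection=dict(arch=(2, 2, 5)),   # 1. last entry ~ number of pyramid levels
    rpn_head=dict(loss_weight=1.0),    # 2. weight of the regression loss
)

workflow = dict(end_epoch=60, val_start_epoch=40)  # 3. number of training epochs

optimizer = dict(
    type="AdamW",
    lr=1e-4,                   # detection head learning rate
    backbone=dict(lr=1e-4),    # 4. learning rate for the adapter / backbone
)
```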

sming256 avatar Jul 17 '24 00:07 sming256

Let me try to understand these hyper-parameters. Correct me if I am wrong, please.

Typically, you can optimize the following 4 hyper-parameters for better performance in end-to-end training.

  • the number of feature pyramid levels. -> more levels mean a larger receptive field on the time axis, so if actions are long, set this to a larger value.
  • the weight of regression loss. -> a larger regression loss weight means the model will try to learn the start/end times more accurately, but it may affect the classification loss?
  • the number of training epochs. -> more epochs give a better model, but carry an overfitting risk.
  • the learning rate for the adapter. -> a larger learning rate makes the model converge faster, but it may fall into a local optimum.

tongda avatar Jul 17 '24 04:07 tongda

Perfect! Your understanding is completely correct. Since these are hyper-parameters, we need to search over them to find the optimal setting for a new dataset.

sming256 avatar Jul 17 '24 13:07 sming256

Thanks for your patience.

I have tried an input size of 224. Here is the last-epoch log.

2024-07-17 15:13:51 Train INFO: [Train]: [059][00050/00126]  Loss=0.2956  cls_loss=0.1436  reg_loss=0.1519  lr_backbone=3.9e-05  lr_det=3.9e-05  mem=7642MB
2024-07-17 15:19:24 Train INFO: [Train]: [059][00100/00126]  Loss=0.2961  cls_loss=0.1430  reg_loss=0.1531  lr_backbone=3.8e-05  lr_det=3.8e-05  mem=7642MB
2024-07-17 15:22:14 Train INFO: [Train]: [059][00126/00126]  Loss=0.2924  cls_loss=0.1407  reg_loss=0.1518  lr_backbone=3.8e-05  lr_det=3.8e-05  mem=7642MB

And the evaluation result:

2024-07-17 15:31:35 Train INFO: Number of ground truth instances: 1855
2024-07-17 15:31:35 Train INFO: Number of predictions: 234000
2024-07-17 15:31:35 Train INFO: Fixed threshold for tiou score: [0.3, 0.4, 0.5, 0.6, 0.7]
2024-07-17 15:31:35 Train INFO: Average-mAP: 39.18 (%)
2024-07-17 15:31:35 Train INFO: mAP at tIoU 0.30 is 51.95%
2024-07-17 15:31:35 Train INFO: mAP at tIoU 0.40 is 46.34%
2024-07-17 15:31:35 Train INFO: mAP at tIoU 0.50 is 41.52%
2024-07-17 15:31:35 Train INFO: mAP at tIoU 0.60 is 33.16%
2024-07-17 15:31:35 Train INFO: mAP at tIoU 0.70 is 22.95%

There's still a lot of room for improvement. I will leave this issue open and keep updating it with different hyper-parameter experiments.

tongda avatar Jul 17 '24 14:07 tongda

I made some changes:

  1. set the number of epochs to 100;
  2. added a center_crop transform after the decord_init transform, because the videos are 1920x1080 and most actions occur at the center (crop geometry sketched below);
  3. changed the resolution to 224x224;
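As a standalone illustration of the crop geometry only (this is not the actual OpenTAD pipeline config, just torchvision applied to a dummy frame):

```python
# Crop the 1920x1080 frame to a centered 1080x1080 square, then resize to the
# 224x224 training resolution. Values mirror the changes described above.
import torch
from torchvision import transforms

frame = torch.rand(3, 1080, 1920)                   # dummy C x H x W frame
pipeline = transforms.Compose([
    transforms.CenterCrop(1080),                    # centered square crop
    transforms.Resize((224, 224), antialias=True),  # training resolution
])
print(pipeline(frame).shape)                        # torch.Size([3, 224, 224])
```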

The training curve looks great: it converges fast at first and keeps going down. [attached training-curve screenshot]

However, the evaluation does not look as good: the best avg-mAP (40.88) is at epoch 40, which is the first evaluation, and the lower the loss gets, the worse the mAP becomes. [attached evaluation-curve screenshot]

The full log is attached: log.txt

I wonder whether mAP really reflects the actual performance here. What do you think?

tongda avatar Jul 19 '24 10:07 tongda

I visualized one of the test videos with action score > 0.3. Actions are shown at the bottom: the first line is GT and the others are predicted actions.

[attached visualization screenshot]

I feel the result makes sense. Some of the wrong predictions are confusions such as "pick up back panel" vs. "pick up side panel". I think this result is better than the last training run.

tongda avatar Jul 19 '24 11:07 tongda

Thank you for the update!

  1. Best validation loss may not correspond to best mAP. Yes. Particularly in end-to-end trained ActionFormer, the best mAP often occurs at a middle epoch when training for longer. This issue is related to the cosine scheduler setting of the optimizer (see the scheduler sketch after this list).

  2. About visualization. Your visualization result is pretty good. It makes sense considering the ambiguity of the annotated action boundaries and the complicated action categories.

  3. To further improve the results. Since some actions may not have been seen when the backbone was pretrained, and some actions are very similar, such as pick up back panel / pick up side panel, you can also consider fine-tuning the backbone (such as VideoMAE) on the action recognition task using your own dataset. This approach is usually very effective on the Ego4D/Epic-Kitchens datasets, since the pretraining videos and the new videos are very different.
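Regarding point 1, here is a minimal sketch (plain PyTorch, not the OpenTAD scheduler config; base LR and epoch count are illustrative) showing why mid-training checkpoints still see a fairly high learning rate under a cosine schedule:

```python
# Print the learning rate at a few epochs of a 100-epoch cosine schedule.
# At epoch 40 the LR is still well above its final value, so checkpoints
# evaluated there can differ a lot in mAP from later ones.
import torch

params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.AdamW(params, lr=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(1, 101):
    optimizer.step()   # dummy step; in training this follows the real update
    scheduler.step()
    if epoch in (40, 70, 100):
        print(f"epoch {epoch:3d}: lr = {optimizer.param_groups[0]['lr']:.2e}")
```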

sming256 avatar Jul 20 '24 12:07 sming256

@sming256 I am confused about the fine-tuning.

..., you can also consider fine-tuning the backbone (such as VideoMAE) on the action recognition task using your own dataset.

What does the "action recognition task" mean? Should I generate clips for each action to fine-tune the backbone, or just unfreeze the backbone in the end-to-end training process?

Well, full fine-tuning would cost much more GPU memory than I can afford (I only have a 4090).

From the paper, I found two points:

Using K400 for pretraining, we observe that end-to-end TAD training allows for +5.56 gain. Conversely, using a model already finetuned on EPICKitchens still yields a +2.32 improvement.

More importantly, when we apply the full finetuning on Ego4d-MQ, no performance gain was observed (27.01% mAP).

Maybe I should change the backbone to InternVideo?

tongda avatar Jul 22 '24 10:07 tongda

  1. Fine-tuning the backbone on the action recognition task. Just as is commonly done on EPIC-Kitchens, given a TAD dataset, you can trim only the foreground actions from the long videos. Each action can then be treated as a clip, resulting in an action recognition dataset (see the sketch after this list). This approach is usually very helpful if the pretraining dataset has a large domain gap with the downstream detection dataset. Once you have this fine-tuned backbone, you can use AdaTAD to further tune it on the detection task.

  2. Changing the backbone to InternVideo1 is an option, but I guess the performance would be similar to VideoMAE-L. InternVideo2 is still too large for end-to-end training so far.
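For point 1, here is a rough sketch of how such a recognition set could be cut from the detection annotations. It assumes an ActivityNet-style ground-truth JSON and hypothetical file paths, and shells out to ffmpeg; adapt the keys and paths to your own data:

```python
# Trim each annotated action into its own clip to build a recognition dataset.
# JSON layout ("database" -> video -> "annotations" -> "segment"/"label") and
# the paths below are assumptions.
import json
import os
import subprocess

ANNO_FILE = "ikea_asm_gt.json"        # hypothetical annotation file
VIDEO_DIR = "videos"                  # hypothetical folder of raw videos
CLIP_DIR = "recognition_clips"

with open(ANNO_FILE) as f:
    database = json.load(f)["database"]

os.makedirs(CLIP_DIR, exist_ok=True)
for video_id, info in database.items():
    for idx, ann in enumerate(info.get("annotations", [])):
        start, end = ann["segment"]
        label = ann["label"].replace(" ", "_")
        out_path = os.path.join(CLIP_DIR, f"{video_id}_{idx:04d}_{label}.mp4")
        subprocess.run(
            ["ffmpeg", "-y", "-i", os.path.join(VIDEO_DIR, f"{video_id}.mp4"),
             "-ss", str(start), "-to", str(end),
             "-c:v", "libx264", "-an", out_path],
            check=True,
        )
```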

sming256 avatar Jul 25 '24 15:07 sming256

Closed due to inactivity.

sming256 avatar Sep 14 '24 18:09 sming256

Hello, @sming256. May I ask for your help with my custom dataset scenario? There are only 2 classes: normal and stealing. My task is to detect a stealing action (which could last from 1s to 7s) for each person in the video. I wonder which task formulation is more suitable for this scenario. I have two considerations:

  1. My best bet is that it's a video action classification task: before analyzing the action, I will prepare tracklets (16 or 32 spatio-temporal frames) for each person (bbox) as a pre-processing stage, then send each tracklet to the model to classify it as normal or stealing. If any person's tracklets are classified as stealing a predefined number of consecutive times, I will trigger the alarm.
  2. I track each person and send the corresponding 768 (or 1536) spatio-temporal bbox data to the AdaTAD model to check whether it is stealing or normal.

In terms of resource efficiency, I believe video action classification is the better choice. So can I use AdaTAD as a video action classifier at some point in the process?

Overall, how do you recommend I go about preparing my dataset for the above scenario? Thank you for your time.

bit-scientist avatar Apr 25 '25 02:04 bit-scientist

Hi @bit-scientist , thanks for your question.

For your dataset, I think it is more suitable to consider it directly as an action classification task, like your solution 1. In this case, maybe you do not need a temporal detection model such as AdaTAD. You can use SlowFast or VideoMAE to classify the streaming tracklets, and get the stealing segments based on some manually designed rules.
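If it helps, here is a minimal sketch of the "alarm after K consecutive stealing predictions" rule from option 1. It assumes you already have a per-tracklet classifier (e.g. SlowFast or VideoMAE, as suggested above) that returns a label string; the threshold K and the function names are hypothetical:

```python
# Fire an alarm for a person once K consecutive tracklets are classified
# as "stealing"; any "normal" prediction resets that person's streak.
from collections import defaultdict

K = 3                                   # hypothetical consecutive-hit threshold
consecutive_hits = defaultdict(int)     # person_id -> current streak

def update(person_id: int, prediction: str) -> bool:
    """Feed one tracklet prediction; return True when the alarm should fire."""
    if prediction == "stealing":
        consecutive_hits[person_id] += 1
    else:
        consecutive_hits[person_id] = 0
    return consecutive_hits[person_id] >= K

# usage sketch: alarm = update(track_id, classify_tracklet(clip))
```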

If the dataset has more action categories with various durations, a detection model would be more suitable.

sming256 avatar Apr 25 '25 05:04 sming256

@sming256, thank you for the reply, I appreciate it. I believe you're right; an action classification task seems feasible. In fact, I already trained a VideoMAE model using its HuggingFace API; however, due to VideoMAE's license requirements, I can no longer use it. I also considered VideoMAE-v2, but its ViT-H model on Kinetics-400 gains only about +2% Top-1 while using 17.88 TFLOPs:

[attached screenshot: VideoMAE-v2 Kinetics-400 benchmark table]

bit-scientist avatar Apr 25 '25 07:04 bit-scientist