AdaTAD works worse than expected on IKEA ASM dataset
Hi, I tried to train an AdaTAD model on the IKEA ASM dataset, following the THUMOS config with the VideoMAE-Base backbone.
The final epoch output is:
2024-07-15 09:17:18 Train INFO: [Train]: [059][00050/00126] Loss=0.5143 cls_loss=0.2856 reg_loss=0.2287 lr_backbone=3.9e-05 lr_det=3.9e-05 mem=4993MB
2024-07-15 09:22:19 Train INFO: [Train]: [059][00100/00126] Loss=0.5090 cls_loss=0.2780 reg_loss=0.2310 lr_backbone=3.8e-05 lr_det=3.8e-05 mem=4993MB
2024-07-15 09:25:03 Train INFO: [Train]: [059][00126/00126] Loss=0.5022 cls_loss=0.2728 reg_loss=0.2294 lr_backbone=3.8e-05 lr_det=3.8e-05 mem=4993MB
The evaluation result is:
2024-07-15 09:32:55 Train INFO: Evaluation starts...
2024-07-15 09:32:57 Train INFO: Loaded annotations from validation subset.
2024-07-15 09:32:57 Train INFO: Number of ground truth instances: 0
2024-07-15 09:32:57 Train INFO: Number of predictions: 234000
2024-07-15 09:32:57 Train INFO: Fixed threshold for tiou score: [0.3, 0.4, 0.5, 0.6, 0.7]
2024-07-15 09:32:57 Train INFO: Average-mAP: nan (%)
2024-07-15 09:32:57 Train INFO: mAP at tIoU 0.30 is nan%
2024-07-15 09:32:57 Train INFO: mAP at tIoU 0.40 is nan%
2024-07-15 09:32:57 Train INFO: mAP at tIoU 0.50 is nan%
2024-07-15 09:32:57 Train INFO: mAP at tIoU 0.60 is nan%
2024-07-15 09:32:57 Train INFO: mAP at tIoU 0.70 is nan%
2024-07-15 09:32:57 Train INFO: Training Over...
Running inference on a test video, I marked the actions at the bottom of each frame (first bar is GT, second bar is predictions). From the snapshot below, most of the predicted actions are wrong.
When processing the dataset, I only removed the 'NA' label; no other preprocessing was done. Any ideas on how to improve this?
Hi @tongda, please check your ground-truth JSON file.
The log line `2024-07-15 09:32:57 Train INFO: Number of ground truth instances: 0` indicates that no ground-truth actions were loaded, which is why the mAP is nan.
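A quick way to sanity-check the file is to count instances per subset. The snippet below is only a sketch and assumes an ActivityNet-style layout; adjust the keys and path to your own file:

```python
import json
from collections import Counter

# Sketch only: assumes an ActivityNet-style annotation layout
# ({"database": {video_id: {"subset": ..., "annotations": [...]}}});
# the path is a placeholder.
with open("data/ikea_asm/annotations.json") as f:
    database = json.load(f)["database"]

instances = Counter()
for video_id, info in database.items():
    instances[info.get("subset", "unknown")] += len(info.get("annotations", []))

# The subset names must match what the evaluation config expects
# (e.g. "validation"); zero instances there explains the nan mAP.
print(instances)
```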
Yes, you are right. I fixed it and reran test.py. Here is the result:
2024-07-16 23:40:32 Test INFO: Loaded annotations from testing subset.
2024-07-16 23:40:32 Test INFO: Number of ground truth instances: 1855
2024-07-16 23:40:32 Test INFO: Number of predictions: 234000
2024-07-16 23:40:32 Test INFO: Fixed threshold for tiou score: [0.3, 0.4, 0.5, 0.6, 0.7]
2024-07-16 23:40:32 Test INFO: Average-mAP: 39.07 (%)
2024-07-16 23:40:32 Test INFO: mAP at tIoU 0.30 is 50.29%
2024-07-16 23:40:32 Test INFO: mAP at tIoU 0.40 is 46.92%
2024-07-16 23:40:32 Test INFO: mAP at tIoU 0.50 is 40.58%
2024-07-16 23:40:32 Test INFO: mAP at tIoU 0.60 is 34.26%
2024-07-16 23:40:32 Test INFO: mAP at tIoU 0.70 is 23.32%
2024-07-16 23:40:32 Test INFO: Testing Over...
This makes sense now. However, I would like to push it higher; any suggestions for improving it?
Typically, you can optimize the following 4 hyper-parameters for better performance in end-to-end training.
- the number of feature pyramid levels.
- the weight of regression loss.
- the number of training epochs.
- the learning rate for the adapter.
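For concreteness, these knobs usually live in the Python config you copied from THUMOS. The sketch below is illustrative only; the key names are assumptions rather than the exact AdaTAD config schema, so locate the corresponding fields in your own config:

```python
# Illustrative only: these key names are assumptions, not the exact AdaTAD
# config schema; find the corresponding fields in your THUMOS-based config.

# 1) Feature pyramid levels: more levels give a larger temporal receptive field.
neck = dict(type="FPN", num_levels=6)          # hypothetical key; try 5-7

# 2) Regression vs. classification loss weighting.
loss_weights = dict(cls=1.0, reg=1.0)          # try reg in the range 0.5-2.0

# 3) Training schedule: more epochs on a small dataset risks overfitting.
max_epochs = 60                                 # e.g. sweep 40 / 60 / 100

# 4) Separate learning rate for the adapter (backbone) and the detector head.
optimizer = dict(
    type="AdamW",
    lr=1e-4,                                    # detector head
    backbone_lr=5e-5,                           # hypothetical: adapter/backbone
)
```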
Let me try to understand these hyper-parameters; please correct me if I am wrong.
Typically, you can optimize the following 4 hyper-parameters for better performance in end-to-end training.
- the number of feature pyramid levels -> more levels mean a larger receptive field along the time axis, so if actions are long, set this to a larger value.
- the weight of the regression loss -> a larger regression loss weight means the model tries to learn the start/end times more accurately, but it may hurt the classification loss?
- the number of training epochs -> more epochs generally give a better model, but with a risk of overfitting.
- the learning rate for the adapter -> a larger learning rate makes the model converge faster, but it may get stuck in a poor local optimum.
Perfect! Your understanding is completely correct. Since these are hyper-parameters, we need to search over them to find the optimal setting for a new dataset.
Thanks for your patience.
I tried an input size of 224. Here is the last-epoch log:
2024-07-17 15:13:51 Train INFO: [Train]: [059][00050/00126] Loss=0.2956 cls_loss=0.1436 reg_loss=0.1519 lr_backbone=3.9e-05 lr_det=3.9e-05 mem=7642MB
2024-07-17 15:19:24 Train INFO: [Train]: [059][00100/00126] Loss=0.2961 cls_loss=0.1430 reg_loss=0.1531 lr_backbone=3.8e-05 lr_det=3.8e-05 mem=7642MB
2024-07-17 15:22:14 Train INFO: [Train]: [059][00126/00126] Loss=0.2924 cls_loss=0.1407 reg_loss=0.1518 lr_backbone=3.8e-05 lr_det=3.8e-05 mem=7642MB
And the evaluation result:
2024-07-17 15:31:35 Train INFO: Number of ground truth instances: 1855
2024-07-17 15:31:35 Train INFO: Number of predictions: 234000
2024-07-17 15:31:35 Train INFO: Fixed threshold for tiou score: [0.3, 0.4, 0.5, 0.6, 0.7]
2024-07-17 15:31:35 Train INFO: Average-mAP: 39.18 (%)
2024-07-17 15:31:35 Train INFO: mAP at tIoU 0.30 is 51.95%
2024-07-17 15:31:35 Train INFO: mAP at tIoU 0.40 is 46.34%
2024-07-17 15:31:35 Train INFO: mAP at tIoU 0.50 is 41.52%
2024-07-17 15:31:35 Train INFO: mAP at tIoU 0.60 is 33.16%
2024-07-17 15:31:35 Train INFO: mAP at tIoU 0.70 is 22.95%
There's still a lot of room for improvement. I will leave this issue open and keep updating it with results from different hyper-parameter experiments.
I made some changes:
- set the number of epochs to 100;
- add a `center_crop` transform after the `decord_init` transform, because the videos are 1920x1080 and most actions occur at the center (roughly as in the sketch below);
- change the resolution to 224x224.
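The pipeline change is roughly the following sketch (mmaction2-style transform dicts; the exact type strings and their placement in the actual AdaTAD config may differ):

```python
# Sketch only: the transform names follow mmaction2-style pipelines; check the
# real THUMOS config for the exact types and ordering.
train_pipeline = [
    dict(type="DecordInit"),                                  # decord_init
    # ... frame-sampling transform from the original config goes here ...
    dict(type="DecordDecode"),
    dict(type="CenterCrop", crop_size=1080),                  # keep the central 1080x1080 region
    dict(type="Resize", scale=(224, 224), keep_ratio=False),  # 224x224 input
    dict(type="FormatShape", input_format="NCTHW"),
]
```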
The training curve looks great: it converges fast at first and keeps going down.
However, the evaluation does not look very good. The best average mAP (40.88) is at epoch 40, which is the first evaluation; the lower the loss gets, the worse the mAP becomes.
The full log is here: log.txt
I wonder whether mAP may not reflect the actual performance. What do you think?
I visualized one of the test videos, keeping predictions with score > 0.3. Actions are drawn at the bottom: the first line is GT and the others are predicted actions.
I feel the result makes sense. Some of the wrong predictions are confusions like "pick up back panel" vs. "pick up side panel". I think the result is better than the previous training run.
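A simplified timeline version of this overlay can be drawn like this (the detection JSON layout with segment/label/score keys and all paths are placeholders, not the actual OpenTAD output format):

```python
import json
import matplotlib.pyplot as plt

# Placeholder detection output: list of {"segment": [start, end], "label", "score"}.
dets = [d for d in json.load(open("outputs/detections.json")) if d["score"] > 0.3]
gt = [[3.0, 7.5], [10.2, 14.0]]  # placeholder ground-truth segments in seconds

fig, ax = plt.subplots(figsize=(12, 2))
ax.broken_barh([(s, e - s) for s, e in gt], (1.0, 0.8), color="tab:green")
ax.broken_barh([(d["segment"][0], d["segment"][1] - d["segment"][0]) for d in dets],
               (0.0, 0.8), color="tab:red")
ax.set_yticks([0.4, 1.4], labels=["pred", "GT"])
ax.set_xlabel("time (s)")
plt.show()
```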
Thank you for the update!
- Best validation loss may not correspond to best mAP: yes. Particularly with end-to-end trained ActionFormer, the best mAP tends to occur at a middle epoch when training for longer. This is related to the cosine scheduler setting of the optimizer (a tiny sketch of that schedule follows this list).
- About visualization: your visualization result is pretty good, and it makes sense given the ambiguity of the annotated action boundaries and the complicated action categories.
- To further improve the results: since some actions may not have been seen when the backbone was pretrained, and some actions are very similar (such as the pick-up back panel / pick-up side panel), you can also consider fine-tuning the backbone (such as VideoMAE) on the action recognition task using your own dataset. This approach is usually very effective on Ego4D/Epic-Kitchens, since the pretrained videos and the new videos are very different.
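For illustration, here is a tiny standalone sketch of the cosine decay (all numbers are made up): the learning rate keeps shrinking toward the final epochs, so the training loss keeps decreasing even after the detection mAP has already peaked.

```python
import math

# Cosine learning-rate schedule, illustrative values only.
base_lr, min_lr, max_epochs = 1e-4, 0.0, 100
for epoch in (0, 20, 40, 60, 80, 99):
    lr = min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * epoch / max_epochs))
    print(f"epoch {epoch:3d}: lr = {lr:.2e}")
```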
@sming256 I am confused about the fine-tuning.
..., you can also consider fine-tuning the backbone (such as VideoMAE) on the action recognition task using your own dataset.
What does the "action recognition task" means? Should I generate clips for each action for fine-tuning backbone? Or just unfreeze the back bone in the end-to-end training process?
Well, full fine-tuning will cost much more GPU mem than I can afford (I only have 4090).
From the paper, I found two points:
Using K400 for pretraining, we observe that end-to-end TAD training allows for +5.56 gain. Conversely, using a model already finetuned on EPICKitchens still yields a +2.32 improvement.
More importantly, when we apply the full finetuning on Ego4d-MQ, no performance gain was observed (27.01% mAP).
Maybe I should change the backbone to InternVideo?
- Fine-tuning the backbone on the action recognition task: just as is commonly done on EPIC-Kitchens, given a TAD dataset you can trim only the foreground actions out of the long videos. Each action can then be treated as a clip, resulting in an action recognition dataset (see the sketch after this list). This is usually very helpful if the pretraining dataset has a large domain gap with the downstream detection dataset. Once you have this fine-tuned backbone, you can use AdaTAD to further tune it on the detection task.
- Changing the backbone to InternVideo1 is an option, but I would guess its performance is similar to VideoMAE-L. InternVideo2 is too large for end-to-end training so far.
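A rough sketch of the clip-trimming step mentioned above (all paths and key names are placeholders; it assumes an ActivityNet-style annotation JSON and ffmpeg on PATH):

```python
import json
import subprocess
from pathlib import Path

# Sketch only: adapt paths, subset names, and annotation keys to your dataset.
database = json.load(open("data/ikea_asm/annotations.json"))["database"]
video_dir = Path("data/ikea_asm/videos")
clip_dir = Path("data/ikea_asm/recognition_clips")
clip_dir.mkdir(parents=True, exist_ok=True)

label_lines = []
for vid, info in database.items():
    if info.get("subset") != "training":
        continue
    for idx, ann in enumerate(info.get("annotations", [])):
        start, end = ann["segment"]                     # seconds
        clip_path = clip_dir / f"{vid}_{idx:04d}.mp4"
        # Re-encode so the cut is frame-accurate rather than keyframe-aligned.
        subprocess.run(
            ["ffmpeg", "-y", "-i", str(video_dir / f"{vid}.mp4"),
             "-ss", str(start), "-to", str(end),
             "-c:v", "libx264", "-an", str(clip_path)],
            check=True,
        )
        label_lines.append(f"{clip_path.name} {ann['label']}")

# Simple clip/label list; map label names to indices as your recognition
# codebase expects.
(clip_dir / "train_list.txt").write_text("\n".join(label_lines))
```

The resulting clips and label list can then be fed to a standard recognition codebase (e.g. mmaction2) to fine-tune VideoMAE, and the fine-tuned weights plugged back into AdaTAD.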
Closed due to inactivity.
Hello, @sming256. May I ask for your help with my custom dataset scenario? There are only 2 classes: normal and stealing. My task is to detect the stealing action (which could last from 1 s to 7 s) of each person in the video. I wonder which task formulation is more suitable for this scenario. I have two options in mind:
- My best bet is that it is a video action classification task: as a pre-processing stage I prepare tracklets (16 or 32 spatio-temporal frames) for each person (bbox), then send each tracklet to the model to classify it as normal or stealing. If any person's tracklets are classified as stealing a predefined number of consecutive times, I trigger the alarm.
- Alternatively, I track each person and send 768 (or 1536) frames of the corresponding spatio-temporal bbox data to the AdaTAD model to detect whether it is stealing or normal.
In terms of resource efficiency, I believe video action classification is the better choice. So can I use AdaTAD as a video action classifier at some point in the process?
Overall, how do you recommend I go about preparing my dataset for the above scenario? Thank you for your time.
Hi @bit-scientist , thanks for your question.
For your dataset, I think it is more suitable to consider it directly as an action classification task, like your solution 1. In this case, maybe you do not need a temporal detection model such as AdaTAD. You can use SlowFast or VideoMAE to classify the streaming tracklets, and get the stealing segments based on some manually designed rules.
If the dataset has more action categories with various durations, a detection model would be more suitable.
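As a rough illustration of such a rule (everything below is hypothetical: `classify_tracklet` stands in for whatever SlowFast/VideoMAE inference you use, and the threshold is arbitrary):

```python
from collections import defaultdict

ALARM_AFTER = 3  # consecutive "stealing" tracklets required to raise an alarm

def classify_tracklet(frames) -> str:
    """Placeholder: return "stealing" or "normal" for a 16/32-frame person crop."""
    raise NotImplementedError

streak = defaultdict(int)  # person_id -> current run of "stealing" predictions

def update(person_id: int, frames) -> bool:
    """Feed one tracklet for a tracked person; return True when the alarm fires."""
    if classify_tracklet(frames) == "stealing":
        streak[person_id] += 1
    else:
        streak[person_id] = 0
    if streak[person_id] >= ALARM_AFTER:
        streak[person_id] = 0  # reset after triggering
        return True
    return False
```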
@sming256, thank you for the reply, I appreciate it. I believe you're right; an action classification task seems feasible. In fact, I already trained a VideoMAE model using its HuggingFace API, but due to VideoMAE's license requirements I can no longer use it. I also considered VideoMAE-v2, but its ViT-H model on Kinetics-400 only gains about +2% Top-1 while using 17.88 TFLOPs: