Singapore-Maritime-Dataset-Frames-Ground-Truth-Generation-and-Statistics
Frame and annotation matching
Hi! I have some questions regarding the matching of annotation files to the generated frames.
I have been working on the VIS videos and annotations. After generating the frames for all onboard and onshore videos with existing GTs, I started generating the annotations as xml files from the ObjectGT.mat files, based on a slightly modified version of your script https://github.com/tilemmpon/Singapore-Maritime-Dataset-Frames-Ground-Truth-Generation-and-Statistics/blob/master/load_mat_into_csv_xml.py.
Similarly, I have been using your script for generating ALL the frames with existing GTs (4 onboard and 36 onshore videos): https://github.com/tilemmpon/Singapore-Maritime-Dataset-Frames-Ground-Truth-Generation-and-Statistics/blob/master/Singapore_dataset_frames_generation_and_histograms.ipynb
Here comes my issue.
Nowhere in your scripts or notebooks can I find an actual matching of annotation files to generated frames. This would not be a problem if each video generated exactly as many frames as are annotated in its GT .mat file. However, I have found that this is not the case when using the default cv2.VideoCapture (as in your script).
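For reference, a minimal sketch of the counting involved (the paths are placeholders, the `structXML` key is my reading of the ObjectGT .mat layout used in your conversion script, and `frame_gt_discrepancy` is just my own helper name):

```python
def count_decodable_frames(video_path):
    """Count frames by decoding the whole video; cv2.CAP_PROP_FRAME_COUNT
    reads only the container header and can disagree with this count."""
    import cv2  # imported lazily so the pure helper below runs without OpenCV
    cap = cv2.VideoCapture(video_path)
    n = 0
    while cap.read()[0]:
        n += 1
    cap.release()
    return n

def count_gt_rows(mat_path):
    """Number of ground-truth rows in an SMD ObjectGT .mat file
    (assumes the struct array is stored under the 'structXML' key)."""
    import scipy.io
    gt = scipy.io.loadmat(mat_path)
    return gt['structXML'].shape[1]  # one struct entry per annotated frame

def frame_gt_discrepancy(n_frames, n_gt_rows):
    """Positive result = more GT rows than decodable frames."""
    return n_gt_rows - n_frames

# e.g. for MVI_0790_VIS_OB: 600 decoded frames vs. 1010 GT rows
print(frame_gt_discrepancy(600, 1010))  # -> 410
```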
Please see my notebook for details (https://nbviewer.jupyter.org/github/landskris/Semester-project-detectron2-models/blob/master/frame_generation_test_issue.ipynb)
Specifically:
- Onboard MVI_0790_VIS_OB generates 410 fewer frames than annotation files. MVI_0799_VIS_OB generates 1 fewer frame than annotation files.
- Onshore MVI_1584_VIS generates 11 fewer frames than annotation files.
The largest discrepancy is definitely MVI_0790_VIS_OB, which generates 600 frames from its 20 s video while its .mat file contains 1010 GT rows. From your objects_onboard.txt file and my own scripts, I can only find 597 non-empty annotation rows among those GT rows.
I do not know how to handle this problem currently. I see no logic in changing the FPS in cv2.VideoCapture for this particular video to "force" it to generate 1010 frames. Similarly, matching frames to annotations using only the non-empty GT rows would still leave a discrepancy of 3 frames (600 - 597), and it would be unclear which 3 frames in the sequence should have empty annotations.
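To make the ambiguity concrete, here is a small sketch (helper name is my own) that partitions GT rows into non-empty and empty by their box lists; the open question is which 3 of the empty indices fall inside the 600 available frames:

```python
def split_gt_rows(rows):
    """Partition ground-truth rows by whether they contain any boxes.

    rows: one entry per annotated frame, each a (possibly empty) list
    of bounding boxes. Returns (nonempty_indices, empty_indices).
    """
    nonempty = [i for i, r in enumerate(rows) if len(r) > 0]
    empty = [i for i, r in enumerate(rows) if len(r) == 0]
    return nonempty, empty

# Toy example: 5 GT rows, two of them without any objects.
rows = [[(10, 20, 30, 40)], [], [(5, 5, 50, 60), (1, 2, 3, 4)], [], [(0, 0, 9, 9)]]
nonempty, empty = split_gt_rows(rows)
print(len(nonempty), empty)  # -> 3 [1, 3]
```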
Have you handled this problem somehow which I might have missed?
I would strongly appreciate any answers or tips in handling this, as it is currently a severe bottleneck for my project. Thanks!
Hello, thank you for your interest in my repository.
Concerning your issue: when I first wrote this code I manually checked a number of randomly selected video frames by drawing the respective objects from the ground truth onto them. The annotated objects matched the objects in the image, hence my assumption that the annotated ground truth starts from the first frame. I have not done anything more systematic than this, so a video might indeed have such a problem. The videos were captured at 30 FPS (see the SMD website), so generating frames at another frame rate (your hypothesis 1) would not be correct.
At this point I must point out that there are frames (especially in on-board videos) that have no ground truth entry in the .mat files, because those frames contain no objects. As a result, there is no 1-to-1 mapping between the frames and the ground truth. For the issue you report, namely more ground truth entries than video frames, I believe the original video must have had more frames, which were annotated, but for some reason the released version of the video is shorter (*). This results in ground truth entries that do not correspond to any frame. My assumption has been that the "cropping" of the video took place at the end, so I used only the ground truth up to the last available frame. What I would suggest is to check these videos in the following way:
- Generate the frames
- Annotate each frame with the corresponding ground truth (assuming that the first GT matches the first frame)
- Manually check if the annotated objects match the actual objects seen in the frame.
If they match then just use the ground truth up to the frame you have available. If not, then:
- Try to synchronize the frames to the GT
- Re-generate the GT for these videos manually yourself.
- Consider not using these videos.
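The manual check described above could look like the sketch below (assuming [x, y, w, h] boxes as in the ObjectGT bounding-box field; `draw_gt` and `xywh_to_corners` are hypothetical helper names, and OpenCV is imported lazily so the geometry helper stands on its own):

```python
def xywh_to_corners(box):
    """Convert an [x, y, w, h] box to integer top-left / bottom-right corners."""
    x, y, w, h = box
    return (int(x), int(y)), (int(x + w), int(y + h))

def draw_gt(frame, boxes, color=(0, 255, 0)):
    """Draw each ground-truth box on the frame for visual inspection."""
    import cv2  # imported lazily so xywh_to_corners works without OpenCV
    for box in boxes:
        p1, p2 = xywh_to_corners(box)
        cv2.rectangle(frame, p1, p2, color, 2)
    return frame

print(xywh_to_corners((10, 20, 30, 40)))  # -> ((10, 20), (40, 60))
```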
Please keep me updated on how you proceed. Unfortunately, I am currently unable to investigate this further in my code due to time constraints.
Best regards, Tilemmpon
(*) The only way to actually find why there is this mismatch is to contact the authors of the SMD directly.
Thank you very much for your thorough reply, I will do a further investigation and update you on my findings. I also agree that the frame-rate approach suggested is not logical and can be discarded.
@landskris were you able to find anything related to this issue? I'm seeing a ton of matching issues as well.
Closed as completed since no further feedback has been received.