TRACE
Performance very different from Action Genome baselines
Thanks for sharing the nice work!
However, I find that the performance presented in the paper differs substantially from the methods in "Action Genome: Actions as Compositions of Spatio-temporal Scene Graphs" and "Detecting Human-Object Relationships in Videos". What causes this difference?
Actually, when we began this project, we could not reproduce the performance reported in "Action Genome: Actions as Compositions of Spatio-temporal Scene Graphs"; our model consistently outperformed theirs by a large margin.
This is not an isolated case. You may refer to "Spatial-Temporal Transformer for Dynamic Scene Graph Generation", where a similar phenomenon is reported.
As for "detecting human-object relationships in videos", I haven't read it yet. I'll reply if I find some clues.
OK, thanks for replying.
I can think of one possible explanation. AG is essentially a human-object interaction (HOI) dataset, yet SGG metrics such as PredCls and SGCls enumerate all possible object pairs (e.g. <shoe, bed> in a scene containing a person, a shoe, and a bed).
Our setting, in contrast, follows the RelDN repo's setting for the AG dataset and restricts the subject to the person; we do not apply this restriction for VidVRD.
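To make the difference concrete, here is a minimal sketch (with hypothetical function names and a toy object list, not code from either repo) contrasting the two enumeration schemes: SGG-style metrics such as PredCls/SGCls score every ordered object pair, while the RelDN-style AG setting only scores pairs whose subject is the person.

```python
from itertools import permutations

def all_pairs(objects):
    """SGG-style enumeration: every ordered pair of distinct objects is a
    candidate relation, e.g. <shoe, bed> even though neither is a person."""
    return [(s, o) for s, o in permutations(range(len(objects)), 2)]

def person_subject_pairs(objects, person_label="person"):
    """HOI-style enumeration: only pairs whose subject is the person are scored."""
    return [(s, o)
            for s, o in permutations(range(len(objects)), 2)
            if objects[s] == person_label]

objects = ["person", "shoe", "bed"]
print(all_pairs(objects))             # 6 candidate pairs, including (shoe, bed)
print(person_subject_pairs(objects))  # 2 candidate pairs: (person, shoe), (person, bed)
```

With more pairs to rank, recall-style metrics are computed over a larger and noisier candidate set, which alone can shift the reported numbers considerably between the two protocols.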