PETR
Question about position encoding along the depth axis
Hi, thanks for sharing such wonderful work. After reading the paper, I have a possibly naive question: what about setting D=1? As I understand it, for each position in each view's feature map, the position encoding should distinguish which view the position comes from. So why are D depth values needed? Even with D=1, after the coordinate transform using the camera extrinsic matrix, I think it is easy to tell which view a position comes from. So my question is: what is the difference between D = 1 and D > 1? I don't see an ablation study on this.
Hi. In the early stage of our development, we ran an experiment with D=1 (depth = 1.0 m). In this case, the 3D PE encodes the direction vector of a line. The result is about 1%~2% lower than sampling 64 points along the line. In my opinion, sampling points has some advantages: (1) The sampled points are actual spatial coordinates, and the queries are also generated from reference points, which can speed up convergence. If we sample only 1 point, it is difficult to decide which depth value to give it (depth = 1, or an estimated depth). (2) Since the points are actual spatial coordinates, we can transform the points of the previous frame into the current frame, so we can extend the 3D PE to a temporal version. If we use the direction vector (1 point), it is difficult to align different frames.
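For readers who want to see the difference concretely, here is a minimal NumPy sketch, not the PETR implementation. It lifts pixels to 3D points at D depth values and then moves those points into another frame with a rigid transform. The function names `frustum_points` and `transform_points`, the intrinsics `K`, the `cam_to_ego` and `prev_to_cur` matrices, and the depth range are all hypothetical values chosen for illustration.

```python
import numpy as np


def transform_points(points, T):
    """Apply a 4x4 rigid transform T to an array of 3D points (..., 3)."""
    homo = np.concatenate([points, np.ones_like(points[..., :1])], axis=-1)
    return (homo @ T.T)[..., :3]


def frustum_points(us, vs, depths, K, cam_to_ego):
    """Lift N pixels to 3D points at D depths; returns an (N, D, 3) array."""
    us = np.asarray(us, dtype=np.float64)
    vs = np.asarray(vs, dtype=np.float64)
    depths = np.asarray(depths, dtype=np.float64)
    pix = np.stack([us, vs, np.ones_like(us)], axis=-1)   # (N, 3)
    rays = pix @ np.linalg.inv(K).T                       # camera rays with z = 1
    cam_pts = rays[:, None, :] * depths[None, :, None]    # (N, D, 3)
    return transform_points(cam_pts, cam_to_ego)


# Hypothetical camera: assumed intrinsics and a pure-translation extrinsic.
K = np.array([[1000.0, 0.0, 800.0],
              [0.0, 1000.0, 450.0],
              [0.0, 0.0, 1.0]])
cam_to_ego = np.eye(4)
cam_to_ego[:3, 3] = [1.5, 0.0, 1.6]

us, vs = [800.0, 1800.0], [450.0, 450.0]

# D = 1: one point per pixel at depth 1.0, which only fixes the ray direction.
pts_d1 = frustum_points(us, vs, [1.0], K, cam_to_ego)

# D = 64: 64 points per pixel along the same ray (depth range is an assumption).
pts_d64 = frustum_points(us, vs, np.linspace(1.0, 61.0, 64), K, cam_to_ego)

# Hypothetical ego motion: map points from the previous frame into the current one.
prev_to_cur = np.eye(4)
prev_to_cur[:3, 3] = [-1.0, 0.0, 0.0]
pts_aligned = transform_points(pts_d64, prev_to_cur)
```

All 64 points of a pixel lie on one ray through the camera center, so the D=1 case keeps only that ray's direction, while the D=64 case gives a set of metric coordinates that can be re-expressed in another frame as in the last step.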