PETR some question about position encoding in depth axis

Open AndyYuan96 opened this issue 2 years ago • 1 comments

Hi, thanks for sharing such wonderful work. After reading the paper, I have a possibly naive question: what happens if we set D=1? As I understand it, the position encoding for each location in each view's feature map serves to tell the model which view that location comes from. So why is sampling D depths necessary? Even with a single depth (D=1), after the coordinate transform with the camera extrinsic matrix, it should be easy to tell which view a position comes from. So my question is: what is the difference between setting D=1 and D>1? I did not see an ablation study on this.

Aug 10 '22 13:08 AndyYuan96

Hi, in the early stage of our development, we ran an experiment with D=1 (depth = 1.0 m). In this case, the 3D PE encodes the direction vector of a line. The result is about 1%~2% lower than sampling 64 points along the line. In my opinion, sampling points has some advantages: (1) The sampled points are actual spatial coordinates, and the queries are also generated from reference points, which can speed up convergence. If we sample only 1 point, it is difficult to determine the depth value of this point (depth = 1 or an estimated depth). (2) Since the points are actual spatial coordinates, we can transform the points of the previous frame to the current frame, so we can extend the 3D PE to its temporal version. If we use the direction vector (1 point), it is difficult to do the alignment between different frames.

Aug 12 '22 06:08 yingfei1016
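
For readers following along, here is a minimal sketch of the geometry discussed in the reply above. It is not taken from the PETR code base: the function names, the pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), the depth values, and the 1 m forward ego-motion are all assumptions chosen for illustration. It back-projects one pixel at D=1 and at several depths, then applies a previous-to-current frame transform to show that the direction seen from the current frame depends on the depth of the point.

```python
def pixel_ray_points(u, v, depths, fx, fy, cx, cy):
    """Back-project pixel (u, v) into camera-frame 3D points, one per depth."""
    return [((u - cx) / fx * d, (v - cy) / fy * d, d) for d in depths]


def transform(points, mat):
    """Apply a 4x4 rigid transform (nested lists, row-major) to 3D points."""
    out = []
    for x, y, z in points:
        out.append(tuple(r[0] * x + r[1] * y + r[2] * z + r[3] for r in mat[:3]))
    return out


# Hypothetical pinhole intrinsics, not values from the PETR configs.
fx = fy = 1000.0
cx, cy = 800.0, 450.0

# D = 1: a single point at depth 1.0 m, which only fixes the ray direction.
single = pixel_ray_points(900, 450, [1.0], fx, fy, cx, cy)

# D > 1: several concrete 3D coordinates along the same ray.
multi = pixel_ray_points(900, 450, [2.0, 4.0], fx, fy, cx, cy)

# Assumed ego-motion: the camera moved 1 m forward along its z axis, so
# previous-frame points shift by -1 m in z when expressed in the current frame.
prev_to_cur = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, -1.0],
    [0.0, 0.0, 0.0, 1.0],
]
aligned = transform(multi, prev_to_cur)

# Direction (x / z) of each aligned point as seen from the current frame.
# The two values differ although both points came from the same pixel.
directions = [p[0] / p[2] for p in aligned]
print(single, multi, aligned, directions)
```

In this toy setup the two points from the same previous-frame pixel end up on different rays in the current frame, which is one way to see why a depth-free direction vector is hard to align across frames while sampled 3D points can simply be transformed.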
