apriltag
apriltag copied to clipboard
Adding check for length 0 in fit_line to avoid NaN
When dividing by 0 a NaN is introduced to the lineparm, this is eventually introduced to the quads and causes failures later in detection such as quad building.
I tested this with no failing inputs such as https://github.com/AprilRobotics/apriltag/issues/72 as well as verifying I didn't introduce regressions in known good detections.
I want to understand this issue more before adding a fix like this. Theoretically I don't think we should be ending up in the case where length == 0
unless something else has already gone wrong, for example if we are trying to fit a line to a single point or something like that.
@jrepp Can you run this again with the (now correctly working) AddressSanitizer (see https://github.com/AprilRobotics/apriltag/pull/216 on how to enable and use it)? I got numerous division by zero
errors for the line fitting. This may be an issue resulting from an earlier unnoticed error.
Happy to re-run this.
The specific case I'm able to reproduce is in the line fitting using the eigenvalue solution.
When Cxx or Cyy exactly equal the eigenvalue it forces nx1 or ny2 to 0 causing a divide by 0 - this being a NaN which caused the downstream errors.
I haven't thoroughly studied the line fitting to provide a good suggestion for this. My understanding is that given a cluster of points you can use error distance from the eigenvector of the expected line to do accept/reject lines. It's not well documented locally but that's my guess from squinting at the code :)
./apriltag_demo crash-april.pnm -x 1
../common/zhash.c:545:22: runtime error: left shift of 211382645 by 7 places cannot be represented in type 'int'
loading crash-april.pnm
image: crash-april.pnm 2448x2048
../apriltag_quad_thresh.c:244:25: runtime error: division by zero
../apriltag_quad_thresh.c:245:25: runtime error: division by zero
../apriltag.c:755:37: runtime error: -nan is outside the range of representable values of type 'int'
../apriltag.c:795:29: runtime error: -nan is outside the range of representable values of type 'int'
../apriltag.c:796:29: runtime error: -nan is outside the range of representable values of type 'int'
../apriltag.c:838:24: runtime error: division by zero
../apriltag.c:838:37: runtime error: division by zero
../apriltag.c:839:26: runtime error: division by zero
../apriltag.c:840:26: runtime error: division by zero
../apriltag.c:841:26: runtime error: division by zero
WRN: Matrix is singular.
0 init 0.013000 ms 0.013000 ms
1 blur/sharp 0.052000 ms 0.065000 ms
2 threshold 150.189000 ms 150.254000 ms
3 unionfind 260.142000 ms 410.396000 ms
4 make clusters 363.206000 ms 773.602000 ms
5 fit quads to clusters 322.048000 ms 1095.650000 ms
6 quads 3.182000 ms 1098.832000 ms
7 decode+refinement 9.202000 ms 1108.034000 ms
8 reconcile 0.014000 ms 1108.048000 ms
9 debug output 0.005000 ms 1108.053000 ms
10 cleanup 0.056000 ms 1108.109000 ms
hamm 0 0 0 0 0 0 0 0 0 0 1108.096 129
Summary
hamm 0 0 0 0 0 0 0 0 0 0 1108.096 129
So, you get a bunch of division by zero
and -nan is outside the range of representable values of type 'int'
errors too. But are those the cause or the effect of the issue that you try to fix with this PR?
I would suggest that we fix those errors first and then see if the error, that you are trying to solve with this PR, still exists.
The output I provided is before my proposed fix - after the fix I still get two div/0 which I will track down and add to this PR.
If the first issue is not fixed and the NaN makes it into the quad data it causes more downstream errors.
My question was essentially if your issue still exists if the reported errors are fixed. The reported issues can either
- cause the issue that you want to fix with this PR,
- will be fixed after your PR,
- are independent of this PR.
We now know that the second case is not fully the case, since you still see errors. But do you still see the same errors after your PR, or are some of the reported issues fixed when your PR is merged?
@jrepp Does the issue that you try to solve with this PR still persist? Can you upload example data (your crash-april.pnm
) with the steps to reproduce the issue?
When building with ASAN enabled and using the image provided in https://github.com/AprilRobotics/apriltag/issues/72#issuecomment-843868463, I am getting similar error messages.
As pnm file (convert bad-homography.png bad-homography.pnm
, bad-homography.pnm.zip) running apriltag_demo bad-homography.pnm -x 1
gives:
../apriltag_quad_thresh.c:244:25: runtime error: division by zero
../apriltag_quad_thresh.c:245:25: runtime error: division by zero
../apriltag.c:762:37: runtime error: -nan is outside the range of representable values of type 'int'
../apriltag.c:802:29: runtime error: -nan is outside the range of representable values of type 'int'
../apriltag.c:803:29: runtime error: -nan is outside the range of representable values of type 'int'
../apriltag.c:845:24: runtime error: division by zero
../apriltag.c:845:37: runtime error: division by zero
../apriltag.c:846:26: runtime error: division by zero
../apriltag.c:847:26: runtime error: division by zero
../apriltag.c:848:26: runtime error: division by zero
apriltag.c:450:homography_compute2(): WRN: Matrix is singular.
loading /home/christian/Downloads/bad-homography.pnm
image: /home/christian/Downloads/bad-homography.pnm 2448x2048
0 init 0.021000 ms 0.021000 ms
1 blur/sharp 0.077000 ms 0.098000 ms
2 threshold 233.291000 ms 233.389000 ms
3 unionfind 232.551000 ms 465.940000 ms
4 make clusters 335.949000 ms 801.889000 ms
5 fit quads to clusters 290.553000 ms 1092.442000 ms
6 quads 3.004000 ms 1095.446000 ms
7 decode+refinement 7.805000 ms 1103.251000 ms
8 reconcile 0.008000 ms 1103.259000 ms
9 debug output 0.004000 ms 1103.263000 ms
10 cleanup 0.018000 ms 1103.281000 ms
hamm 0 0 0 0 0 0 0 0 0 0 1103.260 129
Summary
hamm 0 0 0 0 0 0 0 0 0 0 1103.260 129
As jpg file (convert bad-homography.png bad-homography.jpg
, bad-homography.jpg.zip) running apriltag_demo bad-homography.jpg -x 1
gives:
../common/pjpeg.c:596:60: runtime error: left shift of negative value -1
../common/pjpeg.c:622:64: runtime error: left shift of negative value -1
../common/pjpeg-idct.c:296:88: runtime error: left shift of negative value -906
../common/pjpeg-idct.c:348:18: runtime error: left shift of negative value -7248
../common/pjpeg.c:595:57: runtime error: shift exponent -1 is negative
../common/pjpeg-idct.c:299:18: runtime error: left shift of negative value -906
../common/pjpeg-idct.c:333:26: runtime error: left shift of negative value -24
../common/pjpeg-idct.c:288:24: runtime error: left shift of negative value -4
../common/pjpeg.c:621:61: runtime error: shift exponent -1 is negative
../apriltag_quad_thresh.c:912:26: runtime error: division by zero
../apriltag_quad_thresh.c:912:44: runtime error: division by zero
../apriltag.c:845:24: runtime error: division by zero
../apriltag.c:845:37: runtime error: division by zero
../apriltag.c:846:26: runtime error: division by zero
../apriltag.c:847:26: runtime error: division by zero
../apriltag.c:848:26: runtime error: division by zero
loading /home/christian/Downloads/bad-homography.jpg
image: /home/christian/Downloads/bad-homography.jpg 2448x2048
0 init 0.005000 ms 0.005000 ms
1 blur/sharp 0.015000 ms 0.020000 ms
2 threshold 121.424000 ms 121.444000 ms
3 unionfind 164.627000 ms 286.071000 ms
4 make clusters 201.147000 ms 487.218000 ms
5 fit quads to clusters 187.988000 ms 675.206000 ms
6 quads 2.873000 ms 678.079000 ms
7 decode+refinement 5.599000 ms 683.678000 ms
8 reconcile 0.004000 ms 683.682000 ms
9 debug output 0.003000 ms 683.685000 ms
10 cleanup 0.013000 ms 683.698000 ms
hamm 0 0 0 0 0 0 0 0 0 0 683.693 92
Summary
hamm 0 0 0 0 0 0 0 0 0 0 683.693 92
Note the different errors reported by ASAN per image file type.
@jrepp With your fix-quad-nan
branch, I am still getting errors:
../common/zhash.c:545:22: runtime error: left shift of 211382645 by 7 places cannot be represented in type 'int'
../apriltag_quad_thresh.c:917:26: runtime error: division by zero
../apriltag_quad_thresh.c:917:44: runtime error: division by zero
loading /home/christian/Downloads/bad-homography.pnm
and
../common/zhash.c:545:22: runtime error: left shift of 211382645 by 7 places cannot be represented in type 'int'
../common/pjpeg.c:596:60: runtime error: left shift of negative value -1
../common/pjpeg.c:622:64: runtime error: left shift of negative value -1
../common/pjpeg-idct.c:296:88: runtime error: left shift of negative value -906
../common/pjpeg-idct.c:348:18: runtime error: left shift of negative value -7248
../common/pjpeg.c:595:57: runtime error: shift exponent -1 is negative
../common/pjpeg-idct.c:299:18: runtime error: left shift of negative value -906
../common/pjpeg-idct.c:333:26: runtime error: left shift of negative value -24
../common/pjpeg-idct.c:288:24: runtime error: left shift of negative value -4
../common/pjpeg.c:621:61: runtime error: shift exponent -1 is negative
../apriltag_quad_thresh.c:917:26: runtime error: division by zero
../apriltag_quad_thresh.c:917:44: runtime error: division by zero
../apriltag.c:838:24: runtime error: division by zero
../apriltag.c:838:37: runtime error: division by zero
../apriltag.c:839:26: runtime error: division by zero
../apriltag.c:840:26: runtime error: division by zero
../apriltag.c:841:26: runtime error: division by zero
loading /home/christian/Downloads/bad-homography.jpg
Is this PR supposed to solve these issues?
@jrepp Can you comment on which errors are supposed to be fixed by your PR? Is it only supposed to fix the runtime error: -nan is outside the range of representable values of type 'int'
, which only happens in the pnm
image, or is it also supposed to fix the runtime error: division by zero
(and others), which appears in multiple places?
Sorry, I've lost some of the context of this change and I need some time to address your feedback. I'll get back to you soon.
Superseded by https://github.com/AprilRobotics/apriltag/pull/317.