apriltag icon indicating copy to clipboard operation
apriltag copied to clipboard

Adding check for length 0 in fit_line to avoid NaN

Open jrepp opened this issue 3 years ago • 11 comments

When dividing by 0 a NaN is introduced to the lineparm, this is eventually introduced to the quads and causes failures later in detection such as quad building.

I tested this with no failing inputs such as https://github.com/AprilRobotics/apriltag/issues/72 as well as verifying I didn't introduce regressions in known good detections.

jrepp avatar Feb 24 '22 08:02 jrepp

I want to understand this issue more before adding a fix like this. Theoretically I don't think we should be ending up in the case where length == 0 unless something else has already gone wrong, for example if we are trying to fit a line to a single point or something like that.

mkrogius avatar Feb 26 '22 18:02 mkrogius

@jrepp Can you run this again with the (now correctly working) AddressSanitizer (see https://github.com/AprilRobotics/apriltag/pull/216 on how to enable and use it)? I got numerous division by zero errors for the line fitting. This may be an issue resulting from an earlier unnoticed error.

christian-rauch avatar Feb 28 '22 12:02 christian-rauch

Happy to re-run this.

The specific case I'm able to reproduce is in the line fitting using the eigenvalue solution.

When Cxx or Cyy exactly equal the eigenvalue it forces nx1 or ny2 to 0 causing a divide by 0 - this being a NaN which caused the downstream errors.

I haven't thoroughly studied the line fitting to provide a good suggestion for this. My understanding is that given a cluster of points you can use error distance from the eigenvector of the expected line to do accept/reject lines. It's not well documented locally but that's my guess from squinting at the code :)

jrepp avatar Mar 01 '22 04:03 jrepp

./apriltag_demo crash-april.pnm -x 1
../common/zhash.c:545:22: runtime error: left shift of 211382645 by 7 places cannot be represented in type 'int'
loading crash-april.pnm
image: crash-april.pnm 2448x2048

../apriltag_quad_thresh.c:244:25: runtime error: division by zero
../apriltag_quad_thresh.c:245:25: runtime error: division by zero
../apriltag.c:755:37: runtime error: -nan is outside the range of representable values of type 'int'
../apriltag.c:795:29: runtime error: -nan is outside the range of representable values of type 'int'
../apriltag.c:796:29: runtime error: -nan is outside the range of representable values of type 'int'
../apriltag.c:838:24: runtime error: division by zero
../apriltag.c:838:37: runtime error: division by zero
../apriltag.c:839:26: runtime error: division by zero
../apriltag.c:840:26: runtime error: division by zero
../apriltag.c:841:26: runtime error: division by zero
WRN: Matrix is singular.
 0                             init        0.013000 ms        0.013000 ms
 1                       blur/sharp        0.052000 ms        0.065000 ms
 2                        threshold      150.189000 ms      150.254000 ms
 3                        unionfind      260.142000 ms      410.396000 ms
 4                    make clusters      363.206000 ms      773.602000 ms
 5            fit quads to clusters      322.048000 ms     1095.650000 ms
 6                            quads        3.182000 ms     1098.832000 ms
 7                decode+refinement        9.202000 ms     1108.034000 ms
 8                        reconcile        0.014000 ms     1108.048000 ms
 9                     debug output        0.005000 ms     1108.053000 ms
10                          cleanup        0.056000 ms     1108.109000 ms
hamm     0     0     0     0     0     0     0     0     0     0     1108.096   129
Summary
hamm     0     0     0     0     0     0     0     0     0     0     1108.096   129


jrepp avatar Mar 01 '22 04:03 jrepp

So, you get a bunch of division by zero and -nan is outside the range of representable values of type 'int' errors too. But are those the cause or the effect of the issue that you try to fix with this PR? I would suggest that we fix those errors first and then see if the error, that you are trying to solve with this PR, still exists.

christian-rauch avatar Mar 01 '22 11:03 christian-rauch

The output I provided is before my proposed fix - after the fix I still get two div/0 which I will track down and add to this PR.

If the first issue is not fixed and the NaN makes it into the quad data it causes more downstream errors.

jrepp avatar Mar 01 '22 15:03 jrepp

My question was essentially if your issue still exists if the reported errors are fixed. The reported issues can either

  1. cause the issue that you want to fix with this PR,
  2. will be fixed after your PR,
  3. are independent of this PR.

We now know that the second case is not fully the case, since you still see errors. But do you still see the same errors after your PR, or are some of the reported issues fixed when your PR is merged?

christian-rauch avatar Mar 01 '22 15:03 christian-rauch

@jrepp Does the issue that you try to solve with this PR still persist? Can you upload example data (your crash-april.pnm) with the steps to reproduce the issue?

christian-rauch avatar May 19 '22 10:05 christian-rauch

When building with ASAN enabled and using the image provided in https://github.com/AprilRobotics/apriltag/issues/72#issuecomment-843868463, I am getting similar error messages.

As pnm file (convert bad-homography.png bad-homography.pnm, bad-homography.pnm.zip) running apriltag_demo bad-homography.pnm -x 1 gives:

../apriltag_quad_thresh.c:244:25: runtime error: division by zero
../apriltag_quad_thresh.c:245:25: runtime error: division by zero
../apriltag.c:762:37: runtime error: -nan is outside the range of representable values of type 'int'
../apriltag.c:802:29: runtime error: -nan is outside the range of representable values of type 'int'
../apriltag.c:803:29: runtime error: -nan is outside the range of representable values of type 'int'
../apriltag.c:845:24: runtime error: division by zero
../apriltag.c:845:37: runtime error: division by zero
../apriltag.c:846:26: runtime error: division by zero
../apriltag.c:847:26: runtime error: division by zero
../apriltag.c:848:26: runtime error: division by zero
apriltag.c:450:homography_compute2(): WRN: Matrix is singular.
loading /home/christian/Downloads/bad-homography.pnm
image: /home/christian/Downloads/bad-homography.pnm 2448x2048
 0                             init        0.021000 ms        0.021000 ms
 1                       blur/sharp        0.077000 ms        0.098000 ms
 2                        threshold      233.291000 ms      233.389000 ms
 3                        unionfind      232.551000 ms      465.940000 ms
 4                    make clusters      335.949000 ms      801.889000 ms
 5            fit quads to clusters      290.553000 ms     1092.442000 ms
 6                            quads        3.004000 ms     1095.446000 ms
 7                decode+refinement        7.805000 ms     1103.251000 ms
 8                        reconcile        0.008000 ms     1103.259000 ms
 9                     debug output        0.004000 ms     1103.263000 ms
10                          cleanup        0.018000 ms     1103.281000 ms
hamm     0     0     0     0     0     0     0     0     0     0     1103.260   129
Summary
hamm     0     0     0     0     0     0     0     0     0     0     1103.260   129

As jpg file (convert bad-homography.png bad-homography.jpg, bad-homography.jpg.zip) running apriltag_demo bad-homography.jpg -x 1 gives:

../common/pjpeg.c:596:60: runtime error: left shift of negative value -1
../common/pjpeg.c:622:64: runtime error: left shift of negative value -1
../common/pjpeg-idct.c:296:88: runtime error: left shift of negative value -906
../common/pjpeg-idct.c:348:18: runtime error: left shift of negative value -7248
../common/pjpeg.c:595:57: runtime error: shift exponent -1 is negative
../common/pjpeg-idct.c:299:18: runtime error: left shift of negative value -906
../common/pjpeg-idct.c:333:26: runtime error: left shift of negative value -24
../common/pjpeg-idct.c:288:24: runtime error: left shift of negative value -4
../common/pjpeg.c:621:61: runtime error: shift exponent -1 is negative
../apriltag_quad_thresh.c:912:26: runtime error: division by zero
../apriltag_quad_thresh.c:912:44: runtime error: division by zero
../apriltag.c:845:24: runtime error: division by zero
../apriltag.c:845:37: runtime error: division by zero
../apriltag.c:846:26: runtime error: division by zero
../apriltag.c:847:26: runtime error: division by zero
../apriltag.c:848:26: runtime error: division by zero
loading /home/christian/Downloads/bad-homography.jpg
image: /home/christian/Downloads/bad-homography.jpg 2448x2048
 0                             init        0.005000 ms        0.005000 ms
 1                       blur/sharp        0.015000 ms        0.020000 ms
 2                        threshold      121.424000 ms      121.444000 ms
 3                        unionfind      164.627000 ms      286.071000 ms
 4                    make clusters      201.147000 ms      487.218000 ms
 5            fit quads to clusters      187.988000 ms      675.206000 ms
 6                            quads        2.873000 ms      678.079000 ms
 7                decode+refinement        5.599000 ms      683.678000 ms
 8                        reconcile        0.004000 ms      683.682000 ms
 9                     debug output        0.003000 ms      683.685000 ms
10                          cleanup        0.013000 ms      683.698000 ms
hamm     0     0     0     0     0     0     0     0     0     0      683.693    92
Summary
hamm     0     0     0     0     0     0     0     0     0     0      683.693    92

Note the different errors reported by ASAN per image file type.


@jrepp With your fix-quad-nan branch, I am still getting errors:

../common/zhash.c:545:22: runtime error: left shift of 211382645 by 7 places cannot be represented in type 'int'
../apriltag_quad_thresh.c:917:26: runtime error: division by zero
../apriltag_quad_thresh.c:917:44: runtime error: division by zero
loading /home/christian/Downloads/bad-homography.pnm

and

../common/zhash.c:545:22: runtime error: left shift of 211382645 by 7 places cannot be represented in type 'int'
../common/pjpeg.c:596:60: runtime error: left shift of negative value -1
../common/pjpeg.c:622:64: runtime error: left shift of negative value -1
../common/pjpeg-idct.c:296:88: runtime error: left shift of negative value -906
../common/pjpeg-idct.c:348:18: runtime error: left shift of negative value -7248
../common/pjpeg.c:595:57: runtime error: shift exponent -1 is negative
../common/pjpeg-idct.c:299:18: runtime error: left shift of negative value -906
../common/pjpeg-idct.c:333:26: runtime error: left shift of negative value -24
../common/pjpeg-idct.c:288:24: runtime error: left shift of negative value -4
../common/pjpeg.c:621:61: runtime error: shift exponent -1 is negative
../apriltag_quad_thresh.c:917:26: runtime error: division by zero
../apriltag_quad_thresh.c:917:44: runtime error: division by zero
../apriltag.c:838:24: runtime error: division by zero
../apriltag.c:838:37: runtime error: division by zero
../apriltag.c:839:26: runtime error: division by zero
../apriltag.c:840:26: runtime error: division by zero
../apriltag.c:841:26: runtime error: division by zero
loading /home/christian/Downloads/bad-homography.jpg

Is this PR supposed to solve these issues?

christian-rauch avatar May 19 '22 10:05 christian-rauch

@jrepp Can you comment on which errors are supposed to be fixed by your PR? Is it only supposed to fix the runtime error: -nan is outside the range of representable values of type 'int', which only happens in the pnm image, or is it also supposed to fix the runtime error: division by zero (and others), which appears in multiple places?

christian-rauch avatar Jun 28 '22 13:06 christian-rauch

Sorry, I've lost some of the context of this change and I need some time to address your feedback. I'll get back to you soon.

jrepp avatar Jul 02 '22 01:07 jrepp

Superseded by https://github.com/AprilRobotics/apriltag/pull/317.

christian-rauch avatar Feb 05 '24 20:02 christian-rauch