link-grammar
Potentially undetected count overflow in do_count()
While running tests on "amy" with ASAN/UBSAN, I got this:
... amy/link-grammar/parse/count.c:1390:7: runtime error: signed integer overflow: 2147483647 * 5569944920 cannot be represented in type 'long int'
... amy/link-grammar/parse/count.c:1390:7: runtime error: signed integer overflow: -7518702831433443131 + -6485378441171344729 cannot be represented in type 'long int'
This just happens never to occur with the other languages and the various corpus-* files, but theoretically it could happen with the current code.
All the calculations are done in signed 64-bit (the storage is in signed 32-bit due to a recent change, but this has no implications here). It is signed 64-bit due to historical reasons, but it can be a good idea to keep it signed because overflows could then be detected by the compiler's UBSAN checks.
The problem arises from the clipping value. It is INT_MAX, which is 2^31-1.
Observe the following code:
https://github.com/opencog/link-grammar/blob/c1b15b8d937c1da7bfdf7f82ef931888f50b1a38/link-grammar/parse/count.c#L1371-L1381
Each of the 4 do_count() calls may return INT_MAX, and hence leftcount (and similarly rightcount) can be up to 4*(2^31-1) = 2^33-4.
Then we have this code:
https://github.com/opencog/link-grammar/blob/c1b15b8d937c1da7bfdf7f82ef931888f50b1a38/link-grammar/parse/count.c#L1390
We may get here up to (total + (2^33-4) * (2^31-1)), which may be > 2^63.
And we also have this:
https://github.com/opencog/link-grammar/blob/c1b15b8d937c1da7bfdf7f82ef931888f50b1a38/link-grammar/parse/count.c#L1428
So we may get here (total + (2^33-4) * (2^33-4)), which is much more than 2^63 (and total here may already be near or above 2^63).
Possible solutions:
1. Do nothing, as this has empirically proved not to be a problem with "real" dicts.
2. Clamp `leftcount` and `rightcount` before using them in the multiplications.
3. In `parse_count_clamp()`, clamp to `2^29-1`.
(2) and (3) will add a slight overhead, but by analyzing the overflow possibilities I found some places in which the efficiency can be improved:
- In the macro `CACHE_COUNT()`, `c` is unnecessarily wide; `count_t` can be used instead. However (I didn't check), maybe the compiler already does such an optimization.
- The `parse_count_clamp()` call that has a debug printout "OVERFLOW1" is not needed, since the loop can be performed no more than `2^8 * 2` times and accumulates no more than `2^31-1` per iteration (max. total ~2^40). The final `parse_count_clamp()` ("OVERFLOW2"), after potentially accumulating up to an additional `2^31-1`, is enough.
For now, I would choose option (2).
@linas, Please check if my analysis is correct, and I will send a PR (if needed).
I spent the last 15 minutes looking at this, but don't see a clear answer. I see this:
- (a) The safest solution is to modify all of the `hist_*` defines to check for overflow. That has a performance impact.
- (b) Comment out all uses of `parse_count_clamp()`, review each use of `hist_*` and determine which ones need to be checked, and add `parse_count_clamp()` back in as needed; remove all other uses of `parse_count_clamp()`.
- (c) Some muddled alternative to (b), involving reviewing each use of `parse_count_clamp()` to see if it's actually needed. This risks missing a location that should have been checked.
- (d) Changing `INT_MAX` to `2^29-1` and crossing one's fingers and hoping that's enough.
I think your proposal (2) is my (a), and your proposal (3) is my (d).
- (d) is the easiest.
- (a) is second-easiest, but it leads to a muddy and hard-to-read `parse/histogram.h`. I don't like the muddle.
- (b) is the best for performance, but also intellectually the hardest.
The question is really "do I feel like a perfectionist today?" and "Do I have the mental energy to do this?" Maybe I should try, and you can review my work?
> (b) Comment out all uses of `parse_count_clamp()`, review each use of `hist_*` and determine which ones need to be checked, and add `parse_count_clamp()` back in, as needed; remove all other uses of `parse_count_clamp()`
>
> (c) Some muddled alternative to (b), involving reviewing each use of `parse_count_clamp()` to see if it's actually needed. This risks missing a location that should have been checked.
I already mostly analyzed all of that in my post above, and my conclusion was that the changes I proposed in (2) will fix the problem:
- Remove the `parse_count_clamp()` call marked as "OVERFLOW1".
- Add 2 `parse_count_clamp()` calls: just before using `leftcount` and `rightcount` in a multiplication (just before the lines `if (0 < hist_total(&leftcount))` and `if (0 < hist_total(&rightcount))`).
- Leave the other 2 existing `parse_count_clamp()` calls intact.
I forgot to analyze the impact of line 1390 after applying my proposed fix.
https://github.com/opencog/link-grammar/blob/c1b15b8d937c1da7bfdf7f82ef931888f50b1a38/link-grammar/parse/count.c#L1390
In that case, leftcount is clipped. total is already clipped from the existing call to parse_count_clamp() at the end of the match loop, and count is read from a clipped table entry (or do_count() result, which is clipped). So the max. total is (2^31-1) + ((2^31-1) * (2^31-1)) < 2^62.
Then the multiplication
https://github.com/opencog/link-grammar/blob/c1b15b8d937c1da7bfdf7f82ef931888f50b1a38/link-grammar/parse/count.c#L1428
can add to it at most ((2^31-1) * (2^31-1)) < 2^62, so the result is still less than 2^63.
(it's very subtle, I hope I didn't miss something...).
> (d) changing `INT_MAX` to 2^29-1 and crossing one's fingers and hoping that's enough.

It seems to me it may not hurt to do that too.
In any case, I can send a PR for my proposal if it seems fine to you. In case you choose to do it yourself, I can review it of course.
Yes, send a pull req. At each location, add comments such as `/* can add at most ((2^31-1) * (2^31-1)) < 2^62, so the result is still less than 2^63 */` ... because it is subtle, and hard to review.
I said:
> The `parse_count_clamp()` call that has a debug printout "OVERFLOW1" is not needed, since the loop can be performed no more than `2^8 * 2` times
The number `2^8 * 2` is incorrect, since the loop is over the disjuncts. But assuming fewer than 2^31 disjuncts per word, an overflow here is still impossible.
It doesn't seem reasonable that the program can ever handle billions of disjuncts per word, when a large part of them leads to a very high count (overflow or near overflow). So at most, a sanity check can be done (elsewhere, when the disjuncts are counted for creating the disjunct memory block) that the number of disjuncts is less than 2^31.
> a sanity check can be done

I didn't implement that; it seems like unneeded overhead...
> billions of disjuncts per word,

That won't happen. There won't even be a million.
However, in the earlier days, I had SQL dicts that had a hundred UNKNOWN-WORD entries, each of these with maybe 10K disjuncts. These caused combinatoric overflows and parsing timeouts, even if I set `parse-timeout=600`.
> There won't even be a million
In generation mode, we have ~3M disjuncts per word (in the middle of sentences). However, as we know, this causes vast slowness... The speed can be improved (a WIP), but the real solution is to implement the disjunct-sampling method you have suggested. However, that will be for a future version (i.e. not for 5.11.0).