link-grammar
Potentially undetected count overflow in do_count()
While running tests on "amy" with ASAN/UBSAN, I got this:
... amy/link-grammar/parse/count.c:1390:7: runtime error: signed integer overflow: 2147483647 * 5569944920 cannot be represented in type 'long int'
... amy/link-grammar/parse/count.c:1390:7: runtime error: signed integer overflow: -7518702831433443131 + -6485378441171344729 cannot be represented in type 'long int'
It just happens that this never occurs with the other languages and the various corpus-* files, but theoretically it could happen with the current code.
All the calculations are done in signed 64-bit (the storage is in signed 32-bit due to a recent change, but that has no implications here). It is signed 64-bit for historical reasons, but it may be a good idea to keep it signed, because overflows can then be detected by the compiler's UBSAN checks.
The problem arises from the clipping value, which is `INT_MAX` (i.e. 2^31-1).
Observe the following code:
https://github.com/opencog/link-grammar/blob/c1b15b8d937c1da7bfdf7f82ef931888f50b1a38/link-grammar/parse/count.c#L1371-L1381
Each of the 4 `do_count()` calls may return `INT_MAX`, and hence `leftcount` (and similarly `rightcount`) can be up to 4*(2^31-1) = 2^33-4.
Then we have this code:
https://github.com/opencog/link-grammar/blob/c1b15b8d937c1da7bfdf7f82ef931888f50b1a38/link-grammar/parse/count.c#L1390
We may get here up to (`total` + (2^33-4) * (2^31-1)), which may be > 2^63.
And we also have this:
https://github.com/opencog/link-grammar/blob/c1b15b8d937c1da7bfdf7f82ef931888f50b1a38/link-grammar/parse/count.c#L1428
So we may get here (`total` + (2^33-4) * (2^33-4)), which is much more than 2^63 (and `total` here may already be near or above 2^63).
Possible solutions:
1. Do nothing, as this has empirically proved not to be a problem with "real" dicts.
2. Clamp `leftcount` and `rightcount` before using them in the multiplications.
3. In `parse_count_clamp()`, clamp to 2^29-1.
(2) and (3) will add a slight overhead, but by analyzing the overflow possibilities I found some places in which the efficiency can be improved:
- In the macro `CACHE_COUNT()`, `c` is unnecessarily wide, and `count_t` can be used instead. However (I didn't check), maybe the compiler already does such an optimization.
- The `parse_count_clamp()` call that has a debug printout "OVERFLOW1" is not needed, since the loop can be performed no more than (2^8 * 2) times and can accumulate no more than (2^31-1) per iteration (max. total ~2^40). The final `parse_count_clamp()` ("OVERFLOW2"), after potentially accumulating up to an additional 2^31-1, is enough.
For now, I would choose option (2).
@linas, please check whether my analysis is correct, and I will send a PR (if needed).
I spent the last 15 minutes looking at this, but don't see a clear answer. I see this:
- (a) The safest solution is to modify all of the `hist_*` defines to check for overflow. That has a performance impact.
- (b) Comment out all uses of `parse_count_clamp()`, review each use of `hist_*` and determine which ones need to be checked, and add `parse_count_clamp()` back in, as needed; remove all other uses of `parse_count_clamp()`.
- (c) Some muddled alternative to (b), involving reviewing each use of `parse_count_clamp()` to see if it's actually needed. This risks missing a location that should have been checked.
- (d) Changing `INT_MAX` to 2^29-1 and crossing one's fingers, hoping that's enough.
I think your proposal (2) is my (a) and your proposal (3) is my (d).
- (d) is the easiest.
- (a) is second-easiest, but it leads to a muddy and hard-to-read `parse/histogram.h`. I don't like the muddle.
- (b) is the best for performance, but also intellectually the hardest.
The question is really "do I feel like a perfectionist today?" and "Do I have the mental energy to do this?" Maybe I should try, and you can review my work?
> (b) Comment out all uses of parse_count_clamp(), review each use of hist_* and determine which ones need to be checked, and add parse_count_clamp() back in, as needed; remove all other uses of parse_count_clamp()
> (c) Some muddled alternative to (b), involving reviewing each use of parse_count_clamp() to see if it's actually needed. This risks missing a location that should have been checked.
I already mostly analyzed all of that in my post above, and my conclusion was that the changes I proposed in (2) will fix the problem:
- Remove the `parse_count_clamp()` call marked as "OVERFLOW1".
- Add 2 `parse_count_clamp()` calls, just before using `leftcount` and `rightcount` in a multiplication (i.e. just before the lines `if (0 < hist_total(&leftcount))` and `if (0 < hist_total(&rightcount))`).
- Leave the other 2 existing `parse_count_clamp()` calls intact.
I forgot to analyze the impact of line 1390 after applying my proposed fix.
https://github.com/opencog/link-grammar/blob/c1b15b8d937c1da7bfdf7f82ef931888f50b1a38/link-grammar/parse/count.c#L1390
In that case, `leftcount` is clipped. `total` is already clipped from the existing call to `parse_count_clamp()` at the end of the match loop, and `count` is read from a clipped table entry (or a `do_count()` result, which is clipped). So the max. total is (2^31-1) + ((2^31-1) * (2^31-1)) < 2^62.
Then the multiplication
https://github.com/opencog/link-grammar/blob/c1b15b8d937c1da7bfdf7f82ef931888f50b1a38/link-grammar/parse/count.c#L1428
can add to it at most ((2^31-1) * (2^31-1)) < 2^62, so the result is still less than 2^63.
(it's very subtle, I hope I didn't miss something...).
> (d) changing INT_MAX to 2^29-1 and crossing ones fingers and hoping that's enough.

It seems to me it may not hurt to do that too.
In any case, I can send a PR for my proposal if it seems fine to you. In case you choose to do it yourself, I can review it of course.
Yes, send a pull req. At each location, add comments such as /* can add at most ((2^31-1) * (2^31-1)) < 2^62, so the result is still less than 2^63 */
... because it is subtle, and hard to review.
I said:

> The `parse_count_clamp()` call that has a debug printout "OVERFLOW1" is not needed since the loop can be performed no more than (2^8 * 2) times
The number (2^8 * 2) is incorrect, since the loop is over the disjuncts. But assuming fewer than 2^31 disjuncts per word, an overflow here is still impossible.
It doesn't seem reasonable that the program can ever handle billions of disjuncts per word, when a large part of them leads to a very high count number (overflow or near overflow). So at most, a sanity check can be done (elsewhere, when the disjuncts are counted for creating the disjunct memory block) that the number of disjuncts is less than 2^31.
> a sanity check can be done

I didn't implement that; it seems like just unneeded overhead...
> billions of disjuncts per word

That won't happen. There won't even be a million.
However, in the earlier days, I had SQL dicts that had a hundred `UNKNOWN-WORD` entries, each of these with maybe 10K disjuncts. These caused combinatoric overflows and parsing timeouts, even if I set `parse-timeout=600`.
> There won't even be a million

In generation mode, we have ~3M disjuncts per word (in the middle of sentences). However, as we know, this causes vast slowness... The speed can be improved (a WIP), but the real solution is to implement the disjunct sampling method you have suggested. However, this will be for a future version (i.e. not for 5.11.0).