link-grammar
link-grammar copied to clipboard
Actual dict order is not as described for UTF-8 characters
In a comment in read-dict.c:
The use of suffixes means that the ordering of the words is not
* exactly the order given by strcmp. The order must be such that, for
* example, "make" < "make.n" < "make-up" -- suffixed words come after
* the bare words, but before any other other words with non-alphabetic
* characters (such as the hyphen in "make-up", or possibly UTF8
* characters). Thus, plain "strcmp" can't be used to determine
* dictionary order.
However, since the comparison in dict_order_strict()
is done using char, which is signed on x86, the above is not correct regarding makeøup
, because the UTF-8 leading char turns out then to be negative. Here is a dict order test I did with some fake words (using link-parser '-\!a*'
):
aaø.test
aa.u
aab.test
This doesn't look right. Are there any bad implications due to that?
Note that the dict_order_strict()
indeed takes care to make the subscript separator the lowest value character for comparisons, to ensure make.n" < "make-up", from the time the subscript mark was .
:
return ((*s == SUBSCRIPT_MARK)?(1):(*s)) - ((*t == SUBSCRIPT_MARK)?(1):(*t));
There are only very few such "mixed" words in the dict, so this is not actually tested. I could not create test "mixed" words that cause problems, but changing the above "return" to not preserve the make.n" < "make-up" causes dict errors.
BTW, I tried to enforce the description above on UTF-8 characters too by replacing the arguments of the dict_order_*()
functions to be "unsigned char", in a hope that the overhead will be the same and this question will be redundant, but from some reason this generates much overhead.
(Also , compiling with CFLAGS=-funsigned-char
results in ~50% slower runs.).
EDIT: to be "unsigned char" above.
I believe (not quite sure) that the only word-ordering that matters is that the subscripted words must come after the non-subscripted words. I believe (not sure) that this is required for word-lookup to work correctly. Thus, there would be bugs if this could happen:
aa
aa-foo
aa.u
or maybe
aa.u
aaø
aa
Unclear about
aa.u
aa
The reason for this is that some sort of binary-tree lookup is used, and it halts when one gets too high or low -- thus a word, and word+subscripts need to be next to each-other.
The check result that I posted indeed doesn't show the problem. Here it is again, using the words from your post.
I added in the dict:
aaø: TEST+;
aa-foo: TEST+;
$ link-parser '-\!aa*'
...
Token "aa*" matches:
aaø 1 disjuncts
aa.u 165 disjuncts <en/words/units.1>
aa-foo 1 disjuncts
So apparently the problem exists.
I will try to fix the check to behave the same for UTF-8 chars too. The challenge in that is to not cause inefficiency (this is a critical section) due to apparent signed-to-unsigned conversion, a thing that happens when naively trying to use unsigned int
or unsigned char
in this comparison.
Since SUBSCRIPT_MARK is now the lowest (valid for a dict word) possible value of an unsigned char
, if an unsigned comparison is done, its check can be omitted because it cannot change the sign of
(*s - *t)
.
The current dict_order_strict()
is:
static inline int dict_order_strict(const char *s, const Dict_node * dn)
{
const char * t = dn->string;
while (*s != '\0' && *s == *t) {s++; t++;}
return ((*s == SUBSCRIPT_MARK)?(1):(*s)) - ((*t == SUBSCRIPT_MARK)?(1):(*t));
}
Since it is a critical section, I would like to make it faster.
One way to do so is to get rid of the SUBSCRIPT_MARK comparison. Its purpose is to ensure it is the smallest value character for comparisons, so every character will be stored after it. This is not needed any more, because it is now \0x03
, which is the lowest character value in valid words.
But I then noted that this function doesn't work as intended if char
is signed (as on).
Here is a fix that I tried:
static inline int dict_order_strict(const char *s, const Dict_node * dn)
{
const char * t = dn->string;
while (*s != '\0' && *s == *t) {s++; t++;}
return (unsigned char)*s - (unsigned char)*t;
}
On GCC 7.3 (-O3), it is compiled to 16 instructions instead of the original 35. (It works correctly after changing dict_order_bare accordingly.)
To my surprize, the Russian dict read is then much slower (+50%), but the English read dict speed is slightly faster. This may hint a severe tree imbalance due to the interaction of utf-8 characters and the "new" sort order.
Do you have any idea how to debug that?
Recall that UTF8 is designed so that if the sign bit is set (the char is "negative"), then there will typically be at least one more byte, holding more of the character (unless the sixth bit is zero .. etc) . So, for russian, almost all bytes will be "negative". Why this causes an unbalance in the tree, I cannot guess. You can try to printf the return value of dict_order_strict and see if it looks wrong somehow.
There is also a possibility that the compiler is doing something unexpected (with sign extension). Try this:
((int) ((unsigned char)*s)) - ((int) ((unsigned char)*s));
Try also exchanging s and t.
The number of machine instructions of this function are about half after this change so I guess (not validated it) that no overhead added (I expect that the (unsigned char) will not cause a any actual conversion). A hint that this is the case is that there is no added overhead in English.
A try to reverse the order (I did it by using a -
prefix where the function is used) didn't fix the overhead.
Regarding the correctness of the returned value of dict_order_strict() when using (unsigned char)
, inspecting -link-parser "-\!*"
shows that the sorting is then as expected (the signed
sort has the problem I demonstrated).
Next I intend to compare in details how a word is inserted and searched in the tree in the signed and unsigned cases.
Well, insert_dict
at line 1444 inserts on the left, or on the right, depending on the return value from dict_order_strict
-- so if this is almost always returning a negative number, or almost always returning a positive number, then the tree will be unbalanced, and insertion will be slower. Or something like that. At the end, the tree is rebalanced --- maybe that is also running longer??