link-grammar
A fundamental flaw in length_limit=1
I tried to add `ID*` to the length_limit of 1 (after first allowing this usage of `ID`, of course).
I did this in order to check whether it can speed up parsing (possibly by saving connector comparisons, since the length_limit is checked first anyway).
To my surprise, the following sentences then failed to parse:
As yet, no one has thought of a solution.
What in God's name happened?
What in Lord's name is going on here?
The problem has to do with alternatives. In the case of "As yet", here are the 2D-array slots:
| 0 | 1 | 2 | 3 | ... |
|---|---|---|---|---|
| LEFT-WALL | As | | yet | |
| | as | | | |
| | A.u | s.u | | |
and since the distance between `as` and `yet` is 2, the idiom link cannot apply.
In the case of the other 2 sentences, the `'s` is getting separated, which creates the same problem.
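To make the "As yet" arithmetic concrete, here is a minimal Python sketch (not link-grammar code). The slot layout is my reading of the table above, and `slot_of` and `within_length_limit` are hypothetical helper names used only for illustration.

```python
# Hypothetical model of the 2D word array: one list of alternatives per slot.
# The layout is an assumption based on the "As yet" table in this post.
slots = [
    ["LEFT-WALL"],
    ["As", "as", "A.u"],
    ["s.u"],            # second half of the "A.u s.u" split alternative
    ["yet"],
]

def slot_of(token, slots):
    """Return the index of the first slot that contains the token."""
    for index, alternatives in enumerate(slots):
        if token in alternatives:
            return index
    raise ValueError(f"{token!r} not found in any slot")

def within_length_limit(left, right, slots, length_limit):
    """Literal check: the slot distance must not exceed length_limit."""
    return slot_of(right, slots) - slot_of(left, slots) <= length_limit

# The idiom link between "as" and "yet" spans 2 slots, so a limit of 1 rejects it.
idiom_ok = within_length_limit("as", "yet", slots, length_limit=1)
```

The extra slot exists only because of the split alternative, yet it still counts toward the distance in a literal check.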
In addition to this problem, there is another one, as demonstrated by this sentence (not from the LG corpus; constructed for this post):
He explained why this is an "as is" clause.
Here the `"` creates a distance of 2 for the PHc link (defined with length_limit=1).
(BTW, apparently "as is" is not defined as adjective in the dict, so this sentence is unparsable even without the quotation marks.)
However, for Russian `LL*` links there is no such problem.
Maybe also not for `YS` and `YP`, unless we would like to support something like `"ABC"'s`.
Per issue #42, the phrase "as is" should be treated as a noun. That is, any quoted phrase might actually be a noun, and the internal grammatical structure of the phrase should be ignored. That is, the quotation marks form a wall, preventing links from crossing over them.
From issue #42: I said:
Tokenize it as now (separating the quotes) in case the word is used in a grammatical context, and add UNKNOWN-WORD alternative for it (including the quotes). (I'm for (2), because I think that (1) disregards possible info in "word" that may still be interesting.)
And you agreed:
Option 2.
In option 2, we use both possibilities (2 alternatives): "as is" as an UNKNOWN-WORD, and also like now - separating the quotes (but not deleting them).
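A rough Python sketch of what option 2 means for a quoted phrase, just to pin down the two alternatives. This is not the tokenizer's actual code: `quoted_alternatives` and the `"regular"` label are hypothetical, and only `UNKNOWN-WORD` comes from the discussion.

```python
def quoted_alternatives(token):
    """Return tokenization alternatives for one chunk of input text.

    Each alternative is a list of (text, word_class) pairs. The "regular"
    class label is an assumption made for this illustration.
    """
    is_quoted = len(token) > 2 and token[0] == '"' and token[-1] == '"'
    if not is_quoted:
        return [[(token, "regular")]]

    inner_words = token[1:-1].split()
    # Alternative 1: like now, the quotes are separated but not deleted.
    separated = (
        [('"', "regular")]
        + [(word, "regular") for word in inner_words]
        + [('"', "regular")]
    )
    # Alternative 2: the whole phrase, quotes included, as one unknown word.
    whole = [(token, "UNKNOWN-WORD")]
    return [separated, whole]

alternatives = quoted_alternatives('"as is"')
```

With both alternatives present, the parser is free to pick whichever one yields a linkage.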
In any case, when reading it aloud, isn't there an "an" even though it is quoted?
as a noun
Why only as a noun? I proposed UNKNOWN-WORD because it can be e.g. a verb (or another POS).
Sorry, yes, either might work. I'd have to think of examples where the quoted text would be generic unknown word instead of just nouns. But yes; sorry for confusion. All is well :-)
BTW, it is possible to overcome the difficulty that I pointed out, by making a more complex check when applying length_limit:
- First skip optional words (i.e. don't count them toward the length-limit). Need to do that in 3 places:
  - expression prune.
  - prune.
  - fast-matcher.
- Then in sane-morphism (optional words disappear at that time if they were indeed unneeded, and appear if they were required for the linkage) enforce the length-limit literally.
However, this will add some overhead (but maybe not much if a skip table is prepared in advance).
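The two-phase check above could look roughly like this in Python. It is a sketch under my own assumptions, not the library's code: the skip table is a plain prefix-sum array, and `build_skip_table`, `effective_distance`, `may_link`, and `literal_distance` are hypothetical names.

```python
from itertools import accumulate

def build_skip_table(optional):
    """prefix[i] = number of optional slots among slots 0..i-1."""
    return [0] + list(accumulate(1 if flag else 0 for flag in optional))

def effective_distance(left, right, prefix):
    """Slot distance that ignores optional slots strictly between left and right."""
    skipped = prefix[right] - prefix[left + 1]
    return (right - left) - skipped

def may_link(left, right, prefix, length_limit):
    """Phase 1 (pruning / fast-matcher): lenient check that skips optional slots."""
    return effective_distance(left, right, prefix) <= length_limit

def literal_distance(left, right, present):
    """Phase 2 (sane-morphism): count only slots that remain in the linkage."""
    return sum(1 for i in range(left + 1, right + 1) if present[i])

# Assumed slots: LEFT-WALL, as, s.u (optional), yet
optional = [False, False, True, False]
prefix = build_skip_table(optional)

lenient_ok = may_link(1, 3, prefix, length_limit=1)
# If the optional slot turned out to be unneeded, the literal check also passes.
strict_ok = literal_distance(1, 3, [True, True, False, True]) <= 1
```

Since the table is prepared once per sentence, each lenient check costs two array lookups and a subtraction.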
Pull req #764 adds the `QUOTED-WORD` idea from issue #756.