
A fundamental flaw in length_limit=1

Open ampli opened this issue 6 years ago • 5 comments

I tried to add ID* to the length_limit of 1 (after first allowing this usage of ID, of course). I did this in order to check whether it can speed up parsing (possibly by saving connector comparisons, because the length_limit is checked first anyway).

To my surprise, the following sentences then failed to parse:

As yet, no one has thought of a solution.
What in God's name happened?
What in Lord's name is going on here?

The problem has to do with alternatives. In the case of "As yet", here are the 2D-array slots:

    0          1     2     3 ...
    LEFT-WALL  As          yet
               as
               A.u   s.u

and since the distance between as and yet is 2, the idiom link cannot apply.
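The slot layout described above can be modelled in a few lines. This is only a sketch: `slots`, `slot_of` and `link_allowed` are hypothetical names made up for illustration and are not link-grammar's actual data structures or API. It assumes that the split alternative `A.u` + `s.u` occupies slots 1 and 2, which is what pushes `yet` to slot 3.

```python
# Hypothetical model of the word-slot array; not link-grammar's real structures.
LENGTH_LIMIT = 1  # length_limit=1: linked tokens must sit in adjacent slots

# Each slot lists the alternative tokens that may occupy that position.
slots = [
    ["LEFT-WALL"],         # slot 0
    ["As", "as", "A.u"],   # slot 1: the word itself, its lowercase form, first half of the split
    ["s.u"],               # slot 2: only used by the split alternative
    ["yet"],               # slot 3
]

def slot_of(token):
    """Return the index of the slot that contains the given token."""
    for index, alternatives in enumerate(slots):
        if token in alternatives:
            return index
    raise ValueError(f"unknown token: {token}")

def link_allowed(left, right, limit=LENGTH_LIMIT):
    """A link is allowed only if the slot distance does not exceed the limit."""
    return abs(slot_of(right) - slot_of(left)) <= limit

print(link_allowed("as", "yet"))   # False: slots 1 and 3 are 2 apart
print(link_allowed("A.u", "s.u"))  # True: slots 1 and 2 are adjacent
```

The idiom link between `as` and `yet` is rejected purely because of the empty slot that the other alternative needs.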

In the case of the other 2 sentences, the 's is getting separated, which creates the same problem.

In addition to this problem, there is another one, as demonstrated by this sentence (not from the LG corpus; constructed for this post): He explained why this is an "as is" clause. Here the " creates a distance of 2 for the PHc link (defined with length_limit=1). (BTW, apparently "as is" is not defined as an adjective in the dict, so this sentence is unparsable even without the quotation marks.)

However, for the Russian LL* links there is no such problem. Maybe also not for YS and YP, unless we would like to support something like "ABC"'s.

ampli avatar Mar 13 '18 20:03 ampli

Per issue #42, the phrase "as is" should be treated as a noun. That is, any quoted phrase might actually be a noun, and the internal grammatical structure of the phrase should be ignored. That is, the quotation marks form a wall, preventing links from crossing over them.

linas avatar Mar 13 '18 21:03 linas

From issue #42: I said:

Tokenize it as now (separating the quotes) in case the word is used in a grammatical context, and add UNKNOWN-WORD alternative for it (including the quotes). (I'm for (2), because I think that (1) disregards possible info in "word" that may still be interesting.)

And you agreed:

Option 2.

In option 2, we use both possibilities (2 alternatives): "as is" as an UNKNOWN-WORD, and also like now - separating the quotes (but not deleting them).
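As a rough illustration of the two alternatives, here is a minimal sketch. The function `quoted_alternatives` and the dictionary keys are hypothetical and exist only for this example; the real tokenizer works differently and on whole sentences.

```python
def quoted_alternatives(text):
    """Return both tokenizations of a double-quoted span (hypothetical helper).

    "separated":    the quotes are split off but kept, as is done now.
    "unknown_word": the whole span, quotes included, as a single token that
                    would be looked up as UNKNOWN-WORD.
    """
    if len(text) < 2 or not (text.startswith('"') and text.endswith('"')):
        raise ValueError("expected a double-quoted span")
    inner = text[1:-1].split()
    return {
        "separated": ['"'] + inner + ['"'],
        "unknown_word": [text],
    }

alternatives = quoted_alternatives('"as is"')
print(alternatives["separated"])     # ['"', 'as', 'is', '"']
print(alternatives["unknown_word"])  # ['"as is"']
```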

In any case, when reading it aloud, isn't there an "an" even though it is quoted?

as a noun

Why only as a noun? I proposed UNKNOWN-WORD because it can be e.g. a verb (or another POS).

ampli avatar Mar 13 '18 22:03 ampli

Sorry, yes, either might work. I'd have to think of examples where the quoted text would be generic unknown word instead of just nouns. But yes; sorry for confusion. All is well :-)

linas avatar Mar 13 '18 22:03 linas

BTW, it is possible to overcome the difficulty that I pointed out, by making a more complex check when applying length_limit:

  • First skip optional words (i.e. don't count them toward the length-limit). Need to do that in 3 places:
      -- expression prune
      -- prune
      -- fast-matcher
  • Then in sane-morphism (optional words disappear at that time if they were indeed unneeded, and appear if they were required for the linkage) enforce the length-limit literally.

However, this will add some overhead (but maybe not much if a skip table is prepared in advance).
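A skip table of that kind could look roughly like the following sketch. It covers only the first step (not counting optional words toward the distance); the later literal check in sane-morphism is not modelled. The names `build_skip_table` and `effective_distance` are hypothetical, and treating slot 2 of the "As yet" example as optional is an assumption made for the illustration.

```python
from itertools import accumulate

def build_skip_table(optional):
    """prefix[i] = number of optional slots among slots 0 .. i-1."""
    return [0] + list(accumulate(int(flag) for flag in optional))

def effective_distance(prefix, left, right):
    """Slot distance from left to right, ignoring optional slots strictly between them."""
    skipped = prefix[right] - prefix[left + 1]
    return (right - left) - skipped

# Slots of the "As yet" example: LEFT-WALL, as / A.u, s.u, yet.
# Slot 2 is assumed optional: it is empty when the unsplit alternative is used.
optional = [False, False, True, False]
prefix = build_skip_table(optional)

print(effective_distance(prefix, 1, 3))  # 1: the optional slot 2 is not counted
print(effective_distance(prefix, 0, 3))  # 2: slot 1 still counts, slot 2 does not
```

With the table built once per sentence, each length-limit check costs two lookups and a subtraction, which is why the added overhead may be small.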

ampli avatar Mar 13 '18 23:03 ampli

Pull req #764 adds the QUOTED-WORD idea from issue #756

linas avatar Apr 27 '18 00:04 linas