link-grammar
Sentence: bird flu was observed in which countries?
In an unrelated search I encountered page 358 of "Intelligent Information and Database Systems: 8th Asian Conference ..., Part 2".
This conference was in 2016, but according to their benchmark time it seems they used the original CMU version (a common thing), but the problem is the same:
+--------------------------------Xp--------------------------------+
+--------------->WV--------------->+ |
+------>Wd------+ | |
| +--AN--+---Ss--+----Pvf---+---MVp--+-------Jp------+ |
| | | | | | | |
LEFT-WALL bird.n flu.n-u was.v-d observed.v-d in.r [which] countries.n ?
(It turned out they were using another parser.)
On the other hand, "in which countries?" does parse:
Found 1 linkage (1 had no P.P. violations)
Unique linkage, cost vector = (UNUSED=0 DIS= 2.00 LEN=4)
+-------------Xp-------------+
| +------Jp-----+ |
+-->Wj--+-JQ-+---Dmc--+ |
| | | | |
LEFT-WALL in.r which countries.n ?
Here in.r uses its disjunct Wj- & JQ+ & J+ to attach to "which countries", so as a test I tried adding the disjunct MVp- & JQ+ & J+.
It didn't work and the question is why. Fixing this as needed may also be interesting.
Get a more detailed help on a variable as in "!help var".
linkparser> !bad
Display of bad linkages turned on.
linkparser> bird flu was observed in which countries ?
Found 2 linkages (0 had no P.P. violations)
Linkage 1 (bad), cost vector = (UNUSED=0 DIS= 0.20 LEN=12)
"Misuse of preposition13"
+-------------------------------Xp-------------------------------+
+--------------->WV--------------->+ |
+------>Wd------+ | +------Jp-----+ |
| +--AN--+---Ss--+----Pvf---+---MVp--+-JQ-+---Dmc--+ |
| | | | | | | | |
LEFT-WALL bird.n flu.n-u was.v-d observed.v-d in.r which countries.n ?
so there it is: "Misuse of preposition13"
Fixing this requires ... being clever. Usually by finding similar sentences that work, and stealing ideas from those. Simply disabling "Misuse of preposition13" will just increase the number of failures in corpus-basic.
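To see why the full question is rejected while the short one passes, here is a minimal Python sketch of a CONTAINS_ONE-style check. It assumes, for simplicity, that the whole linkage forms a single domain (the real post-processor works per domain), and the matcher is only an approximation of how rule patterns are compared to link names ('#' as a wildcard, extra trailing subscripts ignored). The function names are hypothetical and not part of the LG API.

```python
def pp_match(pattern, link):
    """Approximate rule-pattern match: '#' matches any character, a link
    name that runs out is padded with '*', and trailing characters of the
    link name beyond the pattern are ignored."""
    for i, p in enumerate(pattern):
        c = link[i] if i < len(link) else "*"
        if p != "#" and p != c:
            return False
    return True


def contains_one_violation(links, trigger, required):
    """Return True if some link matches `trigger` but no link matches any
    of the `required` patterns. Treats the linkage as a single domain."""
    if not any(pp_match(trigger, l) for l in links):
        return False
    return not any(pp_match(r, l) for r in required for l in links)


# Link labels read off the two diagrams above.
full_question = ["Xp", "WV", "Wd", "AN", "Ss", "Pvf", "MVp", "JQ", "Jp", "Dmc"]
short_question = ["Xp", "Wj", "JQ", "Jp", "Dmc"]
rule13 = ("JQ", ["Mj", "Wj", "MX#j"])

print(contains_one_violation(full_question, *rule13))   # True
print(contains_one_violation(short_question, *rule13))  # False
```

The short question passes because its Wj link satisfies the rule; the full question has a JQ link but none of Mj, Wj or MX#j.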
Also -- if the proceedings have an e-mail, please do send them an email and remind them that more modern versions exist ...
bird flu was observed where?
bird flu was observed how?
bird flu was observed when?
when was bird flu observed?
In which countries was bird flu observed?
The first three fail completely; the last two work fine. The first three are "inverted questions". Note how the last two use SI (inverted subject), which suggests that the first three need a new kind of link, maybe "QP" for "inverted question". Something like this:
----->+
| +---
Pvf---+---QP---+-JQ
| |
observed.v-d in.r wh
But then you have to invent something to prevent QP from being used to parse "I saw in which room". Hmmmm. See https://www.abisource.com/projects/link-grammar/dict/section-JQ.html
Oh, OK, so then Pvf- & QP+ would work, it seems. That's because Pv is used for "was verbed" constructions, which are valid for inverted questions, but would not allow "I saw in which". To make it even tighter, use Pvf- & (WV- or CV-) & QP+ so that the participle must be identified as the head-verb.
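As a rough illustration of what the tighter expression allows, the sketch below expands an and/or expression into its disjuncts. The nested-tuple representation and the function name are assumptions made for this example, and QP is only the hypothetical link proposed above.

```python
from itertools import product


def expand(expr):
    """Expand a nested ("and"|"or", ...) expression into a list of
    disjuncts, each a list of connector strings in order."""
    if isinstance(expr, str):
        return [[expr]]
    op, *args = expr
    parts = [expand(a) for a in args]
    if op == "or":
        # Any alternative may be chosen: concatenate the lists of disjuncts.
        return [d for p in parts for d in p]
    if op == "and":
        # One choice from every operand, joined in order.
        return [sum(combo, []) for combo in product(*parts)]
    raise ValueError(f"unknown operator: {op}")


# Pvf- & (WV- or CV-) & QP+
tight = ("and", "Pvf-", ("or", "WV-", "CV-"), "QP+")
for disjunct in expand(tight):
    print(" & ".join(disjunct))
# Pvf- & WV- & QP+
# Pvf- & CV- & QP+
```

Both resulting disjuncts require Pvf- together with a head-verb link, which is the restriction described above.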
Maybe instead of inventing a new link QP, there is some existing link we can reuse. Not sure, would have to review the documentation. It's likely that a new link might be needed, since questions are ... very different from normal sentences, and also LG is weaker with questions.
(above comment edited)
LG is weaker with questions
This is a pity, since people try to use it for decoding queries.
Also -- if the proceedings have an e-mail, please do send them an email and remind them that more modern versions exist ...
BTW, about 2 weeks ago I sent a letter on a similar thing to Prof. Ahn, who very recently (Oct 2019) published this paper on a system in which LG is used: A Function as a Service Based Fog Robotic System for Cognitive Robots. (No answer yet.)
It's a pity.
Do you want to try fixing it, or should I?
Do you want to try fixing it, or should I?
I tried just to add MVp in the Misuse of preposition13 rule and at first glance it looks fine:
--- a/data/en/4.0.dict
+++ b/data/en/4.0.dict
in.r:
<alter-preps>
or ({JQ+} & (J+ or Mgp+ or IN+) & (<prep-main-a> or FM-))
or K-
or (EN- & (Pp- or J-))
or <locative>
or [MVp- & B-]
or (MG- & JG+)
- or <null-prep-qu>;
+ or <null-prep-qu>
+ or (MVp- & JQ+ & J+);
--- a/data/en/4.0.knowledge
+++ b/data/en/4.0.knowledge
- JQ , Mj Wj MX#j , "Misuse of preposition13" ,
+ JQ , Mj Wj MX#j MVp , "Misuse of preposition13" ,
It didn't change the number of errors in corpus-basic.
In corpus-fixes it reduced the number of errors from 379 to 373, as these sentences are now parsed:
Sophy wondered up to what number she should count
Sophy wondered up to what number to count
Sophy wondered up to what number to count to
Sophy wondered up to whose favorite number she should count
Sophy wondered up to whose favorite number to count
Sophy wondered up to whose favorite number to count to
Since they don't include in.r, this is only due to the addition of MVp in the said PP rule.
Summary of errors by corpus:
corpus | now | patched | diff | linkage-limit |
---|---|---|---|---|
basic | 82 | 82 | 0 | 1000 |
fixes | 379 | 373 | -6 | 1000 |
fix-long | 9 | 9 | 0 | 10000 |
failures | 1556 | 1555 | -1 | 1000 |
pandp-union | 2016 | 2007 | -9 | 1000 |
pandp-union | 1998 | 1990 | -8 | 30000 |
With the long-sentences batches I just tried -limit=30000. The pandp-union corpus processing then takes much time and maybe a lower value would be enough (I have more to say about that...).
The difference between the number of "fixed" sentences in pandp-union seems to be due to a different number of "combinatorial explosions" due to the changed rules (but I'm not sure - we can find the differing sentences and investigate them).
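One way to find the differing sentences is to compare the sets of failing sentences from the two runs. A minimal sketch, assuming each run's failures have already been collected into a list of sentence strings (how they are extracted from the batch output is left out, and the sample data is made up):

```python
def compare_runs(failed_before, failed_after):
    """Return (newly_parsed, newly_failing) between two corpus runs."""
    before, after = set(failed_before), set(failed_after)
    return sorted(before - after), sorted(after - before)


# Illustrative data only, not taken from the real corpora.
before = ["sentence A", "sentence B", "sentence C"]
after = ["sentence B", "sentence D"]

newly_parsed, newly_failing = compare_runs(before, after)
print(newly_parsed)   # ['sentence A', 'sentence C']
print(newly_failing)  # ['sentence D']
```

Running this once per linkage limit would show whether the -9 versus -8 difference comes from the same sentences or from different ones.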
So based on these checks maybe this change is fine. However, I guess you will want to investigate:
- The reason for the unexpected fixes in corpus-fixes.
- Some additional correct sentences that didn't parse before.
- Some additional wrong sentences that didn't parse before (as needed), to validate that the proposed patch doesn't cause them to parse.
Minor editing of my previous message (table diff value + missing open parenthesis).
bird flu was observed where? bird flu was observed how? bird flu was observed when?
I tried to fix them by brute force, by adding what seems to be a missing QI+ when Pvf- is present, as hinted by:
linkparser>
Linkage 2, cost vector = (UNUSED=1 DIS= 0.20 LEN=9)
+------------------------Xp-----------------------+
+--------------->WV--------------->+ |
+------>Wd------+ | |
| +--AN--+---Ss--+----Pvf---+ |
| | | | | |
LEFT-WALL bird.n flu.n-u was.v-d observed.v-d [where] ?
Press RETURN for the next linkage.
linkparser>
Linkage 3, cost vector = (UNUSED=1 DIS= 1.10 LEN=10)
+----------------------Xp---------------------+
+-------------->WV-------------->+ |
+------>Wd------+ | |
| +--AN--+-------Ss-------+---QI---+ |
| | | | | |
LEFT-WALL bird.n flu.n-u [was] observed.v-d where ?
Instead of just adding QI+, I added the macro in which it resides. I have no idea if this is better.
predicted.v-d realized.v-d discovered.v-d determined.v-d announced.v-d
mentioned.v-d admitted.v-d recalled.v-d revealed.v-d divulged.v-d
stated.v-d observed.v-d indicated.v-d stammered.v-d bawled.v-d
analysed.v-d analyzed.v-d
assessed.v-d established.v-d evaluated.v-d examined.v-d questioned.v-d
tested.v-d hypothesized.v-d hypothesised.v-d well-established.v-d
envisaged.v-d documented.v-d:
((<verb-sp,pp> & (<vc-predict>)) or
(<verb-and-sp-i-> & ([<vc-predict>]0.2 or ())) or
((<vc-predict>) & <verb-and-sp-i+>) or
<verb-and-sp-t>)
- or (<verb-s-pv> & {THi+})
+ or (<verb-s-pv> & ({THi+} or <vc-predict>))
or <verb-adj>
or <verb-phrase-opener>;
The result is that these sentences get parsed, with no additional errors in the 5 tested corpus batches. E.g.:
+-----------------------Xp----------------------+
+--------------->WV--------------->+ |
+------>Wd------+ | |
| +--AN--+---Ss--+----Pvf---+---QI---+ |
| | | | | | |
LEFT-WALL bird.n flu.n-u was.v-d observed.v-d where ?
Supposing this is correct (I don't know), then still:
- There is a need to justify adding the whole <vc-predict> and not just part of it.
- There is a need to think of examples that may break this addition.
- Maybe it is needed for other verbs too, so this change has to be done in one (or more) of the verb macros.
I've been hacking on this; look at my branch "qi". I have not tested for regressions.
Regarding MVp, the page https://www.abisource.com/projects/link-grammar/dict/section-JQ.html gives the example: "*I saw in which room"
So pull req #1051 fixes this but I did not measure regressions. I'm also contemplating changing Misuse of preposition14 so that "You slept with who?" will parse.
And .. in the finest of traditions, the changes to the dict mean that all run-times are now slower by 10% or 20% or something like that ... all of your performance tuning gets blown away by some fairly minor dict changes that one might think would not matter.
Perhaps it's wrong to think of them as "minor" -- {QI+} is now and-ed with lots of common verbs: did, said, and many many others. The total number of expressions is significantly larger, the total number of disjuncts is larger. .. It would be interesting to look at these totals, and the distributions of them, for typical dictionaries, over time.
It would be much simpler, and also be interesting to see how dictionaries from different eras compare on performance, on the current parser.
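To make the growth in disjuncts concrete, here is a small counting sketch. It assumes a toy nested-tuple representation of expressions (not the LG data structures) and ignores costs and duplicate elimination, so it only gives an upper bound; the verb expression is made up for illustration.

```python
from math import prod


def count_disjuncts(expr):
    """Count disjuncts of a nested expression.
    Strings are connectors; ("and", ...) multiplies, ("or", ...) adds,
    and ("opt", x) stands for {x}, i.e. (x or ())."""
    if isinstance(expr, str):
        return 1
    op, *args = expr
    if op == "and":
        return prod(count_disjuncts(a) for a in args)
    if op == "or":
        return sum(count_disjuncts(a) for a in args)
    if op == "opt":
        return count_disjuncts(args[0]) + 1
    raise ValueError(f"unknown operator: {op}")


# A made-up verb expression with 3 disjuncts ...
verb = ("and", "S-", ("or", "O+", "MVp+", "Pa+"))
# ... and the same expression with an optional {QI+} and-ed on.
verb_qi = ("and", verb, ("opt", "QI+"))

print(count_disjuncts(verb))     # 3
print(count_disjuncts(verb_qi))  # 6
```

And-ing a single optional connector doubles the count, which is why such a change on many common verbs is not really minor.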
I remeasured performance, correctly, this time; the performance hit is minor
I fetched your "qi" branch and made some tests.
Regarding MVp, the page https://www.abisource.com/projects/link-grammar/dict/section-JQ.html gives the example: "*I saw in which room"
The problem with my (and your) fix to "bird flu was observed in which countries?" is that now the said example "*I saw in which room" does parse.
It seems to me that the root of the problem is that in the fix we treat "was observed" as a passive participle, i.e. a verb, and then there is no way to distinguish the different cases (as "saw" is a verb too).
So I propose instead that the role of "was observed" in that sentence is "predicate adjective", and in this role it should use Pa & JQ & J+.
I.e. something is predicated and on that basis we ask a question: where, when, in which countries, etc.
This way in the Misuse of preposition13 rule we can require Pa instead of MVp, and this Pa should also be added to Misuse of preposition14.
This proposal doesn't handle the "up to" sentences, so they remain unfixed. I think their fix is different, so we can discuss it later (unless it seems to you related).
To check this proposal I made these changes:
--- a/data/en/4.0.dict
+++ b/data/en/4.0.dict
predicted.v-d realized.v-d discovered.v-d determined.v-d announced.v-d
...
((<verb-sp,pp> & (<vc-predict>)) or
(<verb-and-sp-i-> & ([<vc-predict>]0.2 or ())) or
((<vc-predict>) & <verb-and-sp-i+>) or
<verb-and-sp-t>)
or (<verb-s-pv> & {THi+})
+ or (Pa- & (MVp+ or <vc-predict>))
or <verb-adj>
or <verb-phrase-opener>;
in.r:
<alter-preps>
or ({JQ+} & (J+ or Mgp+ or IN+) & (<prep-main-a> or FM-))
or K-
or (EN- & (Pp- or J-))
or <locative>
or [MVp- & B-]
or (MG- & JG+)
- or <null-prep-qu>;
+ or <null-prep-qu>
+ or (MVp- & JQ+ & J+);
--- a/data/en/4.0.knowledge
+++ b/data/en/4.0.knowledge
- JQ , Mj Wj MX#j , "Misuse of preposition13" ,
- Jw , Mj Wj MX#j , "Misuse of preposition14" ,
+ JQ , Mj Wj MX#j Pa , "Misuse of preposition13" ,
+ Jw , Mj Wj MX#j Pa , "Misuse of preposition14" ,
Results:
...
+-------------------------------Xp-------------------------------+
+---------->WV--------->+ |
+------>Wd------+ | +------Jp-----+ |
| +--AN--+---Ss--+----Pa----+---MVp--+-JQ-+---Dmc--+ |
| | | | | | | | |
LEFT-WALL bird.n flu.n-u was.v-d observed.v-d in.r which countries.n ?
...
+-----------------------Xp----------------------+
+---------->WV--------->+ |
+------>Wd------+ | |
| +--AN--+---Ss--+----Pa----+---QI---+ |
| | | | | | |
LEFT-WALL bird.n flu.n-u was.v-d observed.v-d where ?
And, as needed, "*I saw in which room" doesn't parse:
+---->WV--->+
+->Wd--+Sp*i+-MVp-+------Ju------+
| | | | |
LEFT-WALL I.p saw.w in.r [which] room.n-u
...
!bad
...
"Misuse of preposition13"
+---->WV--->+ +-----Js----+
+->Wd--+Sp*i+-MVp-+-JQ-+-Ds**c+
| | | | | |
LEFT-WALL I.p saw.w in.r which room.s
Corpus error count:
corpus | now | patched | diff | linkage-limit |
---|---|---|---|---|
basic | 82 | 82 | 0 | 1000 |
fixes | 379 | 379 | 0 | 1000 |
fix-long | 9 | 9 | 0 | 10000 |
failures | 1556 | 1554 | -2 | 1000 |
pandp-union | 2016 | 2011 | -5 | 1000 |
slower by 10% or 20% or something like that
Can it be that you tested it on intermediate changes? For me the slowdown of your "qi" branch is only a few percent at most. In any case I have a WIP on improving expression handling and also pruning (both expression and power) so this may allow increasing the dict complexity without much more overhead.
We can also look at that from another angle: Improving the library speed will allow a much more complex dict without being too sluggish.
I remeasured performance, correctly, this time; the performance hit is minor
Only now I see that you addressed that by now...
After you applied PR #1051, we get:
linkparser> Sophy wondered up to what number to count to
Found 28 linkages (28 had no P.P. violations)
Linkage 1, cost vector = (UNUSED=0 DIS= 6.00 LEN=14)
+-------->WV------->+-----MVp-----+-----J-----+---------B---------+
+-->Wd---+---Ss*s---+---MVa--+ +-JQ-+-Ds**c+---R--+--I--+--MVp-+
| | | | | | | | | |
LEFT-WALL Sophy.f wondered.v-d up.e to.r what number.n to.r count.v to.r
Among other things, this seems to me wrong:
Ss*s---+---MVa--+
| |
wondered.v-d up.e
Isn't "up" a modifier of "to" and not "wondered"?
Compare the symmetric sentence in the context of reverse counting:
Sophy wondered down to what number to count to
Clearly "down" here is not a verb modifier.
Can the problem be solved by attaching "up" etc. to "to" using Mj?
BTW, I also don't think 'up to' here is an idiom, because instead of "up" I can think of some other words (down, approximately, nearly, exactly).
Compare to that:
linkparser> Sophy wondered right to which one she should stand on the stage
Found 218 linkages (8 had no P.P. violations)
Linkage 1, cost vector = (UNUSED=0 DIS= 0.53 LEN=34)
+-------------------------MVp-------------------------+
| +--------------------Mp--------------------+
| | +-------------CV------------>+ |
| | +------Cs-----+ | |
+------->CPx--------+ | +----Js---+ | | +---Js---+
+-->Wa---+ +---SIsj---+---Mj--+-JQ-+-Ds-+ +--Ss--+---I---+ | +Ds**c+
| | | | | | | | | | | | |
LEFT-WALL Sophy.f wondered.q-d right.n-u to.r which one she should.v stand.v on the stage.n
I saw in which room
This is actually ambiguous. On the surface, it seems like an absurd sentence, but it's a plausible reply to the question: "Did you see in which room they held bingo night?" Anyway, your proposal can be simplified to:
--- a/data/en/4.0.knowledge
+++ b/data/en/4.0.knowledge
@@ -217,8 +217,8 @@ CONTAINS_ONE_RULES:
Mj , Jw JQ , "Incorrect relative10" ,
MX#j , Jw JQ , "Incorrect relative11" ,
Wj , Jw JQ , "Misuse of preposition12" ,
- JQ , Mj Wj MX#j MVp , "Misuse of preposition13" ,
- Jw , Mj Wj MX#j , "Misuse of preposition14" ,
+ JQ , Mj Wj MX#j Pv , "Misuse of preposition13" ,
+ Jw , Mj Wj MX#j Pv , "Misuse of preposition14" ,
B#j , Jr , "Incorrect relative15" ,
Jr , B#j , "Incorrect relative16" ,
; The two below prevent "How big?" and "How quickly?"
Also, yes, the Sophy sentences are broken.
I think "up to" could be an idiom here, because:
*Sophy wondered exactly to what number to count to
Sophy wondered exactly what number to count to
How high did it go?
Up to what mark did it reach?
Exactly what mark did it reach?
up to where did it go?
Up to how many gallons were lost?
down to which floor did it drop?
down to what depravities did he sink?
The above are easily fixed by up_to down_to: EW+;
A proper fix for the others requires link-crossing. This is best illustrated by pondering the sentence: "Sophy wondered [up to] whose favorite number she should count to" and then realizing that [up to] needs to modify "number" not "whose". Unfortunately, this is not possible without link-crossing.
There is a work-around for link-crossing, but it is hacky: I did it once, here: see Jj and Jk at bottom of page at https://www.abisource.com/projects/link-grammar/dict/section-J.html
Doing such a hack in the dozen-plus cases where it is needed is painful and ugly. I would rather be able to say "link X can cross link Y or Z once". I don't think two crossings are ever needed. I don't think that allowing anything to cross anything is generally allowed. The README has accumulated a bunch of these...
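For reference, the planarity condition being worked around is simple to state: two links cross when their endpoints interleave. A small sketch, with links given as pairs of word indices (the positions in the examples are invented, and the function names are hypothetical):

```python
def crosses(a, b):
    """True if links a and b (pairs of word indices) cross each other."""
    (l1, r1), (l2, r2) = sorted(a), sorted(b)
    return l1 < l2 < r1 < r2 or l2 < l1 < r2 < r1


def crossing_pairs(links):
    """List every pair of links in a linkage that cross."""
    return [(a, b) for i, a in enumerate(links)
            for b in links[i + 1:] if crosses(a, b)]


print(crosses((0, 3), (1, 2)))  # False: nested
print(crosses((0, 2), (1, 3)))  # True: interleaved
print(crosses((0, 2), (2, 4)))  # False: shared endpoint
```

A rule such as "link X can cross link Y once" would amount to tolerating a bounded number of entries in the list returned by crossing_pairs for the named link types.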
I would rather be able to say "link X can cross link Y or Z once".
Will it be fine to do the hack automatically on dict read according to such definitions?
Will it be fine to do the hack automatically on dict read according to such definitions?
Hmm. That's an interesting idea. Yeah, maybe I like it. So we need several things:
1. Some way to write down "link X can cross link Y" in the dictionary.
2. By analogy to Jj and Jk, your hack to auto-generate Xj and Xk and then auto-add Xj- & Y & Xk+
Yeah, I like that. The tricky part to 2 is to put the subscripts in a slot which is unused. Maybe we could put them in the "first" slot, like h and d for head/dependent, except they're cross-from-left and cross-from-right, so maybe l and r, and the ascii diagrammer could use parens to print them! Like so:
+------------------+
| +--)|(--------+
| | | |
He had been allowed to eat a cake by Sophy that she had
so the parens make a little "tunnel" where the link crosses, and the logic for that would be just like drawing the arrow-heads for h and d arrows. So yeah, that seems slick...
To be clear: you would auto-add lX- & Y & rX+ ...it might even be possible to do this with an m4 macro hack. Ugh.
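A sketch of what auto-adding at dictionary-read time might look like, purely as string manipulation. The l/r prefixes follow the tentative notation above; the function name and the idea of passing the crossed part as a string are assumptions for illustration, not existing LG behaviour.

```python
def bypass_expression(cross_link, crossed_expr):
    """Build 'lX- & Y & rX+' for a link X allowed to cross expression Y."""
    return f"l{cross_link}- & {crossed_expr} & r{cross_link}+"


print(bypass_expression("X", "Y"))
# lX- & Y & rX+
print(bypass_expression("Js", "VJlpi- & VJrpi+"))
# lJs- & VJlpi- & VJrpi+ & rJs+
```

A real implementation would work on the parsed expression tree rather than on text, and would have to decide where in the connector ordering the two halves belong.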
The example in https://www.abisource.com/projects/link-grammar/dict/section-J.html is actually complex, because, there, the J link crosses two others: it crosses both the I and the VJlpi links...
So for this example, Yikes... it's yucky. In the current dict, it's I- & Jj- & VJlpi- & VJrpi+ & Jk+ and so it's not obvious that J is crossing I and VJlpi but is not crossing VJrpi+ ... what a mess.
The render would be
+-------I---------+
| +--VJlpi----+
| | +-Js--)|(----------Js----------+
| +-MVp+ +--VJrpi-+--MVp-+---Js--+
| | | | | | |
... to.r look.v at and.j-v listen.v to.r everything
and the notation would be I- & lJs- & VJlpi- & VJrpi+ & rJs+
Instead of l and r maybe s and r because l and 1 and I all look alike too much. s is for Latin sinister.
Or p and q as a visually mirror-symmetric pair. Or w and v. Or e and a.
Hmm. Except for p and q, it appears that the Latin alphabet was explicitly designed to avoid mirror-symmetric letters. Interesting. This is also the case for Cyrillic and Greek. ... interesting ...
When I look at !!and-j-v, I see several other VJlx- & VJrx+ constructs, and even one very similar to the one that has the Jj Jk device: ({Xd-} & hVJlpi- & {N+} & {TO+} & hVJrpi+).
Why don't they also need this device, especially the last one, which also includes hVJlpi- & hVJrpi+?
Another question: I don't like the complication of using the UC front position. Is there something bad in using a bool mark in the Connector struct?
The Jj-Jk device is "recent" (well, OK maybe over a year old now) and is used in only one place (OK, now maybe two), and was created as an experiment to see how well it works (how convenient or confusing it is, how much trouble it causes vs. how much trouble it saves...). It was never deployed on a wide-scale basis. The post above (https://github.com/opencog/link-grammar/issues/1050#issuecomment-557770152) is the newest/best way I can think of for making it fully generic and "obvious".
the UC front position
I don't understand the question. The goal of fronting UC is to have a notation in 4.0.dict to indicate that "opposites connect". Maybe this could be moved so that it comes after the +/- connector-dir. Or allows additional symbols besides +/- ...
The goal of fronting UC is to have a notation in 4.0.dict to indicate that "opposites connect".
What is special about their matching rules?
For me it seems as if regular Js- and Js+ connectors are fine, and only the code that draws the diagram needs to know they denote a cross-link (and hence my suggested bool mark).
and only the code that draws the diagram needs to know
And how will it know this? it's not just J that might cross, it could be .. A or S or a dozen others.
And how will it know this? it's not just J that might cross, it could be .. A or S or a dozen others.
My idea is that these connectors (in your example A, S and others) that serve as "bypass" connectors will be marked in their connector struct. Questions: Why is there any need to explicitly make these marks in the connector string? Is there any code, besides the diagram drawing code, that needs to be aware that there is anything special here?
in their connector struct.
I'm not concerned with how they are handled in the C code. Finding a good representation for 4.0.dict is my primary concern.
Is there any code that needs to be aware
Presumably, "most" applications of LG are interested in the dependency diagram, in the abstract, as a graph, and now, as a non-planar graph. So there will need to be a step that says "aha, here's something that in LG looks like two links, but it's really only just one." The app itself could figure that out, or we could provide that extra step ourselves, in the LG api. In addition, the app might want to know which links cross.
The only real problem with this is that there are very few, approaching zero apps of LG, at least, that are public, that anyone talks about. Every now and then I get hints of proprietary apps, but they never seem heavily vested. So all this is very hypothetical.
Yes, other parts of opencog use LG, but ... not very well, not very robustly, not very deeply.
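The extra step mentioned above, turning two LG links that are really one into a single graph edge, could be sketched as follows. This assumes the tentative l/r prefix notation from earlier in the thread and a plain tuple representation of links; neither is existing LG API, and the word indices are invented.

```python
def merge_bypass_links(links):
    """Merge pairs (a, m, 'lX') and (m, b, 'rX') into one link (a, b, 'X').
    Links are (left_word, right_word, label); others pass through unchanged."""
    # Index the left halves by (word where the halves meet, base label).
    left_halves = {(r, lab[1:]): (l, r, lab)
                   for l, r, lab in links if lab.startswith("l")}
    merged, used = [], set()
    for l, r, lab in links:
        if lab.startswith("r") and (l, lab[1:]) in left_halves:
            half = left_halves[(l, lab[1:])]
            merged.append((half[0], r, lab[1:]))
            used.update({half, (l, r, lab)})
    return [k for k in links if k not in used] + merged


# Hypothetical linkage fragment: both halves meet at word 3.
links = [(1, 2, "MVp"), (2, 3, "lJs"), (3, 6, "rJs")]
print(merge_bypass_links(links))
# [(1, 2, 'MVp'), (2, 6, 'Js')]
```

The merged edge (2, 6, 'Js') is the one an application would see in the non-planar dependency graph; unmatched halves are left untouched so nothing is silently dropped.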
I'm not concerned with how they are handled in the C code. Finding a good representation for 4.0.dict is my primary concern.
But the whole idea is that the special "bypass" connectors don't appear at all in 4.0.dict, as they are only inserted later by the LG library code. So how can they be represented there?
What is to be represented there is something like:
<XLINK>: Js+ & VJrpi-; % Connection from Js may cross the rest of connectors.
(For now it seems to me there is no need to also specify the less deep connectors that it would cross, unless this may lead to incorrect parses. I also don't know if an exact connector match should be done for VJrpi- or an "easy-match".)
If you mean their representation when printing the actual expression that is used (or its disjuncts), then from a programming standpoint it really doesn't matter in which representation they are displayed, and indeed the most convenient representation should be used.
and the notation would be I- & lJs- & VJlpi- & VJrpi+ & rJs+ Instead of l and r maybe s and r because l and 1 and I all look alike too much. s is for Latin sinister. Or p and q as a visually mirror-symmetric pair. Or w and v . Or e and a
I still didn't understand if the LG library code should make any special interpretation of these leading LC letters (supposing it already knows these are "bypass" connectors - after all the LG library code knows what it added). For now it seems to me these special letters don't play any role in the connector matching algorithm (unlike h/d and the rest of the letters in the connector string), and even not in the drawing algorithm (since it is already known which connectors are the "bypass" ones).
I guess I'm not clear enough in my questions and proposals, or that I didn't understand something (or both). I will try to make a real implementation and see if it works fine, but still answers to the above would help.
What also would help me are additional diagrams of the desired results. E.g. for some of the "Sophy" sentences (only links from words that have cross-links are needed).