link-grammar
Recent 5x multiplication of verb disjuncts
From 5.7.0:
linkparser> !!test.v
Token "test.v" matches:
test.v 18101 disjuncts
Now (248a6133d8793528f6376cd26baa3b9733001bf6):
linkparser> !!test.v
Token "test.v" matches:
test.v 93621 disjuncts
This also seems to cause a noticeable speed regression.
Yeah. I'm guessing that this is due to the recent addition of (Wi- & Xc+ & SI*i+ & {Xc+}) to <verb-ico>, which is a part of <verb-pl,i>, which is used in the definition of VERB_PLI, which is of the form <verb-pl,i> & ($1), where $1 is some verb definition, e.g. for test.v it's:
<vc-predict>:
(<vc-trans> & {{Xc+} & QI+})
or ({@MV+} & (<embed-verb> or TH+ or RSe+ or Zs- or VC+))
or ({@MV+} & (<QI+pref> & {MV+}));
So test.v used to have the above, anded with whatever the old <verb-ico> used to be. Now it's the old stuff, and in addition, the above anded with (Wi- & Xc+ & SI*i+ & {Xc+}). I think I also added (Xd- & (EI- or S**i-)), which explodes another factor of two.
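To make the arithmetic concrete, here is a minimal stand-alone sketch in C (not link-grammar code; the numbers are invented only to roughly mimic the reported counts): under expression expansion, "or" alternatives add and "&" factors multiply, so every alternative added to <verb-ico> gets paired with every alternative of the verb-specific part.

#include <stdio.h>

int main(void)
{
    /* Illustrative counts only, not the real dictionary numbers. */
    long verb_pli_old = 4;    /* alternatives the old <verb-pl,i> expands to    */
    long verb_pli_added = 16; /* extra alternatives from the new additions      */
    long dollar1 = 4500;      /* alternatives the verb-specific part expands to */

    /* VERB_PLI is (<verb-pl,i> & ($1)), so its disjunct count is a product:
     * each <verb-pl,i> alternative is combined with each $1 alternative.  */
    printf("old: %ld disjuncts\n", verb_pli_old * dollar1);
    printf("new: %ld disjuncts\n", (verb_pli_old + verb_pli_added) * dollar1);
    return 0;
}

Widening one factor by a few alternatives therefore scales the total for every verb that uses VERB_PLI.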
There's a combinatoric explosion, there's no particularly obvious way to limit it.
That's one reason why machine learning is interesting.
It might be interesting to parse a jizzilion sentences, and see which disjuncts never show up. But how to remove them is unclear: although some disjunct might never show up for one word, it might show up for another similar one, and then what? It's a morass.
This is not the first time it happened -- I'm pretty sure the dicts from a few years ago are a lot faster...
I produced the disjuncts of all the unique dict expressions of words. There are ~170M disjuncts in this disjunct list. Many of them don't make sense to me. For example:
- Long connector lists (~3M are >6 connectors; there are up to 13).
- Repeating sequential connectors in a connector list. Especially strange is @E @E.
- Some words have a very large number of disjuncts (were.v-d maybe has the most, ~450K).
Examples:
From the parses of the fixes batch, only one connector list of >5 connectors has been observed, on sentences like:
+------------------------Xp------------------------+
+-------------------->WV-------------------->+ |
+--------------Xp--------------+ | |
+----------->WV---------->+ | | |
| +------Bsw------+ | | |
+--->Wq---+ +---Pg*b---+ +->Wd--+ | |
+-ZZZ-+ +-Rw-+SIpx+ | +ZZZ+ +--Ss--+ |
| | | | | | | | | | |
LEFT-WALL " what are.v you doing.v ? " she asked.v-d .
LEFT-WALL 2.000 ZZZ+ hWq+ hWV+ Xp+ hWV+ Xp+
There are ~700K jets with 2-3 duplicate sequential connectors, like:
somebody: [21](0.000) <--> Ss*s @MXs @MXs @M EL
is.v: [7545](0.000) dWV dWV Qw Iq Xd <--> THb O*m SFIs VC
is.v: [11305](0.000) dCV dCV Qw Iq Xd <--> THb O*m SFIs VC
is.v: [15065](0.000) dIV dIV Qw Iq Xd <--> THb O*m SFIs VC
were.v-d: [32485](0.000) dIV dIV dIV Qe <--> VC THb @MV O*m @EBm SFIp
were.v-d: [33528](0.000) dIV dIV B**t dIV Qe <--> @MV VC @EBm SIpx VC
come.v: [130](0.000) @E dIV B*m <--> dVJlh VC VC
come.v: [839](0.000) dIV B*m dWV PP @E <--> VC VC VC
LEFT-WALL: [2629](3.000) <--> QUc Xp Xp hWd QUd
different.a: [108](0.000) EA EA <--> dAJlc
night.n: [55](0.400) <--> dSJl @M @M NM
do.v: [6459](0.000) dWV Wi @E @E <--> VC @MV @MV O
do.v: [1327](0.000) Rw <--> I*d N N SIp
be.v: [31228](0.050) dIV dIV Wi @E @E <--> Pa
Maybe some macros are included more than once?
The ones with duplicate sequential multi-connectors (like @MXs @MXs, which is common) can be deleted before power_prune(), as it will not delete them if there is more than one match.
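One possible reading of that cleanup, as a rough sketch on plain strings rather than the library's actual connector structures: collapse consecutive identical multi-connectors in a connector list before pruning.

#include <stdio.h>
#include <string.h>

/* Drop consecutive duplicate multi-connectors ("@" prefix) from a
 * connector list, e.g. "Ss*s @MXs @MXs @M EL" -> "Ss*s @MXs @M EL".
 * Plain strings here; real code would presumably work on the
 * library's connector lists instead. */
static int dedup_multi(const char **conn, int n)
{
    int out = 0;
    for (int i = 0; i < n; i++) {
        if (out > 0 && conn[i][0] == '@' && strcmp(conn[i], conn[out - 1]) == 0)
            continue;               /* skip the duplicate copy */
        conn[out++] = conn[i];
    }
    return out;                     /* new length */
}

int main(void)
{
    const char *jet[] = { "Ss*s", "@MXs", "@MXs", "@M", "EL" };
    int n = dedup_multi(jet, 5);
    for (int i = 0; i < n; i++) printf("%s ", jet[i]);
    printf("\n");
    return 0;
}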
There's a combinatoric explosion, there's no particularly obvious way to limit it.
Maybe more effort is needed in pruning. I have thought of more complex algorithms that may be enabled on long sentences only (because they are relatively time-consuming - the processing shortcuts I used cannot be applied in them). I also have speedup demos that I still need to convert to production code.
This is not the first time it happened -- I'm pretty sure the dicts from a few years ago are a lot faster...
Yes.
And pseudo-cross-links will add even more overhead (I know because I have already mostly implemented that, and I am now trying to reduce the overhead). BTW, in the code I called them fxlink (from fake-cross-link), but maybe another term is more appropriate (e.g. pxlink for pseudo-cross-link).
The multiple @E @E is just due to sloppiness in the dictionary, the problem being that it's hard/tedious to clean up.
The complexity of "is, was, were" is not surprising...
The duplicated dCV dCV and the like is due to sloppiness in <vc-be-no-obj>, which has <verb-wall> appearing more than once in a disjunct. I'm going to try to clean it up now...
I fixed the duplicated wall connectors in were.v here: e8f862ccd5294bc3f5924a35c341d7bd1dc71bf2
The duplicated VC is a bug... I don't know what to say... maintaining the English dict by hand is very tedious...
In a WIP I have the following mode of printing expressions (it can use a flag). I did it in the hope that it may help in finding the duplicate connectors and other suspect connector sequences in disjuncts. If it seems useful I can submit it in a PR (after the disjunct-printing PR).
linkparser> !!test.n
Token "test.n" matches:
test.n 8509 disjuncts <en/words/words.n.1-const>
Token "test.n" expressions:
test.n
<marker-common-entity>: (XXXGIVEN+) or
<common-const-noun>: ((
<common-phonetic>: (((
<noun-modifiers>: (((@A- & {[[@AN-]]}) or [@AN-]0.100 or ([[@AN-]0.100 & @A-] & {[[@AN-]]}) or ())) & (
<noun-and-s>: ((({@M+} & dSJls+) or ({[@M+]} & dSJrs-))) or (GN+ & (DD- or [()])) or Us- or ({Ds-} & [Wa-]0.050 & ({Mf+} or {NM+})))) or (
<nn-modifiers>: (((@A- & {[[@AN-]]}) or [@AN-]0.100 or ([[@AN-]0.100 & @A-] & {[[@AN-]]}))) & (({NMa+} & AN+) or ((NM+ or ({[NM+]1.500} & (Ds**x- or
<no-det-null>: ([(())]headline)))) & ((
<noun-rel-s>: (({@M+} & {R+ & Bs+ & {[[@M+]]}} & {@MXs+})) & (
<noun-main-s>: (((Ss*s+ &
<CLAUSE>: {({@hCOd-} & (C- or ((dRJrc- or dRJlc+)))) or ({@hCO-} & Wd-) or [Rn-]}) or SIs- or (Js- & ({Jk-} or {Mf+})) or Os- or
<post-nominal-s>: ([{[Bsj+]} & Xd- & (Xc+ or
<costly-null>: ([[[[()]]]])) & MX-]0.100) or
<costly-null>: ([[[[()]]]]))) or
<rel-clause-s>: (({Rw+} & Bsm+)))) or
<noun-and-s>: ((({@M+} & dSJls+) or ({[@M+]} & dSJrs-))))) or (YS+ & Ds**x-))))) or ({NMa+} & AN+) or ((NM+ or ({[NM+]1.500} & (Ds**c- or
<no-det-null>: ([(())]headline)))) & ((
<noun-rel-s>: (({@M+} & {R+ & Bs+ & {[[@M+]]}} & {@MXs+})) & (
<noun-main-s>: (((Ss*s+ &
<CLAUSE>: {({@hCOd-} & (C- or ((dRJrc- or dRJlc+)))) or ({@hCO-} & Wd-) or [Rn-]}) or SIs- or (Js- & ({Jk-} or {Mf+})) or Os- or
<post-nominal-s>: ([{[Bsj+]} & Xd- & (Xc+ or
<costly-null>: ([[[[()]]]])) & MX-]0.100) or
<costly-null>: ([[[[()]]]]))) or
<rel-clause-s>: (({Rw+} & Bsm+)))) or
<noun-and-s>: ((({@M+} & dSJls+) or ({[@M+]} & dSJrs-))))) or (YS+ & Ds**c-)))
The extra parens (also present with the regular expression printing) seem to be an artifact that I need to track down and eliminate. BTW, this works by using macros as a second type of expression tag and printing them in front of subexpressions (instead of as a cost, as done with dialect components), with an indentation proportional to their nesting level.
In the above, the parts that are before, between, and after the macros need to be folded with indentation too, so it will be clear what the content of each macro is.
Yes, that could be useful. Even more useful would be an inverse-lookup: given a disjunct, where did it come from? I'm often fumbling trying to figure this out, to see if it came from <rel-clause-s> or from somewhere else...
Even more useful would be an inverse-lookup: given a disjunct, where did it come from?
This can be easily implemented... I will add this.
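For what it's worth, the essential data for such a lookup is just a record of the originating macro, collected per alternative while the expression is expanded; the lookup itself is then trivial. A toy sketch (names and data are invented, not the actual implementation):

#include <stdio.h>
#include <string.h>

/* Toy record: a disjunct (as a flat string) plus the innermost macro
 * that contributed it during expression expansion. */
typedef struct {
    const char *disjunct;
    const char *macro;
} Origin;

/* Inverse lookup: report which macro a given disjunct came from. */
static void whence(const Origin *tab, int n, const char *disjunct)
{
    for (int i = 0; i < n; i++)
        if (strcmp(tab[i].disjunct, disjunct) == 0) {
            printf("%s came from %s\n", disjunct, tab[i].macro);
            return;
        }
    printf("%s: origin unknown\n", disjunct);
}

int main(void)
{
    /* Pretend these were recorded while expanding a noun expression. */
    const Origin tab[] = {
        { "Rw+ & Bsm+",   "<rel-clause-s>" },
        { "@M+ & dSJls+", "<noun-and-s>"   },
    };
    whence(tab, 2, "Rw+ & Bsm+");
    return 0;
}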
I need good syntax for expression display flags. For now the syntax is:
!!word for the current expression display output.
!!word/regex and !!word/regex/flags for disjunct display.
But I need a way to specify flags for expression display (e.g. by macros). I thought of the following way:
!!word/flags for expression display.
!!word/regex/flags for disjunct display.
There is also a problem of displaying tens of thousands of disjuncts. An internal paginator (like in git and gdb) may help.
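The usual way such an internal pager works is to pipe long listings through the user's pager instead of writing them straight to the terminal; a minimal stand-alone sketch using POSIX popen() (not link-grammar code; it assumes "less" as a fallback):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const char *pager = getenv("PAGER");
    if (pager == NULL || pager[0] == '\0') pager = "less";

    FILE *out = popen(pager, "w");
    if (out == NULL) out = stdout;         /* fall back to plain output */

    for (int i = 0; i < 100000; i++)       /* stand-in for a huge disjunct list */
        fprintf(out, "disjunct %d ...\n", i);

    if (out != stdout) pclose(out);
    return 0;
}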
Most of that and more (the discussion continues at PR #1083) has been implemented by now (PR #1085). However, an internal paginator for the disjunct list has not been implemented.
I said above regarding the duplicate sequential multi-connectors:
The ones with duplicate sequential multi-connectors (like @MXs @MXs, which is common) can be deleted before power_prune(), as it will not delete them if there is more than one match.
I tried to eliminate the duplicate copies but the speedup was unnoticeable. So I didn't include that in any PR.
A more interesting experiment would be to eliminate them and also eliminate the multiple VCs and the very long disjuncts, and then propagate the result into the original expressions, simplify them, and use !!word/m to see which parts of the expression got eliminated. (I have a WIP that can propagate removed disjuncts into their originating expressions in order to simplify the expressions.)
There is a bug here regarding the link to LEFT-WALL:
test.v: [71186]0.000= @E- Qa- dWV- B*d- <> VC+ VC+
<verb-pl,i>:
<verb-ico>: @E- &
<verb-why>: Qa- &
<verb-wall>: dWV- & VC+
&
<vc-predict>:
<vc-trans>:
<b-minus>: B*d-
&
<mv-coord>: VC+
dWV- should have been a shallow connector. Here B*d- tries to connect to words before LEFT-WALL.
very long disjuncts
My general impression is that there should never be any disjuncts longer than about 6 connectors in the English dict. I don't really know, and I have never looked at the statistics of disjunct-length versus frequency. If those stats were available, then we could probably cut anything longer than 6 (or whatever the "real world" max is).
and I have never looked at the statistics of disjunct-length versus frequency
I looked, and I think I reported it elsewhere, but I cannot find it for now. In the en corpora batches I found a 6-connector disjunct only once. In any case, power_prune() is now so efficient that it removes them in negligible time.
It may be interesting to remove them and propagate the result back to the expressions, to see which macros got affected and whether any of them are made redundant. I may try to do it when I return to this disjuncts-to-expression feedback code.