link-grammar icon indicating copy to clipboard operation
link-grammar copied to clipboard

Recent 5x multiplication of verb disjuncts

Open ampli opened this issue 5 years ago • 15 comments

From 5.7.0:

linkparser> !!test.v
Token "test.v" matches:
    test.v                       18101  disjuncts

Now (248a6133d8793528f6376cd26baa3b9733001bf6):

linkparser> !!test.v
Token "test.v" matches:
    test.v                       93621  disjuncts

This also seems to cause a noticeable speed regression.

ampli avatar Dec 28 '19 00:12 ampli

Yeah. I'm guessing that this is due to the recent addition of (Wi- & Xc+ & SI*i+ & {Xc+}) to <verb-ico> which is a part of <verb-pl,i> which is used in the definition of VERB_PLI which is of the form <verb-pl,i> & ($1), and $1 is some verb difinition, e.g. for test.v it's

<vc-predict>:
  (<vc-trans> & {{Xc+} & QI+})
  or ({@MV+} & (<embed-verb> or TH+ or RSe+ or Zs- or VC+))
  or ({@MV+} & (<QI+pref> & {MV+}));

So test.v used to have the above, anded with whatever the old <verb-ico> used to be. Now, its the old stuff, and in addition, the above anded with (Wi- & Xc+ & SI*i+ & {Xc+}) .. I think I also added (Xd- & (EI- or S**i-)) which explodes another factor of two.

There's a combinatoric explosion, there's no particularly obvious way to limit it.

That's one reason why machine learning is interesting.

It might be interesting to parse a jizzilion sentences, and see which disjuncts never show up. But how to remove them is unclear: although some disjunct might never show up for one word, it might show up for another similar one, and then what? It's a morass.

linas avatar Dec 30 '19 04:12 linas

This is not the first time it happened -- I'm pretty sure the dicts from a few years ago are a lot faster...

linas avatar Dec 30 '19 04:12 linas

I produced the disjuncts of all the unique dict expressions of words. There are ~170M disjuncts in this disjunct list. Many of them don't make sense for me. For example:

  1. Long connector lists (~3M are >6, there are up to 13).
  2. Repeating sequential connectors in a connector list. Especially strange is @E @E.
  3. Some words have a very large number of disjuncts (were.v-d maybe have the most - ~450K).

Examples: From the parses of the fixes batch, only one connector list of >5 connectors has been observed, on sentences like:

    +------------------------Xp------------------------+
    +-------------------->WV-------------------->+     |
    +--------------Xp--------------+             |     |
    +----------->WV---------->+    |             |     |
    |         +------Bsw------+    |             |     |
    +--->Wq---+    +---Pg*b---+    +->Wd--+      |     |
    +-ZZZ-+   +-Rw-+SIpx+     |    +ZZZ+  +--Ss--+     |
    |     |   |    |    |     |    |   |  |      |     |
LEFT-WALL " what are.v you doing.v ?   " she asked.v-d .

            LEFT-WALL     2.000  ZZZ+ hWq+ hWV+ Xp+ hWV+ Xp+

There are ~700K jets with 2-3 duplicate sequential connectors, like:

somebody: [21](0.000)  <--> Ss*s @MXs @MXs @M EL
is.v: [7545](0.000) dWV dWV Qw Iq Xd <--> THb O*m SFIs VC
is.v: [11305](0.000) dCV dCV Qw Iq Xd <--> THb O*m SFIs VC
is.v: [15065](0.000) dIV dIV Qw Iq Xd <--> THb O*m SFIs VC
were.v-d: [32485](0.000) dIV dIV dIV Qe <--> VC THb @MV O*m @EBm SFIp
were.v-d: [33528](0.000) dIV dIV B**t dIV Qe <--> @MV VC @EBm SIpx VC
come.v: [130](0.000) @E dIV B*m <--> dVJlh VC VC
come.v: [839](0.000) dIV B*m dWV PP @E <--> VC VC VC
LEFT-WALL: [2629](3.000)  <--> QUc Xp Xp hWd QUd
different.a: [108](0.000) EA EA <--> dAJlc
night.n: [55](0.400)  <--> dSJl @M @M NM
do.v: [6459](0.000) dWV Wi @E @E <--> VC @MV @MV O
do.v: [1327](0.000) Rw <--> I*d N N SIp
be.v: [31228](0.050) dIV dIV Wi @E @E <--> Pa

Maybe some macros are included more than once?

The ones with duplicate sequential multi-connectors (like @MXs @MXs which is common) can be deleted before power_prune() as it will not delete them if there is more than one match.

ampli avatar Dec 30 '19 09:12 ampli

There's a combinatoric explosion, there's no particularly obvious way to limit it.

Maybe there is a need to make more efforts in pruning. I thought of more complex algos, that may be enabled on long sentences only (because they are relatively time consuming - the processing shortcuts I used cannot be done in them). I also have speedup demos that I still need to convert to production code.

This is not the first time it happened -- I'm pretty sure the dicts from a few years ago are a lot faster...

Yes.

And pseudo-cross-links will add even more overhead (I know as I already mostly implemented that, and I try now to reduce the overhead). BTW, I called them in the code fxlink (from fake-cross-link), but maybe another term is more appropriate (e.g. pxlink for pseudo-cross-link).

ampli avatar Dec 30 '19 09:12 ampli

The multiple @E @E is just due to sloppiness in the dictionary, the problem being that it's hard/tedious to clean up.

The complexity of "is, was, were`" is not surprising...

The duplicated dCV dCV and likewise is due to sloppiness in <vc-be-no-obj> which has <verb-wall> appearing more than once in a disjunct. I'm going to try to clean it up now ....

linas avatar Jan 02 '20 04:01 linas

I fixed the duplicated wall connecteors in were.v here: e8f862ccd5294bc3f5924a35c341d7bd1dc71bf2

linas avatar Jan 02 '20 04:01 linas

The duplicated VC is a bug ... I don't know what to say .. maintaining the English dict by hand is very tedious...

linas avatar Jan 02 '20 04:01 linas

In a WIP I have the following mode of printing expressions (can use a flag). I did it in a hope that it may help in finding out the duplicate connectors and other suspected connector sequences in disjuncts. If it seems useful I can submit that in a PR (after the disjunct printing PR).

linkparser> !!test.n
Token "test.n" matches:
    test.n                        8509  disjuncts <en/words/words.n.1-const>


Token "test.n" expressions:
    test.n
<marker-common-entity>: (XXXGIVEN+) or 
<common-const-noun>: ((
    <common-phonetic>: (((
        <noun-modifiers>: (((@A- & {[[@AN-]]}) or [@AN-]0.100 or ([[@AN-]0.100 & @A-] & {[[@AN-]]}) or ())) & (
        <noun-and-s>: ((({@M+} & dSJls+) or ({[@M+]} & dSJrs-))) or (GN+ & (DD- or [()])) or Us- or ({Ds-} & [Wa-]0.050 & ({Mf+} or {NM+})))) or (
        <nn-modifiers>: (((@A- & {[[@AN-]]}) or [@AN-]0.100 or ([[@AN-]0.100 & @A-] & {[[@AN-]]}))) & (({NMa+} & AN+) or ((NM+ or ({[NM+]1.500} & (Ds**x- or 
        <no-det-null>: ([(())]headline)))) & ((
        <noun-rel-s>: (({@M+} & {R+ & Bs+ & {[[@M+]]}} & {@MXs+})) & (
        <noun-main-s>: (((Ss*s+ & 
            <CLAUSE>: {({@hCOd-} & (C- or ((dRJrc- or dRJlc+)))) or ({@hCO-} & Wd-) or [Rn-]}) or SIs- or (Js- & ({Jk-} or {Mf+})) or Os- or 
            <post-nominal-s>: ([{[Bsj+]} & Xd- & (Xc+ or 
                <costly-null>: ([[[[()]]]])) & MX-]0.100) or 
            <costly-null>: ([[[[()]]]]))) or 
        <rel-clause-s>: (({Rw+} & Bsm+)))) or 
        <noun-and-s>: ((({@M+} & dSJls+) or ({[@M+]} & dSJrs-))))) or (YS+ & Ds**x-))))) or ({NMa+} & AN+) or ((NM+ or ({[NM+]1.500} & (Ds**c- or 
    <no-det-null>: ([(())]headline)))) & ((
    <noun-rel-s>: (({@M+} & {R+ & Bs+ & {[[@M+]]}} & {@MXs+})) & (
    <noun-main-s>: (((Ss*s+ & 
        <CLAUSE>: {({@hCOd-} & (C- or ((dRJrc- or dRJlc+)))) or ({@hCO-} & Wd-) or [Rn-]}) or SIs- or (Js- & ({Jk-} or {Mf+})) or Os- or 
        <post-nominal-s>: ([{[Bsj+]} & Xd- & (Xc+ or 
            <costly-null>: ([[[[()]]]])) & MX-]0.100) or 
        <costly-null>: ([[[[()]]]]))) or 
    <rel-clause-s>: (({Rw+} & Bsm+)))) or 
    <noun-and-s>: ((({@M+} & dSJls+) or ({[@M+]} & dSJrs-))))) or (YS+ & Ds**c-)))

The extra parens (also with regular expression printing) seem as an artifact that I need to find out and eliminate. BTW, this works by using macros as a second type of expression tag and printing them in front of subexpressions (instead of as a cost, as done with dialect components) with an indentation proportional to their nesting level.

ampli avatar Jan 08 '20 06:01 ampli

In the above, the parts that are before, between and after between macros need to be folded with indentation too so it will be clear what the content of each macro is.

ampli avatar Jan 08 '20 06:01 ampli

Yes, that could be useful. Even more useful would be an inverse-lookup: given a disjunct, where did it come from? I'm often fumbling trying to figure this out, to see if it came from <rel-clause-s> or from somewhere else...

linas avatar Jan 18 '20 19:01 linas

Even more useful would be an inverse-lookup: given a disjunct, where did it come from?

This can be easily implemented... I will add this.

I need good syntax for expression display flags. For now the syntax is: !!word for the current expression display output. !!word/regex and !!word/regex/flags for disjunct display. But I need a way to specify flags for expression display (e.g. by macros). I thought of the following way: !!word/flags for expression display. !!word/regex/flags for disjunct display.

There is also a problem of displaying tens of thousands of disjuncts. An internal paginator (like in git and gdb) may help.

ampli avatar Jan 18 '20 22:01 ampli

Most of that and more (the discussion continue at PR #1083) have got implemented by now (PR #1085). However, an internal paginator for the disjunct list has not been implemented.

I said above regarding of duplicate multi-disjuncts:

The ones with duplicate sequential multi-connectors (like @MXs @MXs which is common) can be deleted before power_prune() as it will not delete them if there is more than one match.

I tried to eliminate the duplicate copies but the speedup was unnoticeable. So I didn't include that in any PR. A more interesting experiment can be to eliminate them and also eliminate the multiple VCs and the very long disjuncts, and then propagate the result into the original expressions, simplify them and use `!!word/m' to see which parts of the expression got eliminated. (I have WIP that can propagate removed disjuncts into their originating expressions in order to simplify the expressions.)

ampli avatar Mar 02 '21 21:03 ampli

There is a bug here regarding the link to LEFT-WALL:

          test.v: [71186]0.000= @E- Qa- dWV- B*d- <> VC+ VC+
<verb-pl,i>:
    <verb-ico>: @E- &
        <verb-why>: Qa- &
            <verb-wall>: dWV- & VC+
   &
<vc-predict>:
    <vc-trans>:
        <b-minus>: B*d-
           &
        <mv-coord>: VC+

dWV- should have been a shallow connector. Here B*d- tries to connect to words before LEFT-WALL.

ampli avatar Mar 03 '21 03:03 ampli

very long disjuncts

My general impression is that there should never be any disjuncts longer than about 6 connectors long, in the English dict. I don't really know, and I have never looked at the statistics of disjunct-length versus frequency. If those stats were available, then we could probably cut anything longer than 6 (or whatever the "real world" max is.

linas avatar Mar 22 '21 19:03 linas

and I have never looked at the statistics of disjunct-length versus frequency

I looked and I think I reported elsewhere but I cannot find it for now. In the en corpora batches I found 6 disjuncts only once. In any case power_prune() is now so efficient it removes them in a negligible time.

It may be interesting to remove them and propagate the result back to the expressions, to see which macros got affected and whether any of them are made redundant. I may try to do it when I return to this disjuncts-to-expression feedback code.

ampli avatar Mar 22 '21 23:03 ampli