
Power pruning and duplicate and long disjunct elimination in EXPRESSIONS

Open ampli opened this issue 7 years ago • 5 comments

I started this discussion in PR #818, and I am transferring it here, with expansions. The purpose of the idea discussed here is to increase the speed and quality of the SAT parser's linkage solutions by taking advantage of simplification steps that are currently done only for the classic parser (the original and default parser).

Background - current per-sentence processing

  • Tokenize.
  • Build the 2D token array, in which each token has its dictionary expression.
  • Expression Pruning step, which simplifies these expressions.
  • Only for the classic parser, conversion to disjuncts (see below):
      - Eliminate duplicate disjuncts.
      - Eliminate too-long disjuncts.
      - Power Pruning: eliminate disjuncts that cannot "connect".
  • Parsing:
      - The classic parser works with the disjuncts from the step above.
      - The SAT parser directly uses the expressions from the Expression Pruning step.
  • Postprocessing:
      - Sane-morphism (validate that tokens from different ways of tokenizing a word are not mixed).
      - Filtering by link constraints (the SAT parser does some of this work).
      - Calculating linkage metrics.
      - Sorting the linkages by their metrics (currently done only for the classic parser).
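The same flow, condensed into a schematic (the function names below are approximate paraphrases of the C sources, given for orientation only, not the exact call sequence):

```c
/* Per-sentence flow, schematically.  Names are paraphrases, not exact. */
void process_sentence(Sentence sent, Parse_Options opts)
{
	tokenize(sent);                       /* build the 2D token array             */
	build_sentence_expressions(sent);     /* attach dictionary expressions        */
	expression_prune(sent);               /* simplify the expressions (exprune.c) */

	if (using_classic_parser(opts))
	{
		build_disjuncts(sent);                /* expressions -> disjuncts        */
		eliminate_duplicate_disjuncts(sent);  /* drop exact duplicates           */
		eliminate_too_long_disjuncts(sent);   /* drop unsatisfiably long ones    */
		power_prune(sent);                    /* drop disjuncts that can't link  */
		classic_parse(sent);                  /* parsing works on disjuncts      */
	}
	else
	{
		sat_parse(sent);      /* works directly on the pruned expressions */
	}

	post_process(sent);       /* sane-morphism, link constraints, metrics, sorting */
}
```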

More info on the above-mentioned terms and processes:

Expressions

The expressions originate in the dictionary, and contain significant duplications for three reasons:

  1. They are written manually and they are complex.
  2. The use of dict macros and references to other words.
  3. Tokens in the same 2D token-array slot, whose expressions overlap.

Expressions are subject to a simplification process called Expression Pruning before they are actually used. This process removes from the expressions connectors that cannot connect to other words in the sentence (because those words have no matching connectors, or the matching connectors are too far away - farther than their "connector length"). It simplifies the expressions, but it is limited in what it can do. Its original purpose was to save work in the next step of converting the expressions to disjuncts and further eliminating the irrelevant ones.
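As a made-up illustration (not from the actual dict): if a token's expression is (A+ or B+) & C-, and no other token in the sentence carries a matching A- connector within the allowed connector length, Expression Pruning nullifies the A+ connector, and the cleanup pass simplifies the expression to B+ & C-. If B+ had no possible match either, the whole expression would collapse and the token could not link at all.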

Elimination of duplicate disjuncts

Duplicate disjuncts are harmful not only because they add extra parsing and postprocessing overhead with no benefit (after all, they carry duplicate info), but also because they cause the generation of identical parses (whose number increases exponentially with the amount of duplication). Some of these duplicate parses may even have different disjunct costs, and the current SAT solver may even produce some of them first (which is not good).

(Postprocessing overhead means not only the added filtering of bad link combinations, but also the added linkage extraction, sane-morphism, and sorting by metric.)
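As a made-up illustration: if a word's expression expands, through two different dict macros, into the same conjunction S- & O+ twice, the word ends up with two copies of the disjunct [S- O+]. Every linkage that uses one copy also exists with the other, so the number of linkages doubles; and if the two copies happen to carry different costs, the "same" parse shows up with two different costs.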

Elimination of disjuncts that are "too long"

These are disjuncts that would need to connect to words beyond the sentence walls, i.e. they have more connectors in one direction than there are words available in that direction.

Elimination of disjuncts that cannot "connect"

The "Power Pruning" algo discards disjuncts that cannot participate in a linkage, e.g. due to the connector order in them.

The problem

The SAT parser analyzes the expressions and generates SAT clauses that constrain the possible links between words so that these expressions are satisfied.

It currently cannot benefit from the irrelevant-disjunct elimination step, because it uses the expressions as produced by the Expression Pruning pass, with all of the duplicate and irrelevant info still in them.
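As a simplified illustration of the encoding: for a word whose pruned expression is A+ & (B+ or C-), the SAT encoder introduces a boolean variable for every potential link such a connector could form with a matching connector on another word, and emits clauses requiring that at least one A+ link variable is true and that at least one B+ or C- link variable is true (plus the usual ordering/planarity constraints). Every connector instance that is still present in the expression - including duplicated ones, and ones that no surviving disjunct could ever use - contributes variables and clauses; this is exactly the overhead the proposal below aims to remove.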

My proposed solution to this problem

Perform all the disjunct simplification steps also for the SAT parser, and then feed the results back into the expressions (those that result from the existing Expression Pruning step).

The complete algorithm for this feedback is:

  1. Perform the Expression Pruning just as now.
  2. When creating the disjuncts from the expressions, keep in each disjunct connector its position in the expression from which it was created (this involves an intermediate step of keeping this position in the Tconnector).
  3. Perform all the disjunct elimination steps: duplicate elimination, long-disjunct elimination, and Power Pruning. These steps are done as now, with no changes.
  4. The feedback step: for each connector in a disjunct that remained after step 3, mark the corresponding connector in the originating expression (a rough sketch of steps 4-6 follows this list).
  5. Mark dead connectors: instead of using matches_S() (see exprune.c), nullify the connectors that were not marked in step 4.
  6. Invoke purge_Exp() and clean_up_expressions() to simplify the resulting expressions, exactly as currently done in exprune.c.
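Here is a rough sketch of the feedback (steps 4-6), assuming the library's internal types (Sentence, Disjunct, Connector, Exp, E_list) plus two hypothetical additions: an originating_exp back-pointer kept in each disjunct connector (carried via Tconnector, step 2), and a "used" mark in the Exp node. This is a sketch of the idea, not the actual patch; the field names and the exact Exp layout are approximate.

```c
/* Step 4: mark every expression connector node that is still referenced by
 * a connector of a disjunct that survived the elimination steps.
 * "originating_exp" and "used" are hypothetical fields added for this sketch. */
static void mark_used_exp_connectors(Sentence sent)
{
	for (WordIdx w = 0; w < sent->length; w++)
	{
		for (Disjunct *d = sent->word[w].d; d != NULL; d = d->next)
		{
			for (Connector *c = d->left; c != NULL; c = c->next)
				c->originating_exp->used = true;
			for (Connector *c = d->right; c != NULL; c = c->next)
				c->originating_exp->used = true;
		}
	}
}

/* Step 5: walk an expression and "kill" the connector nodes that were not
 * marked in step 4 (this replaces the matches_S() test of exprune.c). */
static void nullify_unmarked_connectors(Exp *e)
{
	if (e == NULL) return;
	if (e->type == CONNECTOR_type)
	{
		if (!e->used) e->u.condesc = NULL;   /* now a "dead" connector */
	}
	else   /* AND_type or OR_type */
	{
		for (E_list *l = e->u.l; l != NULL; l = l->next)
			nullify_unmarked_connectors(l->e);
	}
}

/* Step 6: purge_Exp() and clean_up_expressions() are then invoked on each
 * word's expressions, exactly as exprune.c already does, to delete the dead
 * connectors and collapse the AND/OR nodes that became trivial. */
```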

Implementation

The feedback code is small, and most of the work is done by existing functions, as mentioned above. In order not to enlarge the Connector and Disjunct structs, I reused existing space in them. I had to introduce a (very slight) overhead in the disjunct creation: remembering the position in the expression (step 2 above). This can be done by adding a field to Tconnector, or by changing the definition of an existing field - I need to check which incurs less overhead.
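For step 2, a minimal sketch of the "add a field to Tconnector" variant (the field name is hypothetical; the alternative of redefining an existing field would avoid enlarging the struct at all):

```c
/* build-disjuncts sketch: the temporary Tconnector remembers which Exp node
 * it was built from, and this pointer is later copied into the resulting
 * Connector, so the feedback step can find the originating expression node. */
struct Tconnector_struct
{
	Tconnector *next;
	/* ... the existing Tconnector fields ... */
	Exp *originating_exp;   /* NEW (hypothetical): source Exp node of this connector */
};
```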

Unfortunately (or maybe not, see below) I encountered an unforeseen problem in the duplicate-disjunct-elimination feedback: I couldn't completely eliminate all the duplicate disjuncts due to a subtle sane-morphism problem - I didn't find a way to propagate discarded connectors that belong to duplicate disjuncts that originate from different tokens (most of the duplicate disjuncts don't have this problem). But maybe this is the time to try to eliminate the gword-set concept of sane-morphism (which causes this problem). I am investigating replacing it with the 2D token-array slot number. If I succeed, this will simplify and speed up code in several other places, and will also allow removing duplicate disjuncts that result from different tokens.

Current results

  1. A speedup of the SAT parser for English (the fixes batch), especially a very big speedup for long sentences. I still need to validate that exactly the same linkages are produced (once the duplicates are removed).
  2. Many fewer duplicate linkages are observed.
  3. Apparently there is a cost-mishandling problem that causes 2 sentences from the fixes run not to get parsed with the default cost. I'm not sure about the details, and not even sure it isn't due to another problem (as these two sentences used to have similar problems). I am continuing to investigate.

Further work

  1. I still need to investigate why expressions are not simplified this way for Russian (a bug?).
  2. I think some duplicates cannot be suppressed by one round of such feedback, so I will try to regenerate disjuncts from the resulting expressions (after the feedback) and retry the feedback.
  3. Check how much duplication there is in the dict expressions.
  4. If all goes well, I will have to rewrite the hacks (I even used a low-order pointer bit as a Boolean variable in order not to change old code while testing) and prepare a PR.

ampli avatar Oct 07 '18 22:10 ampli

In PR #818 I said:

BTW, maybe it is possible to simplify the current dict expressions by such a feedback from duplicate disjunct elimination. But I don't know if this has any significant benefit (besides maybe saving some expression memory), since duplicate disjunct elimination will still be needed at runtime (due to multiple tokens in each token slot).

@linas said there:

simplify the current dict expressions

Not for the files themselves; they're just barely human-readable, and changes are always threatening to make them more unreadable.

I thought of using a step analogous to the current m4 step, producing a simplified dict file that create_dictionary_*() will prefer over 4.0.dict. Of course, it is worth implementing only if it produces a significant parse speedup (due to some saving in the expression/disjunct simplification processing).

ampli avatar Oct 07 '18 22:10 ampli

A significant additional speedup resulted after I realized that the patched-SAT-parser power_prune() step is now not needed, since there is nothing left to do in this regard - the expressions used are now already power-pruned. Somehow this solved the problem of the 2 sentences that could not be parsed due to the apparent cost problem. I will continue to investigate this case later if needed.

Now the run time of the fixes batch is about the same (within 2% - no point for now in making more exact measurements) under the classic and SAT parsers. This is a speedup of about 14%. The basic batch under the patched SAT parser also runs faster without this power-prune step, but it is still about 25% slower (by about 0.8 seconds) than under the classic parser.

This elimination of power_prune() from the patched SAT parser also solved much of the slowness for ru, but it is still about 20% slower than without the patch. It seems the reason is that converting to disjuncts is very costly for ru - more than the benefit of the duplicate-disjunct elimination and disjunct pruning. If I don't find a way to mitigate that, I propose to add a dict definition to bypass (or to enable) the expression power-pruning step for the SAT parser.

Possible ways to mitigate the disjunct-conversion cost:

  1. Improve the expression-pruning by removing common or'ed subexpressions, especially null expressions.
  2. Improve the efficiency of the disjunct conversion. One easy way for that is using memory pools there.

A side comment on memory pools: using TLS variables for memory pools (instead of putting them in the Sentence struct etc.) has a potential for a large speedup, because there would be no need to free them after each sentence (only, perhaps, to optionally trim them down if they grow over some threshold).
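A minimal sketch of that idea, with an illustrative pool interface (the names below are not the library's actual memory-pool API):

```c
#include <stddef.h>

/* Illustrative pool interface, for the sketch only. */
typedef struct pool pool_t;
pool_t *pool_create(size_t element_size);
size_t  pool_bytes_in_use(const pool_t *p);
void    pool_trim(pool_t *p);

/* The pool lives in thread-local storage, so it survives across sentences
 * instead of being freed and re-created for every sentence. */
static _Thread_local pool_t *disjunct_pool;

enum { POOL_TRIM_THRESHOLD = 16 * 1024 * 1024 };  /* e.g. trim above 16 MB */

static pool_t *get_disjunct_pool(size_t element_size)
{
	if (disjunct_pool == NULL)
		disjunct_pool = pool_create(element_size);
	return disjunct_pool;
}

static void sentence_done(void)
{
	/* No per-sentence free; only an optional trim when the pool got large. */
	if ((disjunct_pool != NULL) &&
	    (pool_bytes_in_use(disjunct_pool) > POOL_TRIM_THRESHOLD))
		pool_trim(disjunct_pool);
}
```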

BTW, with the released LG version, ru runs much faster under the SAT parser than under the classic one. Since the SAT parser is still largely unoptimized, I believe it is possible to make it run much faster also for short English sentences. But in order to make it useful, it should be changed/fixed to produce the lower-cost parses first. Maybe a different SAT-solver library is needed for that (weighted SAT?).

ampli avatar Oct 08 '18 01:10 ampli

I didn't find a way to propagate discarded connectors that belong to duplicate disjuncts that originate from different tokens

Really meant "from diferent Gwords", as the token string is the same, otherwise the disjunct would not be "duplicate".

But maybe this is the time to try to eliminate the gword-set concept of sane-morphism (which causes this problem). I am investigating replacing it with the 2D token-array slot number.

For now I could not find a way to do that. My only solution for now for eliminating any duplicate disjunct (i.e. also those that originate from different Gwords) is to have a Connector pointer in the Exp struct instead of the condesc_t pointer (so its originating_gword field can be set appropriately). But using a Connector there is not needed for the classic parser and would only add overhead, so such expressions can be used only for the SAT parser. (Most of the code will not need a change if an unnamed union is used for that; see the sketch below.)
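A sketch of that idea (the surrounding Exp layout is abbreviated, and everything except the existing condesc member is illustrative):

```c
struct Exp_struct
{
	Exp *next;
	Exp_type type;      /* AND_type, OR_type or CONNECTOR_type */
	/* ... the other existing fields (dir, multi, cost, ...) ... */
	union
	{
		E_list *l;                  /* operands of AND/OR nodes */
		union
		{
			condesc_t *condesc;     /* classic parser: as today              */
			Connector *connector;   /* SAT parser only: a full Connector, so
			                           originating_gword can be set          */
		};                          /* unnamed union: existing "condesc"
			                           accesses remain valid                 */
	} u;
};
```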

ampli avatar Oct 08 '18 02:10 ampli

2. Improve the efficiency of the disjunct conversion. One easy way for that is using memory pools there.

Done in PR #829.

ampli avatar Oct 11 '18 01:10 ampli

Also, as mentioned in #830: the way that AND and OR are used with E_lists is not very intuitive; it does not feel "natural". I tried to fix this once, and mostly got it to work, except that I ran into issues with the way the SAT code used the existing E_list design; this brought my efforts to a halt.

linas avatar Oct 15 '18 23:10 linas