
An additional speedup improvement for long sentences - with a catch

Open ampli opened this issue 6 years ago • 19 comments

I tested a proof-of-concept optimization that yields a ~60% speedup for the fix-long batch. Using it, the most CPU-consuming sentence in it ("And yet [...]" - a total of 174 tokens) parses for me in about 31 seconds (and more optimizations in the framework of this idea are still possible). The linkages of the (ru and en) basic and fixes batches were found to be the same (I cannot test all the linkages of the long sentences in fix-long, since they have a combinatorial explosion).

The code has these main problems:

  1. It is not so "nice" and it clutters do_count() (mainly at 3 points).
  2. It may be hard to verify it by reading.

It can be optimized much further, but it may become even less readable.

I guess you may want to see it and comment on it; that may help me finish it, if desired. It is still in proof-of-concept coding style, but if desired I can send it as a PR that is not meant to be applied.

ampli avatar Nov 29 '18 20:11 ampli

I sent my WIP as PR #861. Here is the idea:

The fast-matcher brings us a list of disjuncts from a middle word that match both of the given left and right connectors. Most of these lists contribute nothing to the intermediate linkage counts, because at least one of the sides is often "not grammatical" and yields a count of 0.

Due to the nature of the disjuncts, many of them contain the same connectors, and even the same connector sequences. We can observe the following (this description uses le, but it is similar for re, with one addition [*]):

  1. In the fast matcher, the same endpoint connector will always select the same candidate disjuncts.
  2. Between 2 given words, the same trailing connector sequences (all ending, of course, with the same connector) will have the same linkage count at the next level of do_count(). To see that, let's look at one of them: do_count(ctxt, lw, w, le->next, d->left->next, lnull_cnt), where lw and w are the 2 given words, le is the endpoint connector, and d is a selected disjunct. Since the same disjuncts will always be selected by the same endpoint trailing connector sequence, and only the next connector of each is used here, the returned count will always be the same. This holds for the rest of the do_count() calls that use the endpoint connectors.
  3. If all the counts due to a particular endpoint trailing connector sequence are 0, there will never be any total count when such an endpoint is tried. This fact can be cached. Note that the case of no selected disjuncts at all, which is very common, is also cached here as 0. Also note that the other optimizations (at the first level only) must be disabled in order to ensure we really have a 0 count.
  4. The triple (lw, w, le) has far fewer combinations than the currently cached counts keyed by (lw,rw,le,re), so it is much more scalable to use it as a key for this caching.
  5. We get the full list of selected disjuncts for a particular non-NULL endpoint if the other endpoint is NULL. So we can create the cache entry when we find this combination (of any non-NULL and NULL endpoints). If the list didn't yield any count, we cache 0; else we cache 1.
  6. If we find a cached 0 for a particular endpoint even when the other endpoint is non-NULL (so only a subset of its disjuncts actually gets selected), we still know that its count will be 0, and so will be the total count. So we can skip the counting altogether.

[*] The addition for re is that we need to check the case of l_bnr.

Now that I look at the code again, it seems I was too conservative with the possible optimizations, and it can be optimized even more in this way. Optimizations for null_count>0 are possible too, but the code is rather complex (however, parsing with null_count>0 also contains a large number of do_count() calls with null_count==0, so this optimization is very effective there too).
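To make points 3-6 concrete, here is a minimal, self-contained sketch of the 0-count cache keyed by the triple (lw, w, le). All names in it (zero_cache, zc_lookup, zc_store, and so on) are hypothetical and are not the actual link-grammar code; the real version would live inside do_count() and use the library's own connector and table types. It assumes a fixed-size open-addressing table with no resizing or freeing, just to show the keying idea:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdlib.h>

    typedef struct Connector_s Connector;   /* stand-in for the real connector type */

    typedef struct
    {
        int lw, w;          /* the two words bounding the counted range */
        Connector *le;      /* the endpoint (trailing) connector        */
        bool has_count;     /* false: this triple is known to yield 0   */
    } zero_cache_entry;

    #define ZC_SIZE 4096    /* must be a power of 2 */

    typedef struct
    {
        zero_cache_entry *slot[ZC_SIZE];
    } zero_cache;

    static size_t zc_hash(int lw, int w, const Connector *le)
    {
        /* The triple (lw, w, le) has far fewer combinations than (lw,rw,le,re). */
        uintptr_t h = (uintptr_t)le;
        h = h * 31 + (unsigned)lw;
        h = h * 31 + (unsigned)w;
        return (size_t)(h & (ZC_SIZE - 1));
    }

    /* Point 6: return the cached entry, or NULL if this triple was never recorded. */
    static zero_cache_entry *zc_lookup(zero_cache *zc, int lw, int w, Connector *le)
    {
        for (size_t i = zc_hash(lw, w, le); zc->slot[i] != NULL; i = (i + 1) & (ZC_SIZE - 1))
        {
            zero_cache_entry *e = zc->slot[i];
            if ((e->lw == lw) && (e->w == w) && (e->le == le)) return e;
        }
        return NULL;
    }

    /* Point 5: record whether the full disjunct selection for (lw, w, le)
     * yielded any count at all (created when the other endpoint is NULL). */
    static void zc_store(zero_cache *zc, int lw, int w, Connector *le, bool has_count)
    {
        size_t i = zc_hash(lw, w, le);
        while (zc->slot[i] != NULL) i = (i + 1) & (ZC_SIZE - 1);
        zero_cache_entry *e = malloc(sizeof *e);
        *e = (zero_cache_entry){ .lw = lw, .w = w, .le = le, .has_count = has_count };
        zc->slot[i] = e;
    }

    /* Usage inside do_count(), before any counting work is done (point 6):
     *
     *     zero_cache_entry *e = zc_lookup(zc, lw, w, le);
     *     if ((e != NULL) && !e->has_count) return 0;  // the whole subcount is known to be 0
     */

A real implementation would of course have to grow the table and tie its lifetime to the count context; the essential point is only that the key is (lw, w, le) rather than (lw,rw,le,re).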

ampli avatar Nov 30 '18 01:11 ampli

I said:

Using it, the most CPU-consuming sentence in it ("And yet [...]" - a total of 174 tokens) parses for me in about 31 seconds (and more optimizations in the framework of this idea are still possible).

Actually, it fails to parse with null_count=0 at this time, and there is no improvement for this particular sentence at !short=20 (only ~60% for the whole batch). However, at !short=40 this sentence has a speedup of 23%.

I couldn't get a linkage even with null_count>0 (!short=40 and !limit=250000):

link-grammar: Warning: Combinatorial explosion! nulls=1 cnt=2147483647
Consider retrying the parse with the max allowed disjunct cost set lower.
At the command line, use !cost-max
Found 2147483647 linkages (0 of 26194 random linkages had no P.P. violations) at null count 1

This demonstrates a bug in the robust-parsing handling, as with 0 actual linkages it should have proceeded to a higher null_count. I still need to investigate that.

Also, the number 26194 (instead of the expected 25000) needs an investigation. First there is this: link-grammar: Info: sane_morphism(): 17897 of 18897 linkages had invalid morphology construction. But what happens to the rest (up to 250000)? I don't know yet.

What I found is why there are so many insane-morphism linkages. From the 2-D tokenizer array:

  word132: broken-hearted,
   alt0: broken-hearted, {,}
   alt1: broken-hearted

This unneeded alternative is created because broken-hearted, matches the HYPHENATED-WORDS regex. When a word matches a regex and can also be split, alternatives are created, because this handles well some sentences in our corpus batches. But it should not happen when the split is on ending punctuation.

Since this is an artifact of the current regex for HYPHENATED-WORDS, I propose to add: <HYPHENATED-WORDS>: !/[[:punct:]]$/

When it is added, the parsing of this sentence reaches null_count==2, but doesn't proceed even though it reports: Found 2147483647 linkages (0 of 1000 random linkages had no P.P. violations) at null count 2

I think it should be fixed to try a higher null_count in such cases.

ampli avatar Nov 30 '18 02:11 ampli

The broken-hearted sentence is marvelous, isn't it?

linas avatar Dec 03 '18 04:12 linas

Indeed. Now I only need to find out why it doesn't get parsed....

ampli avatar Dec 03 '18 19:12 ampli

This: "he is ready to have a scene and to become miserable". But oddly, the simpler "he is ready to jump and to run" does parse. Hmm.

linas avatar Dec 03 '18 20:12 linas

The verb "be": "he is ready to have a scene and to run" parses; "he is ready to have a scene and to be sad" fails.

linas avatar Dec 03 '18 20:12 linas

2 observations after this dict fix:

  1. In order for this sentence to parse, it needs something like !limit=10000 (and even then it has only one pp-valid linkage).
  2. Now the sentence "Agreeing that [...]" doesn't parse within the batch's !limit=1000. It now needs at least !limit=1361 in order to get parsed.

Since a common problem with long sentences is a very low density of pp-valid linkages, I propose to increase the linkage_limit for the fix-long batch to at least 10000 (or even more).

3 BTWs:

  1. It is interesting to run batches of long sentences with !verbosity=2 while changing various parse options.
  2. You may note that the "Built parse set" step for very long sentences sometimes takes a noticeable time, on the order of seconds. (In the fix-long batch only one sentence has such a high time, but you can see higher times if you parse with nulls or with a high !short value.) Most of this time could be eliminated by working with suffix_id's instead of connector addresses. But when I tried that, I got what seem like disjunct-cost problems (the costs in the output of !disjuncts are sometimes incorrect). And indeed, the disjuncts of connectors which have the same suffix_id may have different costs and shouldn't be cached together. I tried to fix that by setting a unique suffix_id also according to the connector cost, but for some reason that I don't understand it didn't fix the problem (maybe I had some bug). In any case I will try again, this time by another method - hashing by suffix_id + disjunct cost (in mk_parse_set()); a tiny sketch of such a hash appears after this list.
  3. For (1), I use something like:
       sed '/verbosity=/d' data/en/corpus-fix-long.batch | link-parser -v=2
     For changing parse options I have to add more components to the sed arguments, like:
       sed '/verbosity=/d;/limit=1000/d' data/en/corpus-fix-long.batch | link-parser -v=2 -limit=10000
     I thought of adding a new kind of argument style to override batch variable settings, something like:
       link-parser -verbosity:=2 -limit:=10000 < data/en/corpus-fix-long.batch
     Variables assigned this way would ignore later attempts to assign them with the regular =.
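Regarding the hashing by suffix_id plus disjunct cost mentioned in (2), here is a tiny sketch of what combining the two into a single cache key could look like. The names and types (tracon_cost_hash, suffix_id as an int, the cost as a double) are assumptions for illustration, not the actual mk_parse_set() code:

    #include <stdint.h>
    #include <string.h>

    /* Combine a connector's suffix_id with its disjunct's cost, so that
     * connectors sharing the same trailing sequence but carrying different
     * costs land in different cache buckets. */
    static uint64_t tracon_cost_hash(int suffix_id, double cost)
    {
        uint64_t cost_bits;
        memcpy(&cost_bits, &cost, sizeof cost_bits);  /* hash the exact bit pattern */
        uint64_t h = (uint64_t)(uint32_t)suffix_id;
        h ^= cost_bits + 0x9e3779b97f4a7c15ULL + (h << 6) + (h >> 2);
        return h;
    }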

ampli avatar Dec 04 '18 01:12 ampli

Status report: I adapted the code also for null_count>0. The speedup for the sentence "When Sir [...]" (fix-long batch) is 5x (for the null_count=0 step). There are no panics when running the fix-long batch with the default timeout (30 seconds).

The main problems with this code are:

  1. It is very complex and clutters do_count(). But maybe it can be encapsulated into 2 functions, one called at the start of do_count() (mainly to act on the cached value), and one at the end (to cache the result). A way to disable it can be added in order to parse without it if it is suspected to malfunction.
  2. Since the optimization is disabled in order to get a result for the 0-count caching, parses that are normally useless are done. The result is a connector table that is several times bigger. This slows down parses with a high null count. It may also cause an abnormally high CPU time for the "Built parse set" (mk_parse_set()) step, but this seems fixable (a fix that should be done in any case). BTW, the added 0-count cache entries are only a small fraction of the original connector table size, so they are not a problem.

I will continue to work on that, but I think that in any case it is better not to include it in the next release (however, an improved caching in mk_parse_set() should be included - yet to be done). I also have some other fixes to send, some of which also somewhat speed up short sentences.

BTW:

  • Post processing is a main bottleneck for short sentences, and my guess is that using connector descriptors will speed it up very much.
  • The trailing connectors idea can also be used for connector memory sharing. I still need to test that.
  • By looking at detailed traces of do_count() it seems most of its work is needlessly repeated. However, it is still not clear to me how to effectively reduce it.

In any case I will close PR #861 since it was only a demo of this method.

ampli avatar Dec 21 '18 21:12 ampli

I am preparing release 5.6.0 right now. Let me know if I should wait.

linas avatar Dec 21 '18 22:12 linas

One of numerous discussion threads suggests using two dictionaries: the normal one, and if the parse fails, then a second one with an extra NULL- & stuff on every disjunct, and NULL+ on every word. This would eliminate the need for the null-word parsing code.

It might be possible to have this in just one dictionary, and set the cost so that the cost on NULL is power-pruned in the first pass, and only allowed later, by increasing the cost.

linas avatar Dec 21 '18 22:12 linas

I am preparing release 5.6.0 right now. Let me know if I should wait.

For long sentences mk_parse_set() may take a very long time (even many minutes). I think I can fix that, and thought that maybe such a fix could be included in 5.6.0. So please wait a few more days...

I also have a ready PR (that needs some extra checks) that uses the same connector cache table (of do_count()) for both the null_count==0 and null_count>0 stages (currently in the one-step parse only, but this can be extended), and it also includes a change to link-parser to use a one-step parse on !null=1. Since it is ready, maybe it can be included in 5.6.0 too. This fix also brings some speedup for relatively short sentences, since it generates the suffix_id's only after power_prune(), which means fewer connectors to process.

Other ready changes that need final integration are the conversion of some functions from recursive to iterative (with a speedup). But these can wait for the next release.

ampli avatar Dec 21 '18 22:12 ampli

One of numerous discussion threads suggests using two dictionaries: the normal one, and if the parse fails, then a second one with an extra NULL- & stuff on every disjunct, and NULL+ on every word. This would eliminate the need for the null-word parsing code.

Regretfully, I could not find this thread. Do you have a link? Does it produce results like the current way does? How can you limit it to use the minimum possible null count? Otherwise, a large number of linkages, and even an exponential explosion, would seem to result for sentences that are not very short, which may make it hard to select a big enough set of minimal-null-count linkages (for sorting it and then getting the lowest metric).

ampli avatar Dec 21 '18 23:12 ampli

How can you limit it to use the minimum null count possible?

It seems it is still possible. I will test that.

ampli avatar Dec 21 '18 23:12 ampli

NULL

I could not find this thread

Perhaps I got the idea and just forgot to post it.

produce results like the current way?

Well, clearly not. It would allow the elimination of the null-word-count, which should both simplify the code, and also speed parsing in the non-null case.

explosion

If NULL is set to LENGTH-LIMIT=1 there should not be an explosion. It's hard for me to imagine that it would be slower than what we currently do.

select a set ... sorting it and getting then the lower metric

Hmm. Or maybe there is an explosion. My intuition is not providing any clear answer right now. It seems like it should work without much of a problem, but if it doesn't, I'd be curious to know why.

linas avatar Dec 22 '18 00:12 linas

wait a few days more

OK.

linas avatar Dec 22 '18 00:12 linas

For long sentences mk_parse_set() may take very long time (even many minutes). I think I can fix that, and thought that maybe such a fix can be included in 5.6.0.

I just sent PR #868 to fix that. I have a few more other fixes that I would like to send too.

ampli avatar Dec 22 '18 22:12 ampli

  3. For (1), I use something like:
       sed '/verbosity=/d' data/en/corpus-fix-long.batch | link-parser -v=2
     For changing parse options I have to add more components to the sed arguments, like:
       sed '/verbosity=/d;/limit=1000/d' data/en/corpus-fix-long.batch | link-parser -v=2 -limit=10000
     I thought of adding a new kind of argument style to override batch variable settings, something like:
       link-parser -verbosity:=2 -limit:=10000 < data/en/corpus-fix-long.batch
     Variables assigned this way would ignore later attempts to assign them with the regular =.

I encountered a problem in implementing that: it should still be possible to change such "read-only" variables from an interactive session. But since files can be piped or read from stdin, an interactive session has to be defined as stdin being a tty, and this is slightly non-general (but may still be useful). Another problem is that an additional argument (stdin) needs to be passed all the way from special_command() to x_issue_special_command() only for that purpose.
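For the "stdin from a tty" test itself, a minimal sketch (assuming POSIX; the function name is hypothetical) would be just an isatty() check:

    #include <stdbool.h>
    #include <stdio.h>
    #include <unistd.h>   /* isatty(), fileno() - POSIX */

    /* Treat the session as interactive only when the input stream is a terminal.
     * A variable set with ":=" would then stay locked against "=" assignments
     * coming from a piped batch file, while a user typing at a tty could still
     * change it. */
    static bool session_is_interactive(FILE *in)
    {
        return isatty(fileno(in)) != 0;
    }

This does not solve the second problem of threading the stream down to x_issue_special_command(), but the check itself is cheap.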

ampli avatar Dec 25 '18 13:12 ampli

"read-only" variables from an interactive session.

Do you really need this? It seems like this will only make the code more complicated; it requires more documentation for the user to read and understand. And using sed is already a reasonable workaround; it's not too hard, it's generic, it's not terribly verbose, and most(?) sophisticated users will be able to use it without confusion about what is actually happening...

linas avatar Dec 27 '18 22:12 linas

The documentation of such a feature could be included only in the "debug" documentation. I have already automated everything with scripts, and my benchmark script already contains a similar feature. So for now I will just add it to the scripts that need it (e.g. also to the one that invokes link-parser).

ampli avatar Dec 27 '18 23:12 ampli