A "crazy" behaviour in counting linkages...

Open ampli opened this issue 7 years ago • 3 comments

Test sentence: Something you need to do before you watch TV is turn on the TV.

$ link-parser
link-grammar: Info: Dictionary found at ../en/4.0.dict
link-grammar: Info: Dictionary version 5.4.3, locale en_US.UTF-8
link-grammar: Info: Library version link-grammar-5.4.1. Enter "!help" for help.
linkparser> !limit=100
linkparser> Something you need to do before you watch TV is turn on the TV.
Found 432 linkages (81 of 100 random linkages had no P.P. violations)
...
linkparser> !limit=500
linkparser> Something you need to do before you watch TV is turn on the TV.
Found 272 linkages (224 had no P.P. violations)
...

At first glance something here seems impossible: with a limit of 100 linkages we get more linkages than with a limit of 500?!

Checking it further reveals it works according to our definitions!

  • With a limit of 100, which is less than the number of linkages, the total number of linkages is reported without being checked for sane-morphism. Only a sample of 100 is actually checked, and 19 of those linkages then fail sane-morphism.
  • With a limit of 500, all the linkages are first checked for sane-morphism, and only those that pass it are counted. Some of them have real P.P. violations.
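
The two reporting paths in the bullets above can be sketched as a toy model. This is only an illustration in Python, not the library's code: `report` and the `(sane, pp_ok)` tuples are hypothetical names, and the 160/48/224 split is inferred from the `!limit=500` output (432 - 272 and 272 - 224), assuming the raw count is 432 in both runs.

```python
import random


def report(linkages, limit, seed=0):
    """Toy model of the two reporting paths (hypothetical, not library code).

    `linkages` holds one (sane, pp_ok) pair per raw linkage produced by
    the counting stage.
    """
    total = len(linkages)
    if total > limit:
        # Over the limit: the raw count is reported as-is and only a
        # random sample of `limit` linkages is actually checked.
        sample = random.Random(seed).sample(linkages, limit)
        good = sum(1 for sane, pp_ok in sample if sane and pp_ok)
        return (f"Found {total} linkages "
                f"({good} of {limit} random linkages had no P.P. violations)")
    # Within the limit: every linkage is checked, and the ones that fail
    # sane-morphism are dropped before the count is reported.
    sane_ones = [(sane, pp_ok) for sane, pp_ok in linkages if sane]
    good = sum(1 for _, pp_ok in sane_ones if pp_ok)
    return f"Found {len(sane_ones)} linkages ({good} had no P.P. violations)"


# Assumed split of the 432 raw linkages: 160 fail sane-morphism,
# 48 are sane but have P.P. violations, 224 are fully valid.
linkages = [(False, True)] * 160 + [(True, False)] * 48 + [(True, True)] * 224

print(report(linkages, 100))
print(report(linkages, 500))
```

With `limit=500` the model reports 272 linkages, 224 of them clean; with `limit=100` it reports the raw 432, and the number of clean linkages depends on which ones are sampled.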

Question: Is there a way to fix this behaviour? If not, this post may just serve as a documentation of this problem and we can close this issue.

BTW, one may wonder why there are sane-morphism violations in this sentence, which doesn't seem to have alternatives. The answer is that it does: TV has an alternative T V (multi-unit separation). (The new multi-unit handling - PR soon - solves this and worse problems due to multi-unit separation.)

ampli, Nov 11 '17 22:11

I remember seeing this before. I thought I fixed it; I guess not.

This is one reason why using sane-morphism is a problem: most of the work is done during counting.

It might be possible to have a version of sane_morphism that can run during counting. I don't know how much information is available at that point, so I don't know if this is possible or not.

linas, Nov 15 '17 01:11

Part of the sane-morphism check is already done in form_match_list(), but it can check only one of the conditions that lead to "insane-morphism".

In do_count(), maybe another type of check can be added just before the multiplication (checking that lw-w-rw are from the same alternative). The alternative attribute of the pivot word could perhaps also be passed as an additional argument to do_count().

But checking the null block (if null_count>0) seems much more problematic. The problem is that it can contain words that are not from the same alternative as other words, which reduces the number of actual null-words in that block (because some shouldn't be counted). I already have a branch that tries to compute the real number of null-words in a null-block, but I had to add the disjuncts as arguments to do_count(), and for efficiency a per-slot Gword table needs to be prepared in advance for fast comparisons (which I didn't do, and I gave up). The worst problem, though, is that there are many potential null-blocks (in partial linkages that are later rejected), so most of this complex work is wasted. With a separate sane-morphism check, however, far fewer null-blocks need checking when null_count>0.

BTW, on this occasion I would like to add something I have never mentioned: saying "words which belong to the same alternative" is a statement about (directional) connectivity. It means these words are (directionally) connected in the word-graph. This is the reason the connectivity attribute of a word is not a scalar: it carries the information about the word's position in the graph. The term "same alternative" is relative: it means the same alternative of a common ancestor word. For example, the original sentence words are from the same alternative when their common ancestor is the sentence itself (a "word" which is split into subwords by whitespace).

So all the words in a linkage must be connected in the word-graph, and in addition words in a null-block which are not connected to words in the linkage shouldn't be counted as null-words. (I have more to say about null-words, but maybe in another discussion.)
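
The connectivity reading of "same alternative" can be illustrated with a small sketch. It is a toy Python model under stated assumptions, not the library's word-graph API: `WORD_GRAPH`, `reachable`, `same_alternative` and `linkage_is_sane` are hypothetical names, and the graph covers only the fragment "watch TV is" with the T V split mentioned earlier.

```python
from itertools import combinations

# Hypothetical word-graph fragment: each word points to the words that may
# follow it. "TV" and "T V" are two alternatives of the same token.
WORD_GRAPH = {
    "watch": ["TV", "T"],
    "TV": ["is"],
    "T": ["V"],
    "V": ["is"],
    "is": [],
}


def reachable(graph, src, dst):
    """Return True if a directed path leads from src to dst."""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, ()))
    return False


def same_alternative(graph, a, b):
    """Two words are from the same alternative if one can reach the other."""
    return reachable(graph, a, b) or reachable(graph, b, a)


def linkage_is_sane(graph, words):
    """Every pair of words in the linkage must be directionally connected."""
    return all(same_alternative(graph, a, b) for a, b in combinations(words, 2))


print(linkage_is_sane(WORD_GRAPH, ["watch", "TV", "is"]))       # one alternative
print(linkage_is_sane(WORD_GRAPH, ["watch", "T", "V", "is"]))   # the other one
print(linkage_is_sane(WORD_GRAPH, ["watch", "TV", "V", "is"]))  # mixes both
```

In this model the first two word sets lie on a single path through the graph, while the third mixes "TV" with "V", which no directed path connects.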

ampli, Nov 15 '17 12:11

I understand what you are saying, but I don't yet have any clear vision for what to do about it. I will have to ponder this. In the meantime, whether or not you want to make changes to do_count, that's OK.

It's actually an interesting problem, from the point of view of geometry. By "geometry" I mean the shape of what's connecting to what. I've made significant updates to the "sheaves" document since what I sent you. I see now that I completely failed to think about alternatives. And null words. I'll have to do that. (I started to add a section on how neural nets do this kind of stuff. It's not very informative, but I hope to gain a better overview of how all these different systems accomplish similar things, and why, and where things can be better.) The most recent version is always here: https://github.com/opencog/atomspace/blob/master/opencog/sheaf/docs/sheaves.pdf

linas, Nov 15 '17 14:11