
anysplit issues.

Open linas opened this issue 8 years ago • 52 comments

consider this:

link-parser amy
linkparser> !morp
linkparser> adsfasdfasdfasdf

This attempts the case with one part (no splits) and with 3 parts (two splits), but it never outputs any attempts to split into two parts.

Editing data/amy/4.0.affix and changing 3: REGPARTS+; to 4: REGPARTS+; never generates splits into 4 parts.

I tried setting "w|ts|pts|pms|pmms|ptts|ps|ts": SANEMORPHISM+; but this seemed to have no effect.

linas avatar Jan 25 '17 20:01 linas

In addition to fixing anysplit.c, I fixed only the amy dict... I didn't look at the any dict. I will try to fix it too.

But note that 4 parts are getting split to: pref= stem.= =suf1 =suf2. To do it another way, I need exact directions on how to mark them.

ampli avatar Jan 25 '17 20:01 ampli

sorry, I meant "amy".

linas avatar Jan 25 '17 20:01 linas

I took a glance at any. Its dict is not designed for morphology at all. So maybe no change is needed, and you need to test the amy dict with more than 3 parts.

ampli avatar Jan 25 '17 20:01 ampli

I just tried "w|ts|pts|pms|pmss|ptss|ps|ts": SANEMORPHISM+; i.e. using two suffixes, together with 4: REGPARTS+; and that does not seem to fix anything.

Side question: for Hebrew, if I had to split a word into all of its morphological components, how many pieces might it have (in the common cases)? I get the impression that almost every letter could be a morpheme by itself; is 6 enough, or would more be needed?

linas avatar Jan 25 '17 20:01 linas

The problem is that the current amy/4.0.affix in the repository is not the version that I included in PR #481. If you replace it with the version I provided, then it gets split fine as intended.

EDIT: My error. The file at master now also works fine with 3 parts. The bug is now with > 3 parts...

ampli avatar Jan 25 '17 21:01 ampli

Side question: for Hebrew, if I had to split a word into all of its morphological components, how many pieces might it have (in the common cases)? I get the impression that almost every letter could be a morpheme by itself; is 6 enough, or would more be needed?

In the common case it is up to 4 pieces at the start of a word. As a demonstration of what is possible, people have also constructed a 5-piece prefix. So 5 is the answer for prefixes. Only certain letters can be included in such a prefix. Each such piece consists of 1-3 characters. There are about 12 such strings (depending on how you count them). Of course, it is very common that what can look like a prefix is actually an integral part of the word, and also an isolated word may commonly have several meanings, depending on how many pieces you consider as a prefix and how many as an integral part of the word (creating a vast ambiguity). These start pieces have concatenative morphology (with a slight twist that I have not mentioned).

The end of a regular word can also include some (totally different) morphemes (usually 1-2 letters). I think up to 2.

Verb inflections have their own different prefixes/suffixes. Their morphology is not concatenative but, interestingly, there is a concatenative approximation for them (if you use a kind of an artificial "base").

(Note also that you will have a hard time concluding anything from derivational morphology - there are no definite rules in it.)

But how are you going to handle languages which have different codepoints for the same letter, depending on its position in the word? On first glance, this seems to ruin morphology conclusions unless you note that fact (e.g. by preprocessing, the equivalent of lowercasing the first letter in English).
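
For example, a minimal normalization pass could map the Hebrew word-final letter forms back to their ordinary forms before splitting. This is only an illustrative sketch, not anything in the library:

#include <uchar.h>

/* Map the Hebrew word-final letter forms to their ordinary forms, so that
 * the same letter gets the same codepoint regardless of its position in
 * the word. Illustrative sketch only. */
static char32_t hebrew_unfinalize(char32_t c)
{
	switch (c)
	{
		case 0x05DA: return 0x05DB;  /* final kaf   -> kaf   */
		case 0x05DD: return 0x05DE;  /* final mem   -> mem   */
		case 0x05DF: return 0x05E0;  /* final nun   -> nun   */
		case 0x05E3: return 0x05E4;  /* final pe    -> pe    */
		case 0x05E5: return 0x05E6;  /* final tsadi -> tsadi */
		default:     return c;
	}
}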

ampli avatar Jan 25 '17 22:01 ampli

I have just said:

The problem is that the current amy/4.0.affix in the repository is not the version that I included in PR #481. If you replace it with the version I provided, then it gets split fine as intended.

This is true for 3 parts, as appears in PR #481. When using the correct amy/4.0.affix, then indeed all seems to be fine.

But when I change it to 4, I also get a problem. I'll send a PR to correct it...

EDIT: The file at master now also works fine with 3 parts. The bug is now with > 3 parts...

ampli avatar Jan 25 '17 22:01 ampli

The actual problem is that 4 parts and more are currently translated to "multi suffix". I.e., adsfasdfasdfasdf can be broken as: adsf= asdfa.= =sdfa =sdf, but amy/4.0.dict doesn't provide a way for that to have a linkage!

It can be fixed in several ways, all of which need a dict modification:

  1. Provide me with another scheme to mark more than 3 parts.
  2. You can use a marking for middle morphemes, done solely in the dict: =SUF.= These middle morphemes can be linked to stems (if they have {@LL+}) or to previous middle morphemes, or to both, as you like (the same is said for "real" suffixes, i.e. the last token).

I think option (2) is reasonable.

ampli avatar Jan 25 '17 22:01 ampli

On Wed, Jan 25, 2017 at 4:12 PM, Amir Plivatsky [email protected] wrote:

Side question: for Hebrew, if I had to split a word into all of its morphological components, how many pieces might it have (in the common cases)? I get the impression that almost every letter could be a morpheme by itself; is 6 enough, or would more be needed?

In the common case it is up to 4 pieces at the start of a word. As a demonstration of what is possible, people have also constructed a 5-piece prefix. So 5 is the answer for prefixes. Only certain letters can be included in such a prefix. Each such piece consists of 1-3 characters. There are about 12 such strings (depending on how you count them). Of course, it is very common that what can look like a prefix is actually an integral part of the word, and also an isolated word may commonly have several meanings, depending on how many pieces you consider as a prefix and how many as an integral part of the word (creating a vast ambiguity). These start pieces have concatenative morphology (with a slight twist that I have not mentioned).

The end of a regular word can also include some (totally different) morphemes (usually 1-2 letters). I think up to 2.

Grand total sounds like maybe 7-8, plus maybe 3 more for verbs. Whew.

Verb inflections have their own different prefixes/suffixes. Their morphology is not concatenative but, interestingly, there is a concatenative approximation for them (if you use a kind of an artificial "base").

I understand it's not concatenative; I'm hoping that the more complex syntactic structures are enough to get this done right.

But how are you going to handle languages which have different codepoints for the same letter, depending on its position in the word? On first glance, this seems to ruin morphology conclusions unless you note that fact (e.g. by preprocessing, the equivalent of lowercasing the first letter in English).

Don't know. I'm also planning on not downcasing, and just seeing what happens. Ask again in a few months. I'm still in very early stages, and just trying to get a map for what the software needs to support.

--linas

linas avatar Jan 26 '17 05:01 linas

It can be fixed in several ways, all of which need a dict modification:

  1. Provide me with another scheme to mark more than 3 parts.
  2. You can use a marking for middle morphemes, done solely in the dict: =SUF.= These middle morphemes can be linked to stems (if they have {@LL+}) or to previous middle morphemes, or to both, as you like (the same is said for "real" suffixes, i.e. the last token).

I think option (2) is reasonable.

Ah! Yes. Clearly, I did not try very hard. My current plan is to use only two link types at the initial stages: one type between morphemes in the same word, and another between words. Later stages will create more link types, including appropriate ones for suffixes, prefixes, etc.

linas avatar Jan 26 '17 06:01 linas

OK, I just fixed something, and now a new issue arises (really, the original issue): due to the interaction between the morphology/wordgraph and the parser, the vast majority of parses are not sane morphisms. Thus, in pull reqs #486 and #485, I re-wrote an algo that keeps looping until it finds more sane morphisms. This works ... sort-of. For 4-word sentences, the sane morphisms can be one in a thousand. For 8-12 word sentences, the result can be one sane morphism in a million, which means far, far too many linkages to examine just to find one that works.

So, at the moment, splitting words into three parts kind-of-ish works on shorter sentences, but clearly, splitting into even more parts will not work, even on the shortest sentences.
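
For concreteness, here is a rough sketch of the kind of resampling loop meant above. It is not the code in #485/#486; the helper names are hypothetical stand-ins for the real internals (sane_linkage_morphism() is the real check, but the signature used here is assumed).

#include <stdbool.h>
#include <stddef.h>

typedef struct Linkage_s Linkage;                   /* opaque, hypothetical */
extern Linkage *extract_random_linkage(void);       /* hypothetical: draw one random parse */
extern bool passes_sane_morphism(const Linkage *);  /* stand-in for sane_linkage_morphism() */
extern void discard_linkage(Linkage *);             /* hypothetical */

/* Keep sampling until `wanted` sane linkages are collected or the try
 * budget runs out. When only one sample in a million is sane, max_tries
 * is exhausted long before `wanted` is reached -- the problem above. */
size_t collect_sane(Linkage **out, size_t wanted, size_t max_tries)
{
	size_t found = 0;
	for (size_t tries = 0; (tries < max_tries) && (found < wanted); tries++)
	{
		Linkage *lkg = extract_random_linkage();
		if (NULL == lkg) break;
		if (passes_sane_morphism(lkg))
			out[found++] = lkg;
		else
			discard_linkage(lkg);
	}
	return found;
}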

linas avatar Jan 27 '17 21:01 linas

and there is another, new strange issue: a single word of 8 letters now gets split into 1 or 2 or 3 parts, more-or-less as it should.

A single word of 11 letters is never split: 54 linkages are reported, and all but 3 of them are the same, and completely unsplit! This did work after pull req #481, but something later broke it. Bisecting now.

linas avatar Jan 27 '17 21:01 linas

For 8-12 word sentences, the result can be one sane morphism in a million, which means far, far too many linkages to examine just to find one that works.

It is possible to add an alternatives-position-hierarchy comparison to expression_prune(), power_prune(), and even to the fast-matcher (in which it can even be cached), so that matching mixed alternatives will be pruned ahead of time. Maybe even adding it only to expression_prune() will drastically increase the density of good results.

Also note that there is a memory leak in partial_init_linkage(). (Strangely, there is also a big memory leak with amy+sat - I will look at that.)

ampli avatar Jan 27 '17 22:01 ampli

Ah. Sort-of-found it. amy/4.0.affix had 4: REGPARTS+; and when I set it back to 3, it behaves more reasonably... but there are still issues. Words with 20 letters will often not split at all, and when they do split, they often split exactly the same way. Test case: a single "sentence" with one long word. Somehow, the randomness is poorly distributed.

linas avatar Jan 27 '17 22:01 linas

With more than 3 parts you need the dict change mentioned above...

With a word of >20 letters it looks fine for me.

To see the sampling, use the following: link-parser amy -m -v=9 -debug=anysplit.c

When I tried it with the word abcdefghijklmnopqrstuvwxyz the results looked "reasonable".

ampli avatar Jan 27 '17 22:01 ampli

The whole word is issued only once, as you can see by: link-parser amy -m -v=7 -debug=anysplit.c,issue_word_alternative,flatten_wordgraph,print.c

The fact that many linkages include only it seems to be an artifact of the classic parser. With the sat-parser this doesn't happen on my test word abcdefghijklmnopqrstuvwxyz.

ampli avatar Jan 27 '17 23:01 ampli

It is possible to add an alternatives-position-hierarchy comparison to expression_prune(), power_prune(), and even to the fast-matcher (in which it can even be cached), so that matching mixed alternatives will be pruned ahead of time. Maybe even adding it only to expression_prune() will drastically increase the density of good results.

Should I ask you to do this? It's kind of low-priority, but it's a blocker to more complex morphology work.

Also note that there is a memory leak in partial_init_linkage().

OK, thanks, I think I fixed it in #487

linas avatar Jan 27 '17 23:01 linas

When I tried it with the word abcdefghijklmnopqrstuvwxyz the results looked "reasonable".

I get this:

linkparser> abcdefghijklmnopqrstuvwxyz
Found 1766 linkages (162 of 162 random linkages had no P.P. violations)
Linkage 1, cost vector = (CORP=0.0000 UNUSED=0 DIS= 0.00 LEN=0)

    +------------------ANY------------------+
    |             +------------LL-----------+
    |             |                         |
LEFT-WALL abc[!MOR-STEM].= =defghijklmnopqrstuvwxyz[!MOR-SUFF]

Press RETURN for the next linkage.
linkparser>
Linkage 2, cost vector = (CORP=0.0000 UNUSED=0 DIS= 0.00 LEN=0)

    +----------ANY----------+
    |                       |
LEFT-WALL abcdefghijklmnopqrstuvwxyz[!ANY-WORD]

and then linkages 3, 7, and 12 are the same as linkage 1; linkages 4, 5, 6, 8, 9, 10, and 11 are the same as linkage 2

and so on; the first one that's different is linkage 28

linkparser>
Linkage 28, cost vector = (CORP=0.0000 UNUSED=0 DIS= 0.00 LEN=0)

    +---------------------------ANY--------------------------+
    |               +--------PL--------+----------LL---------+
    |               |                  |                     |
LEFT-WALL abcdefgh=[!MOR-PREF] ijk[!MOR-STEM].= =lmnopqrstuvwxyz[!MOR-SUFF]

and then it's back to case 1 or 2 until linkage 53 ...

linas avatar Jan 28 '17 00:01 linas

ah, indeed, with SAT, that repeated-issue goes away. Maybe with the classic algo, the random selector keeps hitting the same combination, over and over. I think I can kind-of guess why; it's a side-effect of the sparsity.

linas avatar Jan 28 '17 00:01 linas

Flip side: I tried the SAT parser on "Le taux de liaison du ciclésonide aux protéines plasmatiques humaines est en moyenne de 99 %." and after 8+ minutes of CPU, it's still thinking about it. Clearly, there's a combinatoric explosion, so even here, expression_prune() and power_prune() will be needed. Although I'm confused ... if I think about it naively, adding a sane-morphism check to expression-prune won't fix anything, will it?

What would fix things would be to have a different, unique link type for each different splitting, although that then makes the counting algorithm a bit trickier. I'd have to think about that some more.

linas avatar Jan 28 '17 00:01 linas

If you test this sentence using: link-parser amy -u -m -v=7 -debug=sane_linkage_morphism,print_with_subscript_dot you will see that it generates linkages very fast. However, you will also see that due to the extremely low density of good linkages, all of them are rejected by sane morphism...

In the sat-parser this can be solved by using sane-morphism constraints (thus making the use of the sane_linkage_morphism() function there unnecessary). Theoretically this is said to make it faster for linkages with potential mixing.

Maybe with the classic algo, the random selector keeps hitting the same combination, over and over.

I tend to think so - this needs checking.

I think I can kind-of guess why; it's a side-effect of the sparsity.

Is there still sparsity after the bad-sane-morphism deletion fix?

if I think about it naively, adding a sane-morphism check to expression-prune won't fix anything, will it?

You are right. The related fix should be applied to power_prune().

What would fix things would be to have a different, unique link type for each different splitting, although that then makes the counting algorithm a bit trickier. I'd have to think about that some more.

Maybe a kind of "checksum" can be computed for each linkage and hashed, enabling rejection of identical linkages.
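
For example, a minimal sketch of such a checksum, under an assumed link representation (the LinkRec type and the linear "seen" table here are hypothetical stand-ins, not the real Linkage API):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of one link: left/right word numbers and the
 * link label. */
typedef struct { int lw, rw; const char *label; } LinkRec;

/* 64-bit FNV-1a over the links of a linkage. Linkages with the same
 * links produce the same checksum. */
static uint64_t linkage_checksum(const LinkRec *links, size_t nlinks)
{
	uint64_t h = 0xcbf29ce484222325ULL;           /* FNV offset basis */
	for (size_t i = 0; i < nlinks; i++)
	{
		uint32_t pair[2] = { (uint32_t)links[i].lw, (uint32_t)links[i].rw };
		const unsigned char *b = (const unsigned char *)pair;
		for (size_t k = 0; k < sizeof pair; k++)
			h = (h ^ b[k]) * 0x100000001b3ULL;    /* FNV prime */
		for (const char *p = links[i].label; '\0' != *p; p++)
			h = (h ^ (unsigned char)*p) * 0x100000001b3ULL;
	}
	return h;
}

/* Reject a linkage whose checksum has already been seen (linear table
 * for brevity; a real implementation would use a hash table). */
static bool seen_before(uint64_t *seen, size_t *nseen, uint64_t h)
{
	for (size_t i = 0; i < *nseen; i++)
		if (seen[i] == h) return true;
	seen[(*nseen)++] = h;
	return false;
}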

ampli avatar Jan 28 '17 02:01 ampli

Here is my plan for mixed alternatives pruning:

  1. Put in each connector the word member from its disjunct.
  2. Use it in possible_connection() to reject match between connectors of different alternatives.

In order not to increase the size of the Connector struct, I thought of sharing tableNext:

struct Connector_struct
{
	...
	union
	{
		Connector * tableNext;   /* not used after expression_prune() */
		const Gword **word;      /* assigned just before power_prune() */
	};
};

This way no changes are needed in the usage of tableNext. Just before the call to power_prune() this word can be assigned.

I have no idea how much overhead this may add to sentences with a few alternatives. For sentences with no alternatives at all this can be skipped, and for sentences with many alternatives I guess it may significantly reduce the linkage time.

ampli avatar Jan 28 '17 02:01 ampli

The sparsity is still there, in the classic algo. For the abcdefghijklmnopqrstuvwxyz test, it counts 1766 linkages, but then, with random sampling, finds only 162 out of 1000 random samples.

If I increase the limit to 2000, then it counts as before, but then later revises that count to 17, because it can exhaustively enumerate all of these. It's sort of surprising behavior, that the exhaustive attempt revises the count; it's kind-of a feature-bug, I guess.

Hashing the linkage sounds like a good idea. But fixing the sparsity at an earlier stage seems more important.

Playing with unions is kind-of like playing with fire. I know that performance is sensitive to the connector size, but I don't recall any specifics. At one point, I recall measuring that 2 or 3 or 5 connectors would fit onto one cache line, which at the time seemed like a good thing. Now I'm less clear on this.

There is a paper on link-grammar describing "multi-colored" LG: connectors would be sorted into different "colored" categories that would work independently of each other. This allowed the authors to solve some not-uncommon linguistic problem, although I don't recall quite what. Because they're independent, there are no link-crossing constraints between different colors -- there are no constraints at all between different colors.

Given how Bruce Can was describing Turkish, it seems like it might be a language that would need multi-colored connectors.

Of course, I'm talking about this because perhaps, instead of thinking "this gword/morpheme can only connect to this other gword/morpheme", and enforcing it in possible_connection() -- perhaps a better "mindset" would be to think: "this gword/morpheme has a blue-colored connector GM67485+ that can only connect to this other blue-colored connector GM67485-". The end-result is the same, but the change of viewpoint might make it more natural and naturally extensible... (clearly, it's islands_ok for these blue connectors)

linas avatar Jan 28 '17 05:01 linas

Playing with unions is kind-of like playing with fire.

In that particular case there is no problem with it, as tableNext is not in use after expression_prune().

"this gword/morphme has a blue-colored connector GM67485+ that can only connect to this other blue-colored connector GM67485-"

The problem is that these color labels are scalars, while a token hierarchy position is a vector. See wordgraph_hier_position() and in_same_alternative().

ampli avatar Jan 28 '17 05:01 ampli

Hmm. Well, but couldn't the vector be turned into a hash? Comparing hashes would in any case be faster than comparing vectors. You don't even need to use hashes -- just some unique way of turning that vector into an integer, e.g. just by enumerating all possibilities for that given word.

linas avatar Jan 28 '17 05:01 linas

Hmm. Well, but couldn't the vector be turned into a hash? Comparing hashes would in any case be faster than comparing vectors. You don't even need to use hashes -- just some unique way of turning that vector into an integer, e.g. just by enumerating all possibilities for that given word.

Note that in_same_alternative() doesn't compare whole vectors, but parts of vectors, where the number of vector components that are compared is not known in advance. The comparison stops when 2 vector components are not equal.

Say you have 3 tokens, A, B and C. A can connect to B and to C, but B cannot connect to C. How do you assign numbers that can resolve that via comparisons?

To see how complex connectivity rules between tokens can arise, consider that every token can split again, and the result can split again, creating a hierarchy of splits (the word-graph). But even a sentence with one level of alternatives (like Russian or Hebrew without spell-correction) has these kinds of relations - tokens of an alternative can connect to sentence words, but tokens of one alternative cannot connect to tokens of another alternative if both alternatives are of the same word, though they can connect to the alternatives of another word. (If this is not clear, try to look at it deeply, and especially think of spell correction that separates words and also gives alternatives, and then all of these tokens get broken into morphemes, each in more than one way.)

To find if tokens a and b are from the same alternative, the algo looks at their hierarchy position vectors Va and Vb. It compares their components one by one, until they are not equal. If an even number of components are equal, then the tokens can connect; otherwise they cannot.

Usually the position hierarchy vectors have 0 to 4 elements (corresponding to hierarchy depth 0 to 2), so there is not much overhead in their "comparisons". For a sentence without alternatives, all the position hierarchy vectors are of length 0. One shortcut that I thought of is to use the same pointer for all equal vectors (like string-set), because tokens with equal vectors can always connect - these vectors always contain an even number of components (tokens with unequal vectors may or may not connect, as I mentioned above).
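
A minimal sketch of that comparison, under an assumed representation (an int array plus a length, standing in for the real Gword-pointer vectors produced by wordgraph_hier_position()):

#include <stdbool.h>
#include <stddef.h>

/* Assumed representation of a token's hierarchy-position vector. */
typedef struct
{
	const int *hier_position;
	size_t hier_depth;
} TokenPos;

/* Compare the components one by one until they differ; the two tokens
 * may connect iff an even number of leading components are equal. */
static bool tokens_may_connect(const TokenPos *a, const TokenPos *b)
{
	size_t n = (a->hier_depth < b->hier_depth) ? a->hier_depth : b->hier_depth;
	size_t equal = 0;

	while ((equal < n) && (a->hier_position[equal] == b->hier_position[equal]))
		equal++;

	return 0 == (equal % 2);
}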

ampli avatar Jan 28 '17 06:01 ampli

It's really late at night and I'm about to go to bed, so my reply might be off-the-wall, cause I'm not reading or thinking about your code ... but ... hey: the disjunct is a vector. Each connector is a single item in the vector.

I mean, that's kind-of the whole point about all that blather about category theory -- the category of Hilbert spaces allows you to define tensors, where you can contract upper and lower (co- and contra-variant) indexes on vectors and tensors. The LG grammar, and most/all categorial grammars, are quite similar, just enriching the possibilities of how to contract (match) indexes (connectors): you can force contraction to the left or right (Hilbert spaces make no left-right distinction) and you can mark some connectors as optional.

Now, the classic LG connectors and connection rules were fairly specific, but I've enriched them with more stuff, and we can enrich or elaborate further. So, step back, and think in abstract terms: instead of calling them vectors, call them a new-weird-disjunct-thing; each vector-component can be called a new-weird-connector-thing. And then some ground rules: can we have more than one link between two morphemes? What else might be allowed or prohibited, generically, for these new things?

My gut intuition is that developing this abstract understanding clarifies the problem, and the solution, even if the resulting C code ends up being almost the same... and the new abstract understanding might give an idea of a better C implementation.

I'll try a more down-to-earth reply sometime later.

linas avatar Jan 28 '17 07:01 linas

I implemented not-same-alternative prune in prune.c. I didn't think about efficiency when doing it, so the change is small.

In possible_connection() before easy_match():

	/* The connection is possible only if at least one originating-word
	 * pair of the two connectors is in the same alternative. */
	bool same_alternative = false;
	for (Gword **lg = (Gword **)lc->word; NULL != (*lg); lg++) {
		for (Gword **rg = (Gword **)rc->word; NULL != (*rg); rg++) {
			if (in_same_alternative(*lg, *rg)) {
				same_alternative = true;
			}
		}
	}
	if (!same_alternative) return false;

To support that, the word member of Connector (see a previous post) is initialized at the start of power_prune():

	for (w = 0; w < sent->length; w++) {
		for (d = sent->word[w].d; d != NULL; d = d->next) {
			for (c = d->right; NULL != c; c = c->next)
				c->word = d->word;
			for (c = d->left; NULL != c; c = c->next)
				c->word = d->word;
		}
	}

The same linkages are produced, so I guess this indeed doesn't remove anything that is needed. However, surprisingly, batch run times are not reduced. Still, debugging shows the added check returns false on mismatched alternatives (only). Also, the amy test of `abcdefghijklmnopqrstuvwxyz` got improved. Before:

Found 1766 linkages (162 of 162 random linkages had no P.P. violations)

After:

Found 1496 linkages (289 of 289 random linkages had no P.P. violations)

If this improvement seems worthwhile, I can add efficiency changes to it and send a PR.

ampli avatar Jan 28 '17 23:01 ampli

Next thing to try is maybe add such checks to the fast-matcher.

ampli avatar Jan 28 '17 23:01 ampli

In the previous post I said:

Next thing to try is maybe add such checks to the fast-matcher.

Ok, I tested this too. It doesn't do anything more than has already been done in prune.c.

There is another constraint on possible connections between alternatives that is maybe the reason for most of the insane-morphism connections:

A word cannot connect to tokens from different alternatives of another token.

Consider this situation: We have 2 words - A and B. Word B has two alternatives - C and "D E".

A   B
     alt1: C
     alt2: D E

A cannot connect to C and E at the same time.

However, I don't know how to implement an a priori check for that (in power_prune() or elsewhere) ...
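
For illustration only, here is a post-hoc form of that check (not the a priori check in question); the Tok struct is a hypothetical stand-in, with origin identifying the token a piece came from and alt identifying which of its alternatives it belongs to:

#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a word-piece. */
typedef struct { int origin; int alt; } Tok;

/* Post-hoc form of the constraint: among the tokens one word links to,
 * no two may come from the same origin token but different alternatives
 * of it -- e.g. linking to both C (alt1 of B) and E (alt2 of B). */
static bool links_respect_alternatives(const Tok *linked, size_t n)
{
	for (size_t i = 0; i < n; i++)
		for (size_t j = i + 1; j < n; j++)
			if ((linked[i].origin == linked[j].origin) &&
			    (linked[i].alt != linked[j].alt))
				return false;
	return true;
}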

ampli avatar Jan 29 '17 01:01 ampli