link-grammar icon indicating copy to clipboard operation
link-grammar copied to clipboard

Expression structures are insane

Open linas opened this issue 6 years ago • 17 comments

The construction of the Exp structure is nutty, such as the use of the E_liststructure which is a list that is always exactly zero or one long. That is, E_list is not really needed; it could be folded into the Exp struct. I tried to do that in the left-right branch in my git repo but ran into all sorts of ugly problems, because the original design was ... insane. Its surprisingly hard to make this seemingly simple change. It currently fails to build clauses; because of the crazy way that optional connectors are handled in the code.

Basically, I give up; lost too much time on this; this bug report is a fix-me-later report Half-done code is in left-right branch, plus some debugging prints in lr branch.

linas avatar Mar 29 '18 14:03 linas

Turns out that the SAT solver code really likes the n-ary version of the OR_type expression. If join_alternatives is altered to create a binary tree, instead of a list, then the SAT solver fails; re-writing SATEncoder::generate_satisfaction_for_expression() to work with the binary form of the OR_type makes it ugly. The unroll branch contains this work; if join_alternatives is unrolled, then SAT fails, without this deeper redesign.

So, maybe this is a bad idea, in the end.

linas avatar Apr 02 '18 10:04 linas

Closing. I spend two days hacking on this, and it was an unsatisfying experience.

linas avatar Apr 23 '18 20:04 linas

Is there any other benefit for doing such a change, besides being more "sane"? BTW, rewriting the SAT-parser way of handling expressions is also in my list, so if I get to it there will not be a need for this conversion there.

ampli avatar Apr 24 '18 04:04 ampli

I have just did it in a slightly different way. I unified E_list and Exp, i.e. added the next field of E_list to Exp and eliminated E_list. (At this step I have already eliminated the original next field of Exp by changing the dict code to use memory-pool (about 5% speedup on reading dicts. There is also a per-sentence speedup. I will send PR soon.) As a result of this unification, much of the code of handling E_list got removed. Also, expressions then need yet (in addition to saving the original next of Exp) less total memory because the Exp pointer (that was in E_list) got eliminated.

Here is the new Exp_struct:

/**
 * The Exp structure defined below comprises the expression trees that are
 * stored in the dictionary. The expression has a type (OR_type, AND_type
 * or CONNECTOR_type). If it is not a terminal "operand" points to its
 * list of operands (when each of them points to the next one through
 * "next"). Else "condesc" is the connector descriptor, when "dir"
 * indicates the connector direction.
 */
struct Exp_struct
{
	Exp *next;     /* Next operand. */
	Exp_type type; /* One of three types: AND, OR, or connector. */
	char dir;      /* The connector connects to the left ('-') or right ('+'). */
	bool multi;    /* TRUE if a multi-connector (for connector). */
	union
	{
		Exp *operand; /* First operand (for non-terminals). */
		condesc_t *condesc; /* Only needed if it's a connector. */
	};
	double cost;   /* The cost of using this expression. */
};

Everything passed my tests, but final cleanup is still needed, including borrowing code from your left-right branch to make some of my modifications cleaner.

But I don't like the names of the operand and next fields. The operand filed points to the operand list, i.e. to the first operand, and could be renamed to operand_first. The next field points to the next operand, and could be renamed to operand_next. Since these names are long and I found that they would cause the need to wrap lines in many places, maybe opd_first and opd_next are better (no line wraps in that case).

I also made the union unnamed for simplicity and transparency. I the same occasion I also flattened the OR and AND purging in exprune.c so very long operand lists will not need deep recursion.

BTW, it is possible to save 8 bytes in the Exp struct by using float instead of double, but a preliminary benchmark didn't show any speedup. This idea can be used if there will be a need to add a field to this struct.

If these changes seem fine to you, I will send a PR.

BTW, what was the reason of using double instead of float (or even short int)? (In the Disjunct struct I struggled to add more fields, through unions in yet unpublished changes, and additional 4 bytes could be useful.)

ampli avatar Jul 09 '19 08:07 ampli

My final proposals for these fields are using full names: operand_first and operand_next. (I will just use short loop variable name in order for for lines not to be too long.)

ampli avatar Jul 09 '19 13:07 ampli

insane

Perhaps it would be more accurate to say "hard to understand". But it also seems that maybe something changed along the way; this is unclear. So, maybe 4+ years ago, the Exp structures were made public, so that the Exp's could be imported directly into the AtomSpace. The loops in that code were .. non-sensical. But the code seemed to work .. or, if not, there was a bug that wasn't fixed until last month, when I found and fixed it. When I fixed it, the use of E_list actually "did make sense", so the question is "did it alwyas make sense, or did something change?" (I don't want to know the answer; I didn't write the orig code, and it works now, so all is OK).

The key requirement is that the traversal needs to be either obvious, or carefully explained; so that the opencog code can be updated. (Its in https://github.com/opencog/opencog/blob/master/opencog/nlp/lg-dict/LGDictExpContainer.cc and other files there. They've been extensively hacked over the years, so the code quality might not be as good as what you would expect.

linas avatar Jul 11 '19 02:07 linas

After the E_list removal is incorporated into LG, I can send a PR to fix LGDictExpContainer.cc too (I looked there and the fix is trivial).

However, I think it is better to do it using my proposal of added Exp API and making the Exp_struct opaque, as I posted in #969. It really belongs here so will post it here again.

ampli avatar Jul 11 '19 17:07 ampli

Copied from PR #969. My proposal for fixing LGDictExpContainer.cc using added API is at the end.

Be aware that this is currently a part of the external API; there's only one user of that API, and it can be changed as needed. As long as its possible to have access to Exp's or anything like them, its OK.

I did the change in a compatible way - my new Exp_struct is actually what used to be the E_list: The eliminated E_list struct includes a next pointer (for the next operand) and an Exp pointer. Instead of the Exp pointer I used the full Exp struct, and renamed the said next pointer to operand_next as follows:

     46 struct Exp_struct
     47 {
     48         Exp *operand_next; /* Next same-level operand. */
     49         Exp_type type; /* One of three types: AND, OR, or connector. */
     50         char dir;      /* The connector connects to the left ('-') or right ('+'). */
     51         bool multi;    /* TRUE if a multi-connector (for connector). */
     52         union
     53         {
     54                 Exp *operand_first; /* First operand (for non-terminals). */
     55                 condesc_t *condesc; /* Only needed if it's a connector. */
     56         };
     57         double cost;   /* The cost of using this expression. */
     58 };

(Note that the original next pointer of Exp_struct, used only for memory management, got eliminated in the the recent commit bae7d591e that introduced a dict Exp memory pool).

For the reference, this is the eliminated E_list_struct:

struct E_list_struct  // renamed to Exp_struct
{
        E_list * next;    // renamed to operand_next
        Exp * e;           // the full previous Exp_struct is used instead
};

In addition, for readability I renamed u.l to operand_first (seems clearer than operand_list due to operand_next).

So the code conversion is rather straightforward, but in the LG lib most of the specific E_list_struct processing code got eliminated.

Proposal: Add additional API call to make Exp future proof to internal changes (that may be needed according to my WIPs and TODOs):

Exp *lg_exp_operand_first(Exp *);
Exp *lg_exp_operand_next(Exp *);

When Exp is an opaque structure pointer.

ampli avatar Jul 11 '19 17:07 ampli

Exp *lg_exp_operand_first(Exp *); Exp *lg_exp_operand_next(Exp *);

Seems reasonable.I tried to think of better names, but struggled. Maybe this:

Exp* lg_exp_sublist(const Exp *);
Exp* lg_exp_next(const Exp*);

linas avatar Jul 15 '19 03:07 linas

Exp* lg_exp_next(const Exp*);

Is it clear then that it returns the next expression in the current sublist, as opposed to some next expression like the original next?

ampli avatar Jul 15 '19 08:07 ampli

It needs to be documented somehow. I think this is called "breadth-first traversal" or something like that. I'm thinking that people who write breadth-first-traversal code have some special vocabulary for that. So google points at https://www.cs.bu.edu/teaching/c/tree/breadth-first/

so maybe lg_exp_breadth_next or (per wikipedia) lg_exp_adjacent_next

For the other one ... I dunno. lg_exp_sublist_first or lg_exp_tree_down or ... lg_exp_level_down or lg_exp_next_level or lg_exp_child

linas avatar Jul 15 '19 19:07 linas

It's also not clear why way back in the original code, there was a decision to make it breadth-traversal instead of depth first. Depth-first is almost always easier for recursion. The opencog code converts it into a depth-first.

linas avatar Jul 15 '19 19:07 linas

I think that now breadth-first is fine, since I converted the breadth-recursions to pure iterations (without any stack usage or even extra temporary heap memory usage) because expression chains may be very long in generated dicts and the stack could overflow. On the other hand, the depth (subexpression level) is shallow enough if long operand chains are used so its handling may remain recursive (and the code is clearer and not less efficient if so).

ampli avatar Jul 16 '19 12:07 ampli

Here is why I think my proposed names are better:

I wrote:

Exp *lg_exp_operand_first(Exp *);
Exp *lg_exp_operand_next(Exp *);

You wrote:

Exp* lg_exp_next(const Exp*);

so maybe lg_exp_breadth_next or (per wikipedia) lg_exp_adjacent_next

For the other one ... I dunno. lg_exp_sublist_first or lg_exp_tree_down or ... lg_exp_level_down or lg_exp_next_level or lg_exp_child

The names I used were derived from the type of data that is actually returned, intentionally without referring to the physical organization of the data, which may vary.

Exp *lg_exp_operand_first(Exp *e); Here e is an operator (only) and the returned value is its first operand (NULL for a zeroary operator).

Exp *lg_exp_operand_next(Exp *e); Here e is of any type, and the returned value is the next operand (NULL if none).

lg_exp_breadth_next

I think this is not good because it refers to the current Exp implementation and it is better to decouple the names from the implementation details.

Even

lg_exp_sublist_first or lg_exp_tree_down or ...

hint on a particular physical organization (e.g. list).

lg_exp_level_down or lg_exp_next_level or lg_exp_child

This may be better than your other suggestions because it is a "concept" and not strictly a physical layout, but still it is not explicit as "operand" - it seems to me next_level and child are more appropriate for an abstract interface of scanning arbitrary hierarchical data, while "operand" (including "first operand" and "next operand") are specific terms for expressions.

BTW, there is a question which value lg_exp_operand_first() (in whatever name we will use for it) should return if its argument is terminal. NULL can be used, but NULL is also returned for a zeroary operator. Maybe a sentinel may be used like EXP_ERR which is an invalid pointer so programs that don't check for it will abort (as opposed to just malfunction).

In any case we need to decide the names of these functions (and also the corresponding Exp_struct field names) so I can send the PR (the SAT E_list elimination still has unknown problems that I am debugging, but I hope to finish it today).

ampli avatar Jul 18 '19 07:07 ampli

I was trying to think of names that would make it "obvious" how to use them.

Using NULL is not a problem, because the only terminals for lg_exp_operand_first() is a connector, or NULL-representing-zeroary There is no need for any other terminal in that sequence.

linas avatar Jul 18 '19 16:07 linas

I just wondered what to return if lg_exp_operand_first() is wrongly called with a connector (or even NULL). If it just returns e->operand_first, that a crash would happen (immediately if e==NULL, else eventually). But it now seems to me there is no need for any check, since if the caller is confused then everything may happen and a special return value in such a case would not help.

ampli avatar Jul 18 '19 17:07 ampli

Throwing on error is perhaps the biggest advantage that C++ has over C. I wonder why the C standards committee never introduced exceptions.

linas avatar Jul 18 '19 17:07 linas