link-grammar
link-grammar copied to clipboard
Expression structures are insane
The construction of the Exp
structure is nutty, such as the use of the E_list
structure which is a list that is always exactly zero or one long. That is, E_list
is not really needed; it could be folded into the Exp struct. I tried to do that in the left-right
branch in my git repo but ran into all sorts of ugly problems, because the original design was ... insane. Its surprisingly hard to make this seemingly simple change. It currently fails to build clauses; because of the crazy way that optional connectors are handled in the code.
Basically, I give up; lost too much time on this; this bug report is a fix-me-later report Half-done code is in left-right
branch, plus some debugging prints in lr
branch.
Turns out that the SAT solver code really likes the n-ary version of the OR_type expression. If join_alternatives
is altered to create a binary tree, instead of a list, then the SAT solver fails; re-writing SATEncoder::generate_satisfaction_for_expression()
to work with the binary form of the OR_type makes it ugly. The unroll
branch contains this work; if join_alternatives
is unrolled, then SAT fails, without this deeper redesign.
So, maybe this is a bad idea, in the end.
Closing. I spend two days hacking on this, and it was an unsatisfying experience.
Is there any other benefit for doing such a change, besides being more "sane"? BTW, rewriting the SAT-parser way of handling expressions is also in my list, so if I get to it there will not be a need for this conversion there.
I have just did it in a slightly different way. I unified E_list
and Exp
, i.e. added the next
field of E_list
to Exp
and eliminated E_list
. (At this step I have already eliminated the original next
field of Exp
by changing the dict code to use memory-pool (about 5% speedup on reading dicts. There is also a per-sentence speedup. I will send PR soon.) As a result of this unification, much of the code of handling E_list
got removed. Also, expressions then need yet (in addition to saving the original next
of Exp
) less total memory because the Exp pointer (that was in E_list
) got eliminated.
Here is the new Exp_struct
:
/**
* The Exp structure defined below comprises the expression trees that are
* stored in the dictionary. The expression has a type (OR_type, AND_type
* or CONNECTOR_type). If it is not a terminal "operand" points to its
* list of operands (when each of them points to the next one through
* "next"). Else "condesc" is the connector descriptor, when "dir"
* indicates the connector direction.
*/
struct Exp_struct
{
Exp *next; /* Next operand. */
Exp_type type; /* One of three types: AND, OR, or connector. */
char dir; /* The connector connects to the left ('-') or right ('+'). */
bool multi; /* TRUE if a multi-connector (for connector). */
union
{
Exp *operand; /* First operand (for non-terminals). */
condesc_t *condesc; /* Only needed if it's a connector. */
};
double cost; /* The cost of using this expression. */
};
Everything passed my tests, but final cleanup is still needed, including borrowing code from your left-right
branch to make some of my modifications cleaner.
But I don't like the names of the operand
and next
fields.
The operand
filed points to the operand list, i.e. to the first operand, and could be renamed to operand_first
. The next
field points to the next operand, and could be renamed to operand_next
.
Since these names are long and I found that they would cause the need to wrap lines in many places, maybe opd_first
and opd_next
are better (no line wraps in that case).
I also made the union unnamed for simplicity and transparency.
I the same occasion I also flattened the OR and AND purging in exprune.c
so very long operand lists will not need deep recursion.
BTW, it is possible to save 8 bytes in the Exp struct by using float
instead of double, but a preliminary benchmark didn't show any speedup. This idea can be used if there will be a need to add a field to this struct.
If these changes seem fine to you, I will send a PR.
BTW, what was the reason of using double
instead of float
(or even short int
)? (In the Disjunct struct I struggled to add more fields, through unions in yet unpublished changes, and additional 4 bytes could be useful.)
My final proposals for these fields are using full names: operand_first
and operand_next
.
(I will just use short loop variable name in order for for
lines not to be too long.)
insane
Perhaps it would be more accurate to say "hard to understand". But it also seems that maybe something changed along the way; this is unclear. So, maybe 4+ years ago, the Exp structures were made public, so that the Exp's could be imported directly into the AtomSpace. The loops in that code were .. non-sensical. But the code seemed to work .. or, if not, there was a bug that wasn't fixed until last month, when I found and fixed it. When I fixed it, the use of E_list actually "did make sense", so the question is "did it alwyas make sense, or did something change?" (I don't want to know the answer; I didn't write the orig code, and it works now, so all is OK).
The key requirement is that the traversal needs to be either obvious, or carefully explained; so that the opencog code can be updated. (Its in https://github.com/opencog/opencog/blob/master/opencog/nlp/lg-dict/LGDictExpContainer.cc and other files there. They've been extensively hacked over the years, so the code quality might not be as good as what you would expect.
After the E_list removal is incorporated into LG, I can send a PR to fix LGDictExpContainer.cc too
(I looked there and the fix is trivial).
However, I think it is better to do it using my proposal of added Exp API and making the Exp_struct opaque, as I posted in #969. It really belongs here so will post it here again.
Copied from PR #969.
My proposal for fixing LGDictExpContainer.cc
using added API is at the end.
Be aware that this is currently a part of the external API; there's only one user of that API, and it can be changed as needed. As long as its possible to have access to Exp's or anything like them, its OK.
I did the change in a compatible way - my new Exp_struct
is actually what used to be the E_list:
The eliminated E_list struct
includes a next
pointer (for the next operand) and an Exp pointer.
Instead of the Exp pointer I used the full Exp struct, and renamed the said next
pointer to operand_next
as follows:
46 struct Exp_struct
47 {
48 Exp *operand_next; /* Next same-level operand. */
49 Exp_type type; /* One of three types: AND, OR, or connector. */
50 char dir; /* The connector connects to the left ('-') or right ('+'). */
51 bool multi; /* TRUE if a multi-connector (for connector). */
52 union
53 {
54 Exp *operand_first; /* First operand (for non-terminals). */
55 condesc_t *condesc; /* Only needed if it's a connector. */
56 };
57 double cost; /* The cost of using this expression. */
58 };
(Note that the original next
pointer of Exp_struct
, used only for memory management, got eliminated in the the recent commit bae7d591e that introduced a dict Exp memory pool).
For the reference, this is the eliminated E_list_struct
:
struct E_list_struct // renamed to Exp_struct
{
E_list * next; // renamed to operand_next
Exp * e; // the full previous Exp_struct is used instead
};
In addition, for readability I renamed u.l
to operand_first
(seems clearer than operand_list
due to operand_next
).
So the code conversion is rather straightforward, but in the LG lib most of the specific E_list_struct processing code got eliminated.
Proposal: Add additional API call to make Exp future proof to internal changes (that may be needed according to my WIPs and TODOs):
Exp *lg_exp_operand_first(Exp *);
Exp *lg_exp_operand_next(Exp *);
When Exp is an opaque structure pointer.
Exp *lg_exp_operand_first(Exp *); Exp *lg_exp_operand_next(Exp *);
Seems reasonable.I tried to think of better names, but struggled. Maybe this:
Exp* lg_exp_sublist(const Exp *);
Exp* lg_exp_next(const Exp*);
Exp* lg_exp_next(const Exp*);
Is it clear then that it returns the next expression in the current sublist, as opposed to some next expression like the original next
?
It needs to be documented somehow. I think this is called "breadth-first traversal" or something like that. I'm thinking that people who write breadth-first-traversal code have some special vocabulary for that. So google points at https://www.cs.bu.edu/teaching/c/tree/breadth-first/
so maybe lg_exp_breadth_next
or (per wikipedia) lg_exp_adjacent_next
For the other one ... I dunno. lg_exp_sublist_first
or lg_exp_tree_down
or ... lg_exp_level_down
or lg_exp_next_level
or lg_exp_child
It's also not clear why way back in the original code, there was a decision to make it breadth-traversal instead of depth first. Depth-first is almost always easier for recursion. The opencog code converts it into a depth-first.
I think that now breadth-first is fine, since I converted the breadth-recursions to pure iterations (without any stack usage or even extra temporary heap memory usage) because expression chains may be very long in generated dicts and the stack could overflow. On the other hand, the depth (subexpression level) is shallow enough if long operand chains are used so its handling may remain recursive (and the code is clearer and not less efficient if so).
Here is why I think my proposed names are better:
I wrote:
Exp *lg_exp_operand_first(Exp *); Exp *lg_exp_operand_next(Exp *);
You wrote:
Exp* lg_exp_next(const Exp*);
so maybe
lg_exp_breadth_next
or (per wikipedia)lg_exp_adjacent_next
For the other one ... I dunno.
lg_exp_sublist_first
orlg_exp_tree_down
or ...lg_exp_level_down
orlg_exp_next_level
orlg_exp_child
The names I used were derived from the type of data that is actually returned, intentionally without referring to the physical organization of the data, which may vary.
Exp *lg_exp_operand_first(Exp *e);
Here e
is an operator (only) and the returned value is its first operand (NULL for a zeroary operator).
Exp *lg_exp_operand_next(Exp *e);
Here e
is of any type, and the returned value is the next operand (NULL if none).
lg_exp_breadth_next
I think this is not good because it refers to the current Exp implementation and it is better to decouple the names from the implementation details.
Even
lg_exp_sublist_first
orlg_exp_tree_down
or ...
hint on a particular physical organization (e.g. list).
lg_exp_level_down
orlg_exp_next_level
orlg_exp_child
This may be better than your other suggestions because it is a "concept" and not strictly a physical layout, but still it is not explicit as "operand" - it seems to me next_level
and child
are more appropriate for an abstract interface of scanning arbitrary hierarchical data, while "operand" (including "first operand" and "next operand") are specific terms for expressions.
BTW, there is a question which value lg_exp_operand_first()
(in whatever name we will use for it) should return if its argument is terminal. NULL can be used, but NULL is also returned for a zeroary operator. Maybe a sentinel may be used like EXP_ERR which is an invalid pointer so programs that don't check for it will abort (as opposed to just malfunction).
In any case we need to decide the names of these functions (and also the corresponding Exp_struct
field names) so I can send the PR (the SAT E_list
elimination still has unknown problems that I am debugging, but I hope to finish it today).
I was trying to think of names that would make it "obvious" how to use them.
Using NULL is not a problem, because the only terminals for lg_exp_operand_first() is a connector, or NULL-representing-zeroary There is no need for any other terminal in that sequence.
I just wondered what to return if lg_exp_operand_first()
is wrongly called with a connector (or even NULL). If it just returns e->operand_first
, that a crash would happen (immediately if e==NULL
, else eventually).
But it now seems to me there is no need for any check, since if the caller is confused then everything may happen and a special return value in such a case would not help.
Throwing on error is perhaps the biggest advantage that C++ has over C. I wonder why the C standards committee never introduced exceptions.