Cost adjust
(EDIT: Since I have already written it in the wrong box, I leave it here...)
It seems there are several problems in this code, along with several related problems elsewhere.
Regarding `condesc_t::cncost` - this is an improper place to put a multi-connector cost: there is a single `condesc_t` element for each connector string (i.e. all multi and non-multi connectors with the same connector string share a single `condesc_t` element). However, each multi-connector in the dict may have its own cost, and these costs may be different.

There is also a problem storing the cost in the connector itself, due to the 64-byte size limitation. A multi-connector cost table could be added to disjuncts, but there is a space constraint there too. So the only good place for multi-connector costs seems to be in `Sentence_s` - a hash table for that can be added.
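As a rough illustration of that idea, here is a minimal sketch of a per-sentence table mapping a connector string to its multi-connector cost. The type and function names (`mcost_table`, `mcost_set`, `mcost_get`) are hypothetical, not part of link-grammar, and a real implementation would resize the table rather than use a fixed slot count:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch: a table (conceptually owned by Sentence_s)
 * mapping a connector string (e.g. "Ds**") to the cost of the
 * corresponding multi-connector.  Fixed-size for brevity; a real
 * implementation would grow the table on the fly. */
#define MCOST_SLOTS 64

typedef struct {
    const char *name;   /* connector string; NULL if the slot is empty */
    double cost;        /* cost of the multi-connector "@name" */
} mcost_entry;

typedef struct {
    mcost_entry slot[MCOST_SLOTS];
} mcost_table;

static unsigned mcost_hash(const char *s)
{
    unsigned h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h % MCOST_SLOTS;
}

/* Insert or update an entry, using linear probing. */
void mcost_set(mcost_table *t, const char *name, double cost)
{
    unsigned i = mcost_hash(name);
    while (t->slot[i].name && strcmp(t->slot[i].name, name) != 0)
        i = (i + 1) % MCOST_SLOTS;
    t->slot[i].name = name;
    t->slot[i].cost = cost;
}

/* Look up a cost; unknown connectors carry no extra cost. */
double mcost_get(const mcost_table *t, const char *name)
{
    unsigned i = mcost_hash(name);
    while (t->slot[i].name) {
        if (strcmp(t->slot[i].name, name) == 0) return t->slot[i].cost;
        i = (i + 1) % MCOST_SLOTS;
    }
    return 0.0;
}
```

The table would be filled while reading the dict, once per distinct multi-connector string, which sidesteps the single-`condesc_t` sharing problem described above.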
Among the related problems:
- It seems there is a need to adjust not only the linkage cost but also the disjunct costs, so that reading disjunct costs through the API would give consistent results. However, it seems it is not a good idea to update the disjunct costs in their `Disjunct_struct`, because this will create a cost mess if relinking is needed (e.g. for implementing phantoms). To that end, an `adjust_cost` field could be added to `Disjunct_struct`, but again there is a space problem there.
- The disjunct string (as shown by `!disjunct`) should be updated to be consistent with the adjusted disjunct costs. However, it is not clear how to textually represent multi-matches of multi-connectors. (I also discovered a bug there regarding multi-connectors, which has no implications for the current dict/corpora or for your optional multi-connector examples.)
Regarding the code, it seems it has several things that are wrong (or that I don't understand). In any case, the code can be simplified, and that will save fixing these things. I will add my comments to the code itself.
It seems I answered in the wrong box and pressed the wrong button! Hopefully, no damage has resulted...
I propose a simpler algo:
1. For each word (like now).
2. For each multi-connector of a chosen disjunct (like now, but also `left` if needed).
3. Scan the link array and count the number N of `rc` (or `lc`, corresponding to right/left at (2)) with the same address as found in (2). An address comparison can be used because connectors are copied by reference. Alternatively, a `tracon_id` comparison can be done.
4. Adjust by (N-1) * the-multi-connector-cost.

The multi-connector costs can be taken from a hash table stored in `Sentence_s`. I have no good idea where to store the adjusted disjunct costs (for API fetch if needed and for `!disjunct` display) - maybe in an array in `Linkage_s` (with `num_words` elements).
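Steps (3) and (4) above can be sketched as follows. The `Connector` and `Link` types here are simplified stand-ins for link-grammar's real structs, and the helper names are illustrative; the point is the address comparison, which works because connectors are shared by reference:

```c
#include <stddef.h>

/* Simplified stand-ins for link-grammar's Connector and Link structs;
 * only the fields needed for this sketch are shown. */
typedef struct {
    const char *string;     /* connector string, e.g. "Ds**" */
} Connector;

typedef struct {
    Connector *lc;          /* left-end connector of the link */
    Connector *rc;          /* right-end connector of the link */
} Link;

/* Step (3): count links whose right connector is exactly `multi`.
 * An address comparison suffices because connectors are copied by
 * reference, so every link made through the same multi-connector
 * carries the same pointer. */
static int count_rc_matches(const Link *links, int num_links,
                            const Connector *multi)
{
    int n = 0;
    for (int i = 0; i < num_links; i++)
        if (links[i].rc == multi) n++;
    return n;
}

/* Step (4): a multi-connector matched N times adds (N - 1) times its
 * per-match cost, beyond the single match already counted in the
 * disjunct cost. */
static double multi_cost_adjustment(int n_matches, double multi_cost)
{
    return (n_matches > 1) ? (n_matches - 1) * multi_cost : 0.0;
}
```

A left-going multi-connector would be handled the same way, comparing `lc` instead of `rc`.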
When you say:

> The multi-connector costs can be taken from a hash table stored in `Sentence_s`.

Does this table currently exist, or would it have to be created?
> Does this table currently exist, or would it have to be created?

It doesn't exist. You can use `string_id()` on the connector name (with or without the `@`) to get a hash table index, but you need to resize this table on the fly as needed. (BTW, I already used `string_id()` in a similar way in several places. Using a generic hash-table function would be better - added to my to-do list...)
> Does this table currently exist, or would it have to be created?
>
> It doesn't exist.

OK, thanks. I'm going to keep this open as a to-do item. Finishing it would not be hard - an afternoon of creating the table and hooking it all up. But nothing depends on it right now, so I don't feel like cluttering up the current code with this new stuff.
FYI, this did eventually become urgent, and an alternate work-around was done in commit 1dcdd159f40d39f149cf9f7650387c284cf6945f which is adequate for now.
In any case I added this to my TODO list.