link-grammar
dictionary_lookup_list does not look up regexes!
dictionary_lookup_list() was added to the public dictionary API so that other users (specifically, the sureal (surface realization) and microplanning modules for sentence generation) could look up words in the LG dictionary. That's mostly fine, except that it does not look up regexes. So:
-- the LG internals need to be jiggered around, so that the public API does look up regexes.
A related issue is that the public API exposes the Exp structure, which has a bizarre design. It needs to be reworked so that it's cleaner and nicer for the ordinary user. Unfortunately, this is a lot of work.
Do you mean something like that (and the same for db_lookup_list()):
Dict_node * file_lookup_list(const Dictionary dict, const char *s)
{
    /* First try an exact lookup of the word itself. */
    Dict_node * llist =
        rdictionary_lookup(NULL, dict->root, s, true, dict_order_bare);
    llist = prune_lookup_list(llist, s);
    if (NULL != llist) return llist;

    /* No exact match: fall back to the entry named by a matching regex. */
    const char *regex_name = match_regex(dict->regex_root, s);
    if (regex_name) return file_lookup_list(dict, regex_name);
    return NULL;
}
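The same exact-then-regex fallback can be tried outside the library with a toy table. The following is only a minimal sketch: `toy_entry`, `toy_regex`, `toy_lookup()` and the table contents are hypothetical and are not LG internals, and POSIX `<regex.h>` stands in for whatever regex engine the library is built with. It also reports which regex resolved the word, which is the extra piece of information a caller would otherwise have no way to get.

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical toy tables; these are NOT the real LG structures or entries. */
typedef struct { const char *word; const char *expr; } toy_entry;
typedef struct { const char *name; const char *pattern; } toy_regex;

static const toy_entry toy_dict[] = {
    { "cat",     "A+" },        /* placeholder expressions */
    { "NUMBERS", "B- or C+" },  /* entry reached only via the regex below */
};

static const toy_regex toy_regexes[] = {
    { "NUMBERS", "^[0-9]+$" },
};

/* Exact-match lookup; returns NULL if the word is absent. */
static const toy_entry *toy_exact_lookup(const char *s)
{
    for (size_t i = 0; i < sizeof toy_dict / sizeof toy_dict[0]; i++)
        if (strcmp(toy_dict[i].word, s) == 0) return &toy_dict[i];
    return NULL;
}

/* Return the name of the first regex that matches s, or NULL. */
static const char *toy_match_regex(const char *s)
{
    for (size_t i = 0; i < sizeof toy_regexes / sizeof toy_regexes[0]; i++) {
        regex_t re;
        if (regcomp(&re, toy_regexes[i].pattern, REG_EXTENDED | REG_NOSUB) != 0)
            continue;  /* skip patterns that fail to compile */
        int rc = regexec(&re, s, 0, NULL, 0);
        regfree(&re);
        if (rc == 0) return toy_regexes[i].name;
    }
    return NULL;
}

/*
 * Exact lookup first; on a miss, look up the entry named by the first
 * matching regex. *regex_name is set to that name, or to NULL when the
 * word was found directly (or not found at all).
 */
const toy_entry *toy_lookup(const char *s, const char **regex_name)
{
    const toy_entry *e = toy_exact_lookup(s);
    *regex_name = NULL;
    if (e != NULL) return e;
    *regex_name = toy_match_regex(s);
    if (*regex_name == NULL) return NULL;
    e = toy_exact_lookup(*regex_name);
    if (e == NULL) *regex_name = NULL;  /* regex names no dictionary entry */
    return e;
}
```

Unlike the recursive call above, this sketch looks the regex name up with the exact-match function only, so a regex name that happens to match another regex cannot cause a second fallback.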
Regarding expression format, what would be considered a nice format?
Is an ASCII representation, like the output of expression_stringify(), better than the current C structure?
> Do you mean something like that
Yes.
> stringify()
Yes, that would probably be best: return a string that resembles the current ASCII dictionary format. I don't recall exactly what expression_stringify() prints.
> I don't recall exactly what expression_stringify() prints.
It prints the expression in the current dictionary format. So the question now is how to implement it.
I guess we will need a new API.
The easiest way seems to be to have two API functions, something like:
dictionary_lookup_words() # Return list of words
dictionary_lookup_exp() # Return list of corresponding expressions
Or maybe we can have one function, dictionary_lookup(), that returns a list of the form word1, exp1, word2, exp2, and so on.
One question is whether we need to add a third component that indicates if the word has been resolved through a regex.
Or maybe we can use a JSON format, which can be extended in a compatible way for any future need, and API users can decode it with a JSON library function if desired. We can use such a JSON API as a model for some additional APIs we still need to add.
> JSON format
Yes, I like that best! Some of the current APIs could/should be provided in JSON.
Proposal:
const char *linkgrammar_get_dict_word(Dictionary dict, const char *word);
JSON example:
{
  "numentries": 5,
  "entries":
  [
    {
      "word": "word1.s2",
      "regex-name": null,
      "idiom": false,
      "expr": "(((dWV- or dCV- or dIV-) & {VC+}) or [()])"
    },
    {
      "word": "word2.s3",
      ...
    }
  ]
}
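To make the shape of one element of "entries" concrete, here is a minimal sketch of how such an element could be serialized in C. The `lookup_entry` struct and the `entry_to_json()` helper are hypothetical names invented for this illustration; they are not part of the LG API, and the fixed buffer sizes are arbitrary.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical record mirroring one element of "entries" in the proposal. */
typedef struct {
    const char *word;
    const char *regex_name;  /* NULL when the word was found by exact match */
    bool idiom;
    const char *expr;
} lookup_entry;

/* Write s into buf as a JSON string literal; returns its length, or -1
 * if buf is too small. Bytes >= 0x80 (UTF-8) are passed through as-is. */
static int put_json_string(char *buf, size_t size, const char *s)
{
    size_t n = 0;
    if (size < 3) return -1;
    buf[n++] = '"';
    for (; *s != '\0'; s++) {
        unsigned char c = (unsigned char)*s;
        if (n + 8 > size) return -1;  /* room for \uXXXX, quote and NUL */
        if (c == '"' || c == '\\') {
            buf[n++] = '\\';
            buf[n++] = (char)c;
        } else if (c < 0x20) {
            n += (size_t)snprintf(buf + n, size - n, "\\u%04x", (unsigned)c);
        } else {
            buf[n++] = (char)c;
        }
    }
    buf[n++] = '"';
    buf[n] = '\0';
    return (int)n;
}

/* Serialize one entry; returns the JSON length, or -1 if it does not fit. */
int entry_to_json(const lookup_entry *e, char *buf, size_t size)
{
    char word[256], expr[1024], regex[256] = "null";
    if (put_json_string(word, sizeof word, e->word) < 0) return -1;
    if (put_json_string(expr, sizeof expr, e->expr) < 0) return -1;
    if (e->regex_name != NULL &&
        put_json_string(regex, sizeof regex, e->regex_name) < 0) return -1;
    int n = snprintf(buf, size,
        "{\"word\": %s, \"regex-name\": %s, \"idiom\": %s, \"expr\": %s}",
        word, regex, e->idiom ? "true" : "false", expr);
    return (n < 0 || (size_t)n >= size) ? -1 : n;
}
```

Escaping matters here because the expression string is copied verbatim from the dictionary, and nothing guarantees it is free of quotes or backslashes.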
I don't know if extending it this way may be useful:
"base": "word1",
"subscript": "s2",
"dnf-expr": [ {"cost": 0, "expr": ["A-", "B+", "C+"]}, {"cost": 2, "expr": ["D-"]}, ... ]
(Or put the - and + connectors into separate arrays.)
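If the - and + connectors were reported separately, each disjunct would need to be partitioned by the direction mark at the end of each connector. A minimal sketch of that partitioning follows; `split_disjunct`, `split_connectors()` and the `MAX_CONN` limit are hypothetical names chosen for this illustration, not anything from LG.

```c
#include <stddef.h>
#include <string.h>

#define MAX_CONN 8  /* arbitrary limit for this sketch */

/* Hypothetical container for one disjunct, split by connector direction. */
typedef struct {
    const char *left[MAX_CONN];   /* connectors ending in '-' */
    const char *right[MAX_CONN];  /* connectors ending in '+' */
    size_t nleft, nright;
} split_disjunct;

/* Partition n connector strings by their trailing direction mark.
 * Returns 0 on success, -1 on a malformed connector or overflow. */
int split_connectors(const char *const *conn, size_t n, split_disjunct *out)
{
    out->nleft = out->nright = 0;
    for (size_t i = 0; i < n; i++) {
        size_t len = strlen(conn[i]);
        if (len < 2) return -1;  /* need a name plus a direction mark */
        char dir = conn[i][len - 1];
        if (dir == '-' && out->nleft < MAX_CONN)
            out->left[out->nleft++] = conn[i];
        else if (dir == '+' && out->nright < MAX_CONN)
            out->right[out->nright++] = conn[i];
        else
            return -1;
    }
    return 0;
}
```

For the first disjunct in the example, `["A-", "B+", "C+"]`, this gives one left connector and two right connectors, with the original order preserved within each direction.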