link-grammar
new aspell breaks python unit test
I have this:
$ aspell --version
@(#) International Ispell Version 3.1.20 (but really Aspell 0.60.8)
and get this result:
link-grammar: Info: Library version link-grammar-5.10.1. Enter "!help" for help.
linkparser> the seasand lakes are hot
Found 50 linkages (50 had no P.P. violations)
Linkage 1, cost vector = (UNUSED=0 DIS= 0.05 LEN=11)
+--------------->WV-------------->+
+----------->Wd------------+ |
| +--------Dmc--------+ |
| | +-----A----+--Spx-+--Pa-+
| | | | | |
LEFT-WALL the seasonal[~].a lakes.n are.v hot.a
Press RETURN for the next linkage.
linkparser>
Linkage 2, cost vector = (UNUSED=0 DIS= 0.05 LEN=13)
+------------------>WV------------------>+
+---------->Wd-----------+------Spx------+
| +--Dmc-+<--SJlp<--+->SJrp->+ +--Pa-+
| | | | | | |
LEFT-WALL the seas[&].n and[&].j-n lakes.n are.v hot.a
The Python unit test parses-pos-spell-en.txt expects the second parse and fails when the first one arrives.
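For context, a minimal sketch of what the test amounts to, using the stock linkgrammar Python bindings (`Dictionary`, `ParseOptions`, `Sentence`; the `spell_guess` option maps to `parse_options_set_spell_guess()`). The assertion is illustrative, not the actual tests.py logic:

```python
# Hedged sketch: reproduce the order-dependent check with the Python
# bindings. The assertion below is illustrative only; the real test
# compares full parse output from parses-pos-spell-en.txt.
from linkgrammar import Dictionary, ParseOptions, Sentence

po = ParseOptions(spell_guess=7)                  # enable spell-guessing
sent = Sentence("the seasand lakes are hot", Dictionary("en"), po)
linkage = next(iter(sent.parse()))                # first (lowest-cost) linkage

words = list(linkage.words())
print(words)
# The test effectively assumes the run-on split wins, i.e. that
# "seas[&].n" and "and[&].j-n" appear here; with Aspell 0.60.8 the
# "seasonal" guess wins instead, so the comparison fails.
assert any(w.startswith("seas[&]") for w in words)
```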
I also use this exact aspell version on my system, and it also returns "seasoned" first:
$ aspell -a
@(#) International Ispell Version 3.1.20 (but really Aspell 0.60.8)
seasand
& seasand 34 0: seasoned, sea sand, sea-sand, seas and, seas-and, seasons, season, Zealand, peasant, sealant, seasonal, Sand, sand, season's, send, Seaside, sassed, seaside, Susan, stand, reascend, reasoned, seesawed, sensed, Susana, ceased, resend, savant, secant, second, pheasant, ceasing, servant, Susan's
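The suggestion order can be checked programmatically the same way LG's spell-guess layer sees it, through the Ispell `-a` pipe protocol shown above. A small sketch (the `--lang=en` flag and the `&`-line format are standard aspell pipe-mode behavior):

```python
# Sketch: ask aspell for its suggestion list via the "-a" pipe mode.
# A suggestion line looks like: "& seasand 34 0: seasoned, sea sand, ..."
import subprocess

def aspell_suggest(word, lang="en"):
    out = subprocess.run(["aspell", "-a", "--lang=" + lang],
                         input=word + "\n", capture_output=True,
                         text=True, check=True).stdout
    for line in out.splitlines():
        if line.startswith("&"):              # misspelled: has suggestions
            return [s.strip() for s in line.split(":", 1)[1].split(",")]
    return []                                 # "*" line: word is correct

print(aspell_suggest("seasand"))
# Aspell 0.60.8 puts "seasoned" first; the run-on splits ("sea sand",
# "seas and", ...) come later, which is what changed the test outcome.
```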
However, using the latest master (d57493eed849eccd2c7ff6220d5997b24c2c7766) I get:
link-grammar: Info: Dictionary found at ./data/en/4.0.dict
link-grammar: Info: Dictionary version 5.10.0, locale en_US.UTF-8
link-grammar: Info: Library version link-grammar-5.10.1. Enter "!help" for help.
linkparser> the seasand lakes are hot
Found 26 linkages (26 had no P.P. violations)
Linkage 1, cost vector = (UNUSED=0 DIS= 0.05 LEN=13)
+------------------>WV------------------>+
+---------->Wd-----------+------Spx------+
| +--Dmc-+<--SJlp<--+->SJrp->+ +--Pa-+
| | | | | | |
LEFT-WALL the seas[&].n and[&].j-n lakes.n are.v hot.a
Press RETURN for the next linkage.
linkparser>
Linkage 2, cost vector = (UNUSED=0 DIS= 0.15 LEN=16)
+------------------>WV------------------>+
+--------------->Wd---------------+ |
| +------------Dmc-----------+ |
| | +---------AN--------+ |
| | | +----AN---+--Spx-+--Pa-+
| | | | | | |
LEFT-WALL the sea[&].s sand[&].n-u lakes.n are.v hot.a
Press RETURN for the next linkage.
linkparser>
Linkage 3, cost vector = (UNUSED=0 DIS= 0.15 LEN=16)
+------------------->WV------------------->+
+---------------->Wd----------------+ |
| +-------------Dmc------------+ |
| | +---------AN---------+ |
| | | +----AN---+--Spx-+--Pa-+
| | | | | | |
LEFT-WALL the sea[&].n-u sand[&].n-u lakes.n are.v hot.a
Press RETURN for the next linkage.
linkparser>
Linkage 4, cost vector = (UNUSED=0 DIS= 0.25 LEN=15)
+------------------>WV------------------>+
+--------------->Wd---------------+ |
| +------------Dmc-----------+ |
| | +----AN---+----AN---+--Spx-+--Pa-+
| | | | | | |
LEFT-WALL the sea[&].s sand[&].n-u lakes.n are.v hot.a
Press RETURN for the next linkage.
linkparser>
Linkage 5, cost vector = (UNUSED=0 DIS= 0.25 LEN=15)
+------------------->WV------------------->+
+---------------->Wd----------------+ |
| +-------------Dmc------------+ |
| | +----AN----+----AN---+--Spx-+--Pa-+
| | | | | | |
LEFT-WALL the sea[&].n-u sand[&].n-u lakes.n are.v hot.a
Press RETURN for the next linkage.
linkparser>
Linkage 6, cost vector = (UNUSED=0 DIS= 0.55 LEN=11)
+---------------->WV--------------->+
+------------>Wd-------------+ |
| +---------Dmc---------+ |
| | +-----A-----+--Spx-+--Pa-+
| | | | | |
LEFT-WALL the seasoned[~].v-d lakes.n are.v hot.a
[...]
Note that the number of linkages and the metrics are different. I don't understand why. Any idea?
That's "seasonal" with an L ...
As you can see, on my system aspell also returns "seasonal".
Using `link-parser -v=9 -debug=guess_misspelled_word` reveals it is not being used (on my system):
linkparser> the seasand lakes are hot
Trace: guess_misspelled_word: spellcheck_suggest for seasand:
- sea sand
- sea-sand
- seas and
- seas-and
- seasoned
#### Finished tokenizing (8 tokens)
But it also reveals that an assumption in the code is not correct (it assumes the run-on splits come first), so the result may depend on the suggestion order, which may differ between dictionaries (`aspell dump dicts`).
Another bug (not related to this issue) is that increasing the spell_guess value (`!spell=N`) has no effect here (as can be seen from `link-parser -v=9 -debug=guess_misspelled_word spell=999` with this sentence).
I will fix these bugs.
I get this:
$ link-parser -v=9 -debug=guess_misspelled_word
verbosity set to 9
debug set to guess_misspelled_word
link-grammar: Info: Dictionary found at ./data/en/4.0.dict
link-grammar: Info: Dictionary version 5.10.0, locale en_US.UTF-8
link-grammar: Info: Library version link-grammar-5.10.1. Enter "!help" for help.
linkparser> the seasand lakes are hot
Trace: guess_misspelled_word: spellcheck_suggest for seasand:
- seasoned
- sea sand
- sea-sand
- seas and
- seas-and
- seasons
- season
- Zealand
- peasant
- sealant
- seasonal
- Sand
- sand
- season's
- send
- sassed
- seaside
- Susan
- stand
- reascend
- reasoned
- seesawed
- sensed
- Susana
- ceased
- resend
- savant
- secant
- second
- pheasant
- ceasing
- servant
- Susan's
#### Finished tokenizing (8 tokens)
...
Found 50 linkages (50 had no P.P. violations)
So you got five spelling alternates; I got something like 30, by default.
and FWIW:
linkparser> !help spell
spell (integer) - Up to this many spell-guesses per unknown word
Current value: 7
Default value: 7
and neither of us is getting 7 alternates.
My intention with spell-correction was that it would consider all the run-on splits and the first N single-word corrections (`!spell=N`).
This is based on the assumption that the spell-correction library gives the best matches first.
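An order-independent way to implement that intent might look like the sketch below (the function name and the space-based run-on test are my illustration, not the library's actual internals):

```python
# Sketch: keep every run-on split, plus the first N single-word
# corrections, regardless of where the speller put them in its list.
def select_guesses(suggestions, n):
    # Assumption: a suggestion containing a space is a run-on split;
    # hyphenated forms are treated as single-word corrections here.
    splits = [s for s in suggestions if " " in s]
    singles = [s for s in suggestions if " " not in s]
    return splits + singles[:n]

aspell_order = ["seasoned", "sea sand", "sea-sand", "seas and",
                "seas-and", "seasons", "season", "Zealand"]
print(select_guesses(aspell_order, 2))
# ['sea sand', 'seas and', 'seasoned', 'sea-sand'] -- the run-on splits
# survive even though Aspell 0.60.8 no longer lists them first.
```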
BTW, my alternative tokenizer idea can find the run-on splits by itself, because it is able to split words according to the tokens in the dict (for morphology splits) using a modified dictionary lookup method (but I have never completed it).
BTW, I'm checking an idea that could improve spell correction and also solve other current problems: use all the alternative words and inspect the parse tree (the one that links are extracted from) to find out which parses to use and which ones to skip. Such a method could also be used to overcome combinatorial explosions.
> This is based on the assumption that the spell-correction library gives the best matches first.

I don't think that's right. I think it provides the matches with the fewest changes first. These are sometimes terrible fixes.
> combinatorial explosions
The combinatorial explosions that bother me the most, these days, are from the auto-generated dictionaries. These are unusual in several ways (depending on which style I work with). In one case, there are tens of thousands of link types (or more) and some common words have a hundred thousand disjuncts (or more). (Basically, these dicts have one link type per word pair.) Except I don't quite remember whether this dict ran well or not.
The other kind of dict had a dozen different `<UNKNOWN-WORD>.x` constructions, each having a huge number of disjuncts. The regex for `UNKNOWN-WORD` matches anything, so this led to a huge combinatoric explosion. This is almost the same combinatoric explosion as for the generator: although the words are fixed, there are just vast numbers of possible linkages.
> > This is based on the assumption that the spell-correction library gives the best matches first.
>
> I don't think that's right. I think it provides the matches with the fewest changes first. These are sometimes terrible fixes.

By best I meant more similar to the given word, not more appropriate grammatically. I think this has a good chance of matching the intended word when using a "sufficient" number of first suggestions. If you use all the suggestions you may approach a combinatorial explosion and also get many "garbage" sentences that use words very far from the original one, which I think is worse.
My idea for preventing combinatorial explosion is related to the desirability of providing the lowest-metric linkages first. When there is a combinatorial explosion, the current selection algo usually fails (by far) to find the lowest metric. The idea is to disregard paths in the parse-info graph that lead to high metrics. The current approach is to do it before the linkage, e.g. by using a lower disjunct cost and not generating low-chance alternatives. But there are several ways it can be done after the linkage (like omitting higher-cost disjuncts, selecting a lower-metric path from several parallel paths, and other ways).
When generating, even if you don't have disjunct costs you still have the link-length metric, and maybe other metrics can be invented.
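As a rough illustration of the after-the-linkage direction, here is a generic beam-style pruning sketch. The flat "slots of alternatives" model is a toy stand-in for LG's `extractor_s` parse set, where choices are not independent, so on a real parse graph this is a heuristic rather than an exact method:

```python
# Sketch: extract the k cheapest complete selections from a sequence of
# independent choice slots, discarding high-metric partial paths early
# instead of enumerating the full cross-product.
import heapq

def cheapest_linkages(slots, k):
    beam = [(0.0, [])]                       # (total cost, choices so far)
    for alternatives in slots:               # each slot: [(cost, choice), ...]
        beam = heapq.nsmallest(k, ((c + ac, path + [a])
                                   for c, path in beam
                                   for ac, a in alternatives))
    return beam

slots = [[(0.0, "seas and"), (0.05, "seasonal"), (0.55, "seasoned")],
         [(0.0, "Spx"), (0.10, "Sp*i")]]
print(cheapest_linkages(slots, 2))           # two lowest-metric linkages
```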
> But there are several ways it can be done after the linkage
Go for it! Combinatorial explosions are a generically difficult problem in AI. Solving it well in the LG context is a very useful thing to do. So I like that idea!
> linkparser> the seasand lakes are hot
> Trace: guess_misspelled_word: spellcheck_suggest for seasand:
> - sea sand
> - sea-sand
> - seas and
> - seas-and
> - seasoned
It turned out my library used Hunspell and not Aspell... In any case, my fix works for both and is hopefully now independent of the correction-list order, provided it includes the tested run-on split. Tests that check single-word corrections directly in the main tests.py code may also need a similar fix, but due to the particular tested misspelling there is a reduced chance of a problem.
> The combinatorial explosions that bother me the most, these days, are from the auto-generated dictionaries. These are unusual in several ways (depending on which style I work with). In one case, there are tens of thousands of link types (or more) and some common words have a hundred thousand disjuncts (or more). (Basically, these dicts have one link type per word pair.) Except I don't quite remember whether this dict ran well or not.
`build_disjuncts()` doesn't cope well with expressions that generate a very big number of disjuncts. I planned a total overhaul of its code. But I have a fix that reduces the computational complexity of a bottleneck, and maybe it is better not to wait for the total overhaul (so I will try to send a PR ASAP).
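I can't speak for the actual fix, but the shape of the problem is easy to show: expanding an and/or connector expression into disjuncts is a cross-product, and memoizing shared sub-expressions is one standard way to tame it. A hedged sketch (toy expression encoding, not LG's `Exp` structures):

```python
# Sketch: expression -> set of disjuncts, with memoization of shared
# sub-expressions. Each disjunct is a tuple of connector names.
from functools import lru_cache

# Expression encoding: ("conn", name) | ("and", subexprs) | ("or", subexprs)
@lru_cache(maxsize=None)
def disjuncts(expr):
    tag, body = expr
    if tag == "conn":
        return frozenset({(body,)})
    if tag == "or":                          # union of the alternatives
        return frozenset().union(*(disjuncts(e) for e in body))
    result = {()}                            # "and": cross-product
    for e in body:
        result = {d1 + d2 for d1 in result for d2 in disjuncts(e)}
    return frozenset(result)

e = ("and", (("or", (("conn", "A+"), ("conn", "B+"))),
             ("or", (("conn", "C-"), ("conn", "D-")))))
print(sorted(disjuncts(e)))                  # 2x2 = 4 disjuncts
```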
This issue contains several ideas that may be separated into new issues (or maybe discussions?):
1. Reduce combinatorial explosion by selective extraction from the parse set (`struct extractor_s`).
2. Handle the spell-guess problem caused by pre-selecting a limited number of potential guesses, by instead selecting a limited number when generating the linkages (see (1)).
3. Reduce the computational and memory complexity of `build_disjuncts()`.