language-learning
Explore ILE grammars not being fully used by LG
Two problems:
- Inconsistent rounding: http://langlearn.singularitynet.io/data/aglushchenko_parses/GCB-NQ-dILEd-MWC-MSL-2019-04-24/GCB-NQ-dILEd-MWC-MSL-summary.txt contains the row "0 2 99.60% 1.00". F1 needs to be rounded to 4 decimal places so that the rounding is consistent.
- Some words in the grammar are not parsed in 2-word sentences: http://langlearn.singularitynet.io/data/aglushchenko_parses/GCB-NQ-dILEd-MWC-MSL-2019-04-24/parses/max-sentence-len:2_min-word-count:0/GC_LGEnglish_noQuotes_fullyParsed.ull.ull
Not parsed:
[seek-seek] [chuckled] [.]
[shawl-straps] [=] [.]
Digging into grammar: http://langlearn.singularitynet.io/data/aglushchenko_parses/GCB-NQ-dILEd-MWC-MSL-2019-04-24/parses/dict_19432C_2019-04-24_0007.4.0.dict
akolonin@Ubuntu-1604-xenial-64-minimal:~/public/aglushchenko_parses/GCB-NQ-dILEd-MWC-MSL-2019-04-24/parses/max-sentence-len:2_min-word-count:0$ grep '-' GC_LGEnglish_noQuotes_fullyParsed.ull.ull
[seek-seek] [chuckled] [.]
[shawl-straps] [=] [.]
rikki-tikki listened [.]
1 rikki-tikki 2 listened
akolonin@Ubuntu-1604-xenial-64-minimal:~/public/aglushchenko_parses/GCB-NQ-dILEd-MWC-MSL-2019-04-24/parses/max-sentence-len:2_min-word-count:0$ grep 'rikki-tikki' dict_19432C_2019-04-24_0007/4.0.dict
"rikki-tikki":
akolonin@Ubuntu-1604-xenial-64-minimal:~/public/aglushchenko_parses/GCB-NQ-dILEd-MWC-MSL-2019-04-24/parses/max-sentence-len:2_min-word-count:0$ grep 'shawl-straps' dict_19432C_2019-04-24_0007/4.0.dict
"shawl-straps":
akolonin@Ubuntu-1604-xenial-64-minimal:~/public/aglushchenko_parses/GCB-NQ-dILEd-MWC-MSL-2019-04-24/parses/max-sentence-len:2_min-word-count:0$ grep 'seek-seek' dict_19432C_2019-04-24_0007/4.0.dict
"seek-seek":
- Four decimal places rounding is done (PR #213)
- "=" is missing from the produced dictionary. Although "seek-seek" and "chuckled" are both in the dictionary, there is no disjunct that links them so that they could be recognized together in the sentence mentioned above. When I try to subset the grammar rules for this sentence with the dict-transformer script, I get an empty rule set.
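For the rounding item above, a minimal sketch of fixed-precision formatting, assuming the metric is held as a float. The function name `format_metric` is hypothetical and this is not the code from PR #213:

```python
def format_metric(value: float, places: int = 4) -> str:
    """Format a metric with a fixed number of digits after the decimal point.

    Hypothetical helper: always emits `places` digits, so 1.0 and 0.996
    are written with the same width in a summary file.
    """
    return f"{value:.{places}f}"


if __name__ == "__main__":
    for f1 in (1.0, 0.996):
        print(format_metric(f1))
```

Using one helper for every summary column that holds F1 avoids mixing values such as "1.00" and "0.9960" in the same file.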
@OlegBaskov, can the following be an issue for separate exploration? "Although "seek-seek" and "chuckled" are both in the dictionary, there is no disjunct that links them so that they could be recognized together in the sentence mentioned above. When I try to subset the grammar rules for this sentence with the dict-transformer script, I get an empty rule set."
Disjunct AZDAAEYV:
% AZDA
"seek-seek":
(AAXYAZDA- & AZDAABBO+ & AZDABCPE+) or (ABOYAZDA-) or (ABSEAZDA- & AZDABEYJ+) or (AECMAZDA-) or (AZDAAAMM+) or (AZDAAEYV+)
or (BFFVAZDA- & AZDAANAL+);
% AEYV
"chuckled":
(AAUNAEYV-) or (AAUNAEYV- & AEYVAPEB+) or (AAWPAEYV- & ANYHAEYV- & AEYVAAWZ+) or (ABBOAEYV-) or (ACQSAEYV- & AEYVABAL+) or (ACXMAEYV-) or (ACXMAEYV- & ABBOAEYV- & AEYVBDAS+) or (ACXMAEYV- & AEYVAFXT+) or (ACXMAEYV- & AEYVAOKU+) or (ADLCAEYV-) or (AEDTAEYV-) or (AEDTAEYV- & ADZQAEYV- & AEYVAAQM+) or (AEWHAEYV- & ASXVAEYV- & AEYVBDAS+) or (AEYSAEYV-) or (AEYVAAQM+ & AEYVABBO+) or (AEYVABBO+) or (AEYVBDAS+ & AEYVABBO+) or (AGAZAEYV- & AEYVBDAS+) or (AGPQAEYV-) or (AGXXAEYV-) or (AGXXAEYV- & AEYVAAQM+) or (AJGKAEYV-) or (AMFPAEYV-) or (AMFPAEYV- & AEYVAJGK+) or (AMFPAEYV- & AEYVBEFP+) or (AMLJAEYV- & AEYVAUHC+) or (ANDEAEYV-) or (ANDEAEYV- & AEYVAAQM+) or (ANOTAEYV-) or (ANVFAEYV-) or (ANYHAEYV-) or (ANYHAEYV- & ABBOAEYV-) or (ANYHAEYV- & ABBOAEYV- & AEYVAAQM+ & AEYVABSE+ & AEYVASYE+) or (ANYHAEYV- & ABBOAEYV- & AEYVBDAS+) or (ANYHAEYV- & AEYVAAQM+) or (ANYHAEYV- & AEYVAJKN+) or (ANYHAEYV- & AEYVAOJC+) or (ANYHAEYV- & AEYVBAOH+) or (ANYHAEYV- & AEYVBDAS+ & AEYVABOY+) or (ANYHAEYV- & BAMUAEYV- & AEYVAAQM+) or (ANYHAEYV- & BAMUAEYV- & AEYVABOY+) or (APXMAEYV-) or (APYUAEYV-) or (APZYAEYV-) or (AQECAEYV-) or (AQPTAEYV-) or (ARCAAEYV-) or (ARTSAEYV- & AEYVBDAS+ & AEYVAUHC+) or (ARUCAEYV-) or (ASPVAEYV- & AEYVBCPO+) or (ATHFAEYV- & AEYVAPEB+) or (AUYLAEYV- & AEYVALHH+) or (AVENAEYV-) or (AVENAEYV- & AEYVBDAS+) or (AVVSAEYV- & AEYVAAQM+) or (AWRFAEYV-) or (AXGRAEYV- & AEYVABSE+) or (AYKGAEYV- & AEYVALHH+) or (AYQEAEYV-) or (AYQEAEYV- & AEYVANTT+) or (AYQEAEYV- & AEYVANUF+) or (AZDAAEYV-)
or (AZNEAEYV- & AEYVARCH+ & AEYVAAEF+) or (BACBAEYV-) or (BCSXAEYV- & AEYVAAQM+) or (BDAXAEYV-) or (BDAXAEYV- & AEYVAAQM+) or (BDAXAEYV- & AEYVAHDV+) or (BFJTAEYV-);
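The match between the two entries above can be checked mechanically. The sketch below is a simplification and not how link-parser itself works: it only looks for a single-connector disjunct "X+" on the left word whose counterpart "X-" is a single-connector disjunct on the right word, which is enough for a two-word sentence. The function names are hypothetical, the rules are abbreviated copies of the entries above, and the type hints assume Python 3.9+.

```python
import re


def parse_disjuncts(rule: str) -> list[frozenset[str]]:
    """Split an 'or'-separated rule into disjuncts (sets of connectors)."""
    disjuncts = []
    for part in re.split(r"\bor\b", rule.rstrip().rstrip(";")):
        connectors = re.findall(r"[A-Z]+[+-]", part)
        if connectors:
            disjuncts.append(frozenset(connectors))
    return disjuncts


def linking_connectors(left_rule: str, right_rule: str) -> list[str]:
    """Return link names X where left has disjunct (X+) and right has (X-)."""
    right = set(parse_disjuncts(right_rule))
    names = []
    for disjunct in parse_disjuncts(left_rule):
        if len(disjunct) != 1:
            continue  # only single-connector disjuncts can span two words
        (connector,) = disjunct
        name, direction = connector[:-1], connector[-1]
        if direction == "+" and frozenset({name + "-"}) in right:
            names.append(name)
    return names


# Abbreviated rules taken from the dictionary entries above.
SEEK_SEEK = "(ABOYAZDA-) or (AZDAAAMM+) or (AZDAAEYV+);"
CHUCKLED = "(AAUNAEYV-) or (AAUNAEYV- & AEYVAPEB+) or (AZDAAEYV-);"

if __name__ == "__main__":
    print(linking_connectors(SEEK_SEEK, CHUCKLED))
```

On the abbreviated rules this reports AZDAAEYV, the disjunct named above, which suggests the dictionary does contain a rule that can link the two lowercased words.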
http://langlearn.singularitynet.io/data/aglushchenko_parses/GCB-NQ-dILEd-MWC-MSL-2019-04-24/parses/max-sentence-len:0_min-word-count:1/dict_19432C_2019-04-24_0007/4.0.dict
@alexei-gl - please find out why A) "rule-stripper" misses this rule, and fix it if needed; B) LG Parser misses this rule, and submit an issue with all diagnostic files to Amir if needed.
Regarding item two at the very beginning of the issue: the sentences mentioned above are not recognized by the induced grammar because at least one word in each sentence starts with an uppercase letter in the input corpus, while the induced dictionary rules are case-sensitive and contain those words in lowercase. For example, here is the original corpus sentence:
Seek-Seek chuckled .
If the exact sentence is manually fed to link-parser, the returned parse looks like:
[Seek-Seek] [chuckled] [.]
If the lowercased sentence is fed, the returned parse is:
    +-AZDAAEYV+
    |         |
seek-seek chuckled [.]
As for the resulting .ull file: with the specified settings, the output sentence is converted to lowercase before being written to the file. That is why the resulting .ull file contains the lowercased form:
[seek-seek] [chuckled] [.]
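The case mismatch described above can be illustrated with a small sketch. It assumes whitespace-tokenized input and a case-sensitive set of dictionary words. Both function names are hypothetical and are not part of the pipeline or of link-parser:

```python
def unknown_tokens(sentence: str, dictionary_words: set[str]) -> list[str]:
    """Return tokens with no exact (case-sensitive) dictionary entry."""
    return [token for token in sentence.split() if token not in dictionary_words]


def lowercase_input(sentence: str) -> str:
    """Lowercase a sentence before parsing with an all-lowercase grammar."""
    return sentence.lower()


# Illustrative subset of words; the real dictionary is much larger.
WORDS = {"seek-seek", "chuckled", "."}

if __name__ == "__main__":
    original = "Seek-Seek chuckled ."
    print(unknown_tokens(original, WORDS))
    print(unknown_tokens(lowercase_input(original), WORDS))
```

With the original casing, "Seek-Seek" has no dictionary entry, while the lowercased sentence is fully covered, matching the two link-parser outputs shown above.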
@alexei-gl - can you check whether converting the GT input to lowercase (when using an induced grammar) improves the results by more than 1%? Based on what you have for full GC, GL on LG-English, max_unparsed_words=99, MWC=1 (GL/GT), get the difference and review the produced parses with diff so we can decide how to handle that.
LG dictionary library was fixed in PR #250. Results for lowercase input are here: https://docs.google.com/spreadsheets/d/1o-4acGPxkMIS6-xJDxwjqDWAwIt8qx2xPemu14IZRaU/edit?pli=1#gid=1580443221 (see rows 90-99).