
Explore ILE grammars not being fully used by LG

Open akolonin opened this issue 5 years ago • 7 comments

Two problems:

  1. Inconsistent rounding: http://langlearn.singularitynet.io/data/aglushchenko_parses/GCB-NQ-dILEd-MWC-MSL-2019-04-24/GCB-NQ-dILEd-MWC-MSL-summary.txt shows "0 2 99.60% 1.00". F1 needs to be rounded to 4 decimal places so that the rounding is consistent.

  2. Some words in the grammar are not parsed in 2-word sentences: http://langlearn.singularitynet.io/data/aglushchenko_parses/GCB-NQ-dILEd-MWC-MSL-2019-04-24/parses/max-sentence-len:2_min-word-count:0/GC_LGEnglish_noQuotes_fullyParsed.ull.ull

Not parsed:
[seek-seek] [chuckled] [.]
[shawl-straps] [=] [.]

Digging into grammar: http://langlearn.singularitynet.io/data/aglushchenko_parses/GCB-NQ-dILEd-MWC-MSL-2019-04-24/parses/dict_19432C_2019-04-24_0007.4.0.dict

akolonin@Ubuntu-1604-xenial-64-minimal:~/public/aglushchenko_parses/GCB-NQ-dILEd-MWC-MSL-2019-04-24/parses/max-sentence-len:2_min-word-count:0$ grep '-' GC_LGEnglish_noQuotes_fullyParsed.ull.ull
[seek-seek] [chuckled] [.] 
[shawl-straps] [=] [.] 
rikki-tikki listened [.] 
1 rikki-tikki 2 listened
akolonin@Ubuntu-1604-xenial-64-minimal:~/public/aglushchenko_parses/GCB-NQ-dILEd-MWC-MSL-2019-04-24/parses/max-sentence-len:2_min-word-count:0$ grep 'rikki-tikki' dict_19432C_2019-04-24_0007/4.0.dict
"rikki-tikki":
akolonin@Ubuntu-1604-xenial-64-minimal:~/public/aglushchenko_parses/GCB-NQ-dILEd-MWC-MSL-2019-04-24/parses/max-sentence-len:2_min-word-count:0$ grep 'shawl-straps' dict_19432C_2019-04-24_0007/4.0.dict
"shawl-straps":
akolonin@Ubuntu-1604-xenial-64-minimal:~/public/aglushchenko_parses/GCB-NQ-dILEd-MWC-MSL-2019-04-24/parses/max-sentence-len:2_min-word-count:0$ grep 'seek-seek' dict_19432C_2019-04-24_0007/4.0.dict
"seek-seek":

akolonin, Apr 25 '19 06:04

  1. Rounding to four decimal places is done (PR #213).
  2. "=" is missing from the produced dictionary. Although "seek-seek" and "chuckled" are both in the dictionary, there is no disjunct that unifies them so that they could be recognized together in the above-mentioned sentence. When I try to subset the grammar rules for that sentence with the dict-transformer script, I get an empty rule set.
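As a small illustration of the rounding in item 1, the sketch below formats a score with a fixed number of decimal places. The helper name `format_f1` is hypothetical and is not taken from PR #213; it only shows one common way to get consistent output in Python.

```python
def format_f1(value, places=4):
    """Format a score with a fixed number of digits after the decimal point.

    Hypothetical helper for illustration; string formatting (unlike round())
    keeps trailing zeros, so every value has the same width.
    """
    return f"{value:.{places}f}"


print(format_f1(1.0))      # 1.0000
print(format_f1(0.99596))  # 0.9960
```

Note that `round(1.0, 4)` still prints as `1.0`, which is why the sketch formats the string instead of rounding the number.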

alexei-gl, Apr 30 '19 06:04

@OlegBaskov, can the following be an issue for separate exploration? "Although "seek-seek" and "chuckled" are there in the dictionary, there is no disjunct to unify them the way they could be recognized together in the above mentioned sentence. When trying to subset grammar rules for the certain sentence with the help of dict-transformer script I get empty rule set."

akolonin, May 05 '19 16:05

Disjunct: AZDAAEYV:

% AZDA
"seek-seek":
(AAXYAZDA- & AZDAABBO+ & AZDABCPE+) or (ABOYAZDA-) or (ABSEAZDA- & AZDABEYJ+) or (AECMAZDA-) or (AZDAAAMM+) or (AZDAAEYV+) or (BFFVAZDA- & AZDAANAL+);

% AEYV
"chuckled":
(AAUNAEYV-) or (AAUNAEYV- & AEYVAPEB+) or (AAWPAEYV- & ANYHAEYV- & AEYVAAWZ+) or (ABBOAEYV-) or (ACQSAEYV- & AEYVABAL+) or (ACXMAEYV-) or (ACXMAEYV- & ABBOAEYV- & AEYVBDAS+) or (ACXMAEYV- & AEYVAFXT+) or (ACXMAEYV- & AEYVAOKU+) or (ADLCAEYV-) or (AEDTAEYV-) or (AEDTAEYV- & ADZQAEYV- & AEYVAAQM+) or (AEWHAEYV- & ASXVAEYV- & AEYVBDAS+) or (AEYSAEYV-) or (AEYVAAQM+ & AEYVABBO+) or (AEYVABBO+) or (AEYVBDAS+ & AEYVABBO+) or (AGAZAEYV- & AEYVBDAS+) or (AGPQAEYV-) or (AGXXAEYV-) or (AGXXAEYV- & AEYVAAQM+) or (AJGKAEYV-) or (AMFPAEYV-) or (AMFPAEYV- & AEYVAJGK+) or (AMFPAEYV- & AEYVBEFP+) or (AMLJAEYV- & AEYVAUHC+) or (ANDEAEYV-) or (ANDEAEYV- & AEYVAAQM+) or (ANOTAEYV-) or (ANVFAEYV-) or (ANYHAEYV-) or (ANYHAEYV- & ABBOAEYV-) or (ANYHAEYV- & ABBOAEYV- & AEYVAAQM+ & AEYVABSE+ & AEYVASYE+) or (ANYHAEYV- & ABBOAEYV- & AEYVBDAS+) or (ANYHAEYV- & AEYVAAQM+) or (ANYHAEYV- & AEYVAJKN+) or (ANYHAEYV- & AEYVAOJC+) or (ANYHAEYV- & AEYVBAOH+) or (ANYHAEYV- & AEYVBDAS+ & AEYVABOY+) or (ANYHAEYV- & BAMUAEYV- & AEYVAAQM+) or (ANYHAEYV- & BAMUAEYV- & AEYVABOY+) or (APXMAEYV-) or (APYUAEYV-) or (APZYAEYV-) or (AQECAEYV-) or (AQPTAEYV-) or (ARCAAEYV-) or (ARTSAEYV- & AEYVBDAS+ & AEYVAUHC+) or (ARUCAEYV-) or (ASPVAEYV- & AEYVBCPO+) or (ATHFAEYV- & AEYVAPEB+) or (AUYLAEYV- & AEYVALHH+) or (AVENAEYV-) or (AVENAEYV- & AEYVBDAS+) or (AVVSAEYV- & AEYVAAQM+) or (AWRFAEYV-) or (AXGRAEYV- & AEYVABSE+) or (AYKGAEYV- & AEYVALHH+) or (AYQEAEYV-) or (AYQEAEYV- & AEYVANTT+) or (AYQEAEYV- & AEYVANUF+) or (AZDAAEYV-) or (AZNEAEYV- & AEYVARCH+ & AEYVAAEF+) or (BACBAEYV-) or (BCSXAEYV- & AEYVAAQM+) or (BDAXAEYV-) or (BDAXAEYV- & AEYVAAQM+) or (BDAXAEYV- & AEYVAHDV+) or (BFJTAEYV-);

http://langlearn.singularitynet.io/data/aglushchenko_parses/GCB-NQ-dILEd-MWC-MSL-2019-04-24/parses/max-sentence-len:0_min-word-count:1/dict_19432C_2019-04-24_0007/4.0.dict
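The check behind the two rules above can be sketched in Python. This is a simplified illustration and not how link-parser or the dict-transformer script works: it only looks for a disjunct on the left word that consists of a single `X+` connector and a disjunct on the right word that consists of the matching single `X-` connector, which is what a two-word link needs. The function names are hypothetical, connector names are assumed to be uppercase letters only, and the "chuckled" rule is a shortened excerpt of the full rule.

```python
import re

# A connector is an uppercase name followed by its direction, e.g. "AZDAAEYV+".
CONNECTOR = re.compile(r"([A-Z]+)([+-])")


def single_connector_disjuncts(rule, direction):
    """Names of disjuncts in `rule` made of exactly one connector with `direction`."""
    names = set()
    for disjunct in rule.rstrip(";").split(" or "):
        found = CONNECTOR.findall(disjunct)
        if len(found) == 1 and found[0][1] == direction:
            names.add(found[0][0])
    return names


def two_word_links(left_rule, right_rule):
    """Connectors that could link a left word directly to a right word."""
    return (single_connector_disjuncts(left_rule, "+")
            & single_connector_disjuncts(right_rule, "-"))


seek_rule = ("(AAXYAZDA- & AZDAABBO+ & AZDABCPE+) or (ABOYAZDA-) or "
             "(ABSEAZDA- & AZDABEYJ+) or (AECMAZDA-) or (AZDAAAMM+) or "
             "(AZDAAEYV+) or (BFFVAZDA- & AZDAANAL+);")
# Excerpt only: four of the disjuncts listed for "chuckled".
chuckled_rule = "(AAUNAEYV-) or (AAUNAEYV- & AEYVAPEB+) or (AZDAAEYV-) or (BFJTAEYV-);"

print(two_word_links(seek_rule, chuckled_rule))  # {'AZDAAEYV'}
```

Under these assumptions the only shared connector is AZDAAEYV, the disjunct named at the top of this comment.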

OlegBaskov, May 06 '19 05:05

@alexei-gl - please find out why A) the "rule-stripper" misses this rule, and fix it if needed; B) the LG parser misses this rule, and submit an issue with all diagnostic files to Amir if needed.

akolonin, May 06 '19 08:05

Regarding item two at the very beginning of the issue: the sentences mentioned above are not recognized by the induced grammar because at least one word in each sentence starts with an uppercase letter in the input corpus, while the induced dictionary rules are case-sensitive and contain those words in lowercase. For example, here is the original corpus sentence:

Seek-Seek chuckled .

If the exact sentence is manually fed to link-parser, the returned parse looks like:

[Seek-Seek] [chuckled] [.]

If the lowercased sentence is fed, the returned parse is:

    +-AZDAAEYV+
    |         |
seek-seek chuckled [.]

As for the resulting .ull file: with the specified settings, the output sentence is converted to lowercase before being written to the file. That is why the resulting .ull file contains the lowercased form:

[seek-seek] [chuckled] [.]
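The effect of the case-sensitive lookup can be illustrated with a minimal Python sketch. It does not call link-parser; the `dictionary` set is an illustrative stand-in containing only the two words from this sentence, and `unknown_words` is a hypothetical helper.

```python
def unknown_words(tokens, dictionary, lowercase=False):
    """Return the tokens that have no entry in a case-sensitive dictionary."""
    if lowercase:
        tokens = [token.lower() for token in tokens]
    return [token for token in tokens if token not in dictionary]


# Illustrative stand-in for the induced dictionary, which stores lowercased words.
dictionary = {"seek-seek", "chuckled"}
tokens = "Seek-Seek chuckled .".split()

print(unknown_words(tokens, dictionary))                  # ['Seek-Seek', '.']
print(unknown_words(tokens, dictionary, lowercase=True))  # ['.']
```

With the original casing, "Seek-Seek" has no entry, so it cannot be linked to "chuckled"; after lowercasing, only "." remains unknown, which matches the second parse above.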

alexei-gl, Jul 08 '19 00:07

@alexei-gl - can you check whether converting the GT input to lowercase (when using an induced grammar) improves the results by more than 1%? Base this on what you have for full GC, GL on LG-English, max_unparsed_words=99, MWC=1 (GL/GT): get the difference and review the produced parses with diff so we can decide how to handle this.

akolonin, Jul 08 '19 07:07

The LG dictionary library was fixed in PR #250. Results for lowercase input are here: https://docs.google.com/spreadsheets/d/1o-4acGPxkMIS6-xJDxwjqDWAwIt8qx2xPemu14IZRaU/edit?pli=1#gid=1580443221 (see rows 90-99).

alexei-gl, Aug 07 '19 02:08