language-learning
Using ats (@) and periods (.) for suffixes in Pre-Cleaner, MST-Parser, Grammar Learner and Link Grammar
A few problems:
- During iterative grammar learning, tagging words in the input corpus and input parses may be ambiguous if words with ats (@) in the parses and corpus are translated to words with Link Grammar (LG) suffixes using periods (.).
- Emails with inner ats (@) are "corrupted" by Grammar Learner (GL), which changes the ats to periods (.).
- There is some problem (TO BE EXPLAINED WITH DATA REFERENCES by @alexei-gl ) about up.'and and up@'and in Grammar Tester (GT).
- When the input corpus/parses contain words with ats, they are not recognised by GT (because they are stored with periods), which decreases the F1 metric.
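The ambiguity described in the list above can be illustrated with a minimal sketch. It assumes the recoding is a plain character substitution; `ats_to_periods` is a hypothetical stand-in, not the actual Grammar Learner code, and the email address is made up for illustration.

```python
def ats_to_periods(token: str) -> str:
    """Hypothetical recoding step: replace every '@' with '.'."""
    return token.replace("@", ".")


# Sample tokens: the pair from this issue plus an illustrative email.
tokens = ["up@'and", "up.'and", "user@example.com", "user.example.com"]

# Group the original tokens by their recoded form to expose collisions.
collisions = {}
for original in tokens:
    collisions.setdefault(ats_to_periods(original), []).append(original)

# Keep only recoded forms that more than one original token maps to.
ambiguous = {new: olds for new, olds in collisions.items() if len(olds) > 1}

for new, olds in ambiguous.items():
    print(f"{new!r} could have come from any of {olds}")
```

Under this assumption the substitution is lossy: once two distinct tokens map to the same string, the original spelling cannot be recovered, which is consistent with both the corrupted emails and the unrecognised words reported above.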
@glicerico , do you think that using the period (.) in the WSD process and not re-coding periods to ats in GL could eliminate all of these problems without causing other problems in MST-Parsing?
The Item 3 sample is located at http://langlearn.singularitynet.io/data/aglushchenko_parses/suffix-problem/ . The above-mentioned token can easily be found in a rule of the dictionary file.
Looks like the problem with up.'and and up@'and is not the

    akolonin@Ubuntu-1604-xenial-64-minimal:/home/aglushchenko/data/parses/suffix-problem$ grep -P ".'" dict_20C_2019-01-28_0006.4.0.dict | wc -l
    1
    akolonin@Ubuntu-1604-xenial-64-minimal:/home/aglushchenko/data/parses/suffix-problem$ grep -P "up.'and" dict_20C_2019-01-28_0006.4.0.dict | wc -l
    1
    grep -P "up.'and" test-corpus-06.txt.raw
    (dove)(,)(and)(flew)(up.'and)(into)(the)(air)(.)]
    grep -P ".'" test-corpus-06.txt.raw
    (dove)(,)(and)(flew)(up.'and)(into)(the)(air)(.)]
    grep -P ".\w" test-corpus-06.txt.raw | grep -v Found | grep -v Link
    (the)(man)(at)(the)(other)(end)(of)(them)(..y)]
    (as)(her)(..y)]
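One caveat about the grep checks above: in a Perl-style regular expression an unescaped period matches any character, so a pattern such as "up.'and" would also match up@'and. A minimal Python sketch of the difference follows; the two sample lines are made up for illustration and are not taken from the files in this issue.

```python
import re

# Illustrative lines only: one token with a period, one with an at sign.
lines = ["(flew)(up.'and)(into)", "(flew)(up@'and)(into)"]

# Unescaped '.' is a wildcard and matches any single character.
loose = re.compile(r"up.'and")
# re.escape makes the pattern match the literal text, period included.
strict = re.compile(re.escape("up.'and"))

loose_hits = [line for line in lines if loose.search(line)]
strict_hits = [line for line in lines if strict.search(line)]

print(len(loose_hits), "line(s) match the unescaped pattern")
print(len(strict_hits), "line(s) match the escaped pattern")
```

With grep, the same effect as the escaped pattern can be had by writing the period as `\.` or by using `grep -F` for a fixed-string search.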
@glicerico - in the MST-Parser version that you are crafting now, can we have MST-Parser configured so that it does not break words with an inner period?
@akolonin , the new tokenizer-less version of the observer and MST-parser only splits by spaces, so this should not be a problem.
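A rough sketch of why splitting only on spaces keeps such tokens intact. The sentence is reassembled from the parenthesised tokens quoted earlier in this thread; both functions are hypothetical illustrations and not the actual observer or MST-parser code.

```python
import re

# Sentence rebuilt from the token dump quoted above, space-separated.
sentence = "dove , and flew up.'and into the air ."


def split_on_spaces(text: str) -> list:
    """Whitespace-only splitting: inner periods and quotes are untouched."""
    return text.split()


def split_on_punctuation(text: str) -> list:
    """Hypothetical punctuation-aware tokenizer, shown for contrast only."""
    return re.findall(r"\w+|[^\w\s]", text)


print(split_on_spaces(sentence))
print(split_on_punctuation(sentence))
```

In this sketch the whitespace splitter returns up.'and as a single token, while the punctuation-aware one breaks it into four pieces (up, the period, the quote, and and).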