
Unsupervised Parser Challenge for Gutenberg Children corpus

Open akolonin opened this issue 5 years ago • 0 comments

The goal of the challenge is to have an unsupervisedly trained parser produce parses that approximate the "expected" English parses as closely as possible, using the cleaned Gutenberg Children corpus data as input and Link Grammar English parses in three forms as references.

Input: http://langlearn.singularitynet.io/data/cleaned/English/Gutenberg-Children-Books/lower_LGEng_token/ (that is "cleaned" Gutenberg Children corpus data tokenized with Link Grammar English tokenization rules)
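As a rough illustration of how the input could be read, here is a minimal Python sketch. It assumes, as the requirements in this issue state, that a space separates words and a double linefeed separates sentences. The function name `read_sentences` and the sample text are hypothetical and are not part of the challenge code.

```python
import re


def read_sentences(text):
    """Split corpus text into sentences, each returned as a list of tokens."""
    # Normalize Windows line endings so "\r\n\r\n" also counts as a double linefeed.
    text = text.replace("\r\n", "\n")
    sentences = []
    # Two or more consecutive linefeeds end a sentence.
    for block in re.split(r"\n{2,}", text):
        # The corpus is already tokenized, so whitespace splitting is enough.
        # Tolerating a stray single linefeed inside a block is an assumption
        # of this sketch, not something the challenge specifies.
        tokens = block.split()
        if tokens:
            sentences.append(tokens)
    return sentences


if __name__ == "__main__":
    sample = "the cat sat on the mat .\n\nit was a fine day .\n"
    for tokens in read_sentences(sample):
        print(tokens)
```

Each sentence is returned separately, so a trainer built on top of this can process sentences one at a time without carrying state between them.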

References:

  1. http://langlearn.singularitynet.io/data/parses/English/Gutenberg-Children-Books/LG5.5.1/capital/parses/ (the "bronze standard": the corpus above parsed with the Link Grammar English dictionary; tokenization is done in a slightly different way, which can be ignored when comparing results)
  2. http://langlearn.singularitynet.io/data/parses/English/Gutenberg-Children-Books/test/GC_LGEnglish_noQuotes_fullyParsed.ull (the "silver standard": the previous parses gathered in one file, keeping only sentences that are 100% parsed with the Link Grammar English dictionary and contain no direct speech fragments)
  3. http://langlearn.singularitynet.io/data/parses/English/Gutenberg-Children-Books/test/GC_LGEnglish_noQuotes_manual.ull (the "gold standard": 200+ sentences randomly selected from the previous parses and reviewed by a human, with the links validated)

Requirements:

  1. The unsupervisedly trained parser should be trained on the input corpus using the same tokenization, assuming that a space is the word separator and a double linefeed is the sentence separator.
  2. The unsupervisedly trained parser should be trained on a per-sentence basis, with no mutual influence between adjacent sentences.
  3. The output parses for each of the reference files should have file names identical to those in the reference data.
  4. Letter case (lower/capital) should be ignored, as the evaluation process ignores case.
  5. If the parser provides parses in "phrase structure grammar" (PSG) form (linking words as well as compound phrases, like http://demo.chaoticlanguage.com/), unlike the "link grammar" form (linking only words), its parses should be converted to the "link grammar" form.
  6. Sample code for writing parses in the ULL format used by the reference parses is available here:
  • Scheme: https://github.com/singnet/learn/blob/1b7220f066866e9ada13c96376ab7f87ee53a1aa/run-poc/redefine-mst-parser.scm#L148
  • Java: https://github.com/aigents/aigents-java/blob/master/src/main/java/net/webstructor/gram/main/LexStructor.java#L548
  7. The links from LEFT-WALL in the expected parses may be ignored and need not be produced, because links from LEFT-WALL and links to the ending period are not involved in the evaluation of the results.
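To make the requirements on output and evaluation more concrete, here is a hedged Python sketch. It assumes a ULL layout of one sentence line followed by one line per link in the form `left-index left-word right-index right-word`, with LEFT-WALL at index 0 written as `###LEFT-WALL###`. That layout is an assumption of this sketch: the Scheme and Java samples linked above, together with the reference files, are the authoritative description. The function names `format_ull` and `comparable_links` are hypothetical, and `comparable_links` only illustrates the case and LEFT-WALL/period rules, not the actual evaluation code.

```python
LEFT_WALL = "###LEFT-WALL###"  # assumed spelling of the wall token


def format_ull(tokens, links):
    """Render one sentence and its links in the assumed ULL layout.

    links holds (i, j) pairs of 1-based token positions; 0 means LEFT-WALL.
    """
    words = [LEFT_WALL] + list(tokens)
    lines = [" ".join(tokens)]
    # Order each link left-to-right, then sort links for stable output.
    for i, j in sorted((min(a, b), max(a, b)) for a, b in links):
        lines.append(f"{i} {words[i]} {j} {words[j]}")
    return "\n".join(lines) + "\n"


def comparable_links(tokens, links):
    """Reduce links to a lower-cased set without wall or final-period links."""
    words = [LEFT_WALL] + [t.lower() for t in tokens]
    last = len(tokens)
    kept = set()
    for a, b in links:
        i, j = min(a, b), max(a, b)
        if i == 0:
            continue  # link from LEFT-WALL: not evaluated
        if j == last and words[j] == ".":
            continue  # link to the ending period: not evaluated
        kept.add((i, words[i], j, words[j]))
    return kept


if __name__ == "__main__":
    tokens = ["tuna", "has", "fin", "."]
    links = [(0, 2), (1, 2), (2, 3), (2, 4)]
    print(format_ull(tokens, links))
    print(sorted(comparable_links(tokens, links)))
```

When writing a whole file, a blank line after each `format_ull` result keeps sentences separated by a double linefeed, matching the separator used in the input.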

Other information:

  • Sample parser code in Scheme: https://github.com/singnet/learn
  • Sample parser code in Java: https://github.com/aigents/aigents-java/blob/master/src/main/java/net/webstructor/gram/main/LexStructor.java#L649

akolonin, Jun 05 '19 10:06