petrarch2

Strange output format for phrase extraction.

Open johnb30 opened this issue 8 years ago • 2 comments

Trying to run some sample data to explore the phrase extraction pieces. I'm using the following data:

{'abc123': {'meta': {'date': '20010101'},
  'sents': {0: {'content': u'At least 37 people are dead after Islamist radical group Boko Haram assaulted a town in northeastern Nigeria .',
    'parsed': u'(ROOT (S (NP (QP (IN AT ) (JJS LEAST ) (CD 37 ) ) (NNS PEOPLE ) ) (VP (VBP ARE ) (ADJP (JJ DEAD ) ) (SBAR (IN AFTER ) (S (NP (JJ ISLAMIST ) (JJ RADICAL ) (NN GROUP ) (NNP BOKO ) (NNP HARAM ) ) (VP (VBD ASSAULTED ) (NP (NP (DT A ) (NN TOWN ) ) (PP (IN IN ) (NP (JJ NORTHEASTERN ) (NNP NIGERIA ) ) ) ) ) ) ) ) (. . ) ) )'}}}}

I then run it through the do_coding routine:

event_dict_updated = petrarch2.do_coding(event_dict, None)

Which yields the following updated dictionary:

{'abc123': {'meta': {'date': '20010101',
   u'verbs': {u'nouns': [([u' PEOPLE'], [u'~PPL'], [[u'~']]),
     ([u' ISLAMIST', u' BOKO HARAM'],
      [u'NGAREBMUS'],
      [[u'~'], (u'NGAREB', [])]),
     ([u' NIGERIA'], [u'NGA'], [(u'NGA', [])])]}},
  'sents': {0: {'content': u'At least 37 people are dead after Islamist radical group Boko Haram assaulted a town in northeastern Nigeria .',
    'parsed': u'(ROOT (S (NP (QP (IN AT ) (JJS LEAST ) (CD 37 ) ) (NNS PEOPLE ) ) (VP (VBP ARE ) (ADJP (JJ DEAD ) ) (SBAR (IN AFTER ) (S (NP (JJ ISLAMIST ) (JJ RADICAL ) (NN GROUP ) (NNP BOKO ) (NNP HARAM ) ) (VP (VBD ASSAULTED ) (NP (NP (DT A ) (NN TOWN ) ) (PP (IN IN ) (NP (JJ NORTHEASTERN ) (NNP NIGERIA ) ) ) ) ) ) ) ) (. . ) ) )'}}}}

There are a couple of issues here:

  1. The nested meta, verbs, nouns construct is incorrect.
  2. It's unclear what, exactly, is associated with what. For example, it isn't clear what the [[u'~'], (u'NGAREB', [])] construct refers to in the sentence.
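To make both points concrete, here is a minimal sketch that walks the output exactly as it is currently nested. The data is copied from the updated dictionary above (with the `sents` entry left out); the loop variable names `first`, `second`, and `third` are placeholders only, since what each position means is precisely what is undocumented.

```python
# Output copied from the updated dictionary above; 'sents' omitted for brevity.
event_dict_updated = {
    'abc123': {
        'meta': {
            'date': '20010101',
            'verbs': {
                'nouns': [
                    ([' PEOPLE'], ['~PPL'], [['~']]),
                    ([' ISLAMIST', ' BOKO HARAM'],
                     ['NGAREBMUS'],
                     [['~'], ('NGAREB', [])]),
                    ([' NIGERIA'], ['NGA'], [('NGA', [])]),
                ]
            },
        }
    }
}

# Issue 1: the noun phrases sit under meta -> verbs -> nouns.
nouns = event_dict_updated['abc123']['meta']['verbs']['nouns']

# Issue 2: each entry is a 3-tuple whose positions are not labelled.
rows = []
for first, second, third in nouns:
    rows.append((first, second, third))
    print(first, second, third)
```

Note that the third element mixes lists (`['~']`) and tuples (`('NGAREB', [])`) in the same list, which is part of what makes the association hard to read.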

This isn't relevant to this issue, but it should also be noted that this sentence doesn't code an event even though PETR is clearly identifying potential source and target actors and "assaulted" should be a relevant verb.

cc @philip-schrodt @ahalterman

johnb30, Jun 07 '16 16:06

It should also probably be noted that things like:

{(u'---COPLEG', u'---GOV', u'041'): [[u'CALLED'], [u'HAS']],
 u'actorroot': {(u'---COPLEG', u'---GOV', u'041'): [u'', u'']},
 u'actortext': {(u'---COPLEG', u'---GOV', u'041'): [u'deputy ... Congress',
   u'Governor']},
 u'eventtext': {(u'---COPLEG', u'---GOV', u'041'): u'has called'},
 u'nouns': [([u' CONGRESS'], [u'~LEG'], [[u'~']]),
  ([u' DEPUTY', u' CONGRESS'], [u'~COPLEG'], [[u'~'], [u'~']]),
  ([u' GOVERNOR'], [u'~GOV'], [[u'~']]),
  ([u' ADMINISTRATION'], [u'~GOV'], [[u'~']])]}

cause hell if you're trying to dump that result out to JSON, since the keys, e.g., (u'---COPLEG', u'---GOV', u'041'), are tuples and JSON object keys must be strings.
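The failure is easy to reproduce with a trimmed copy of the dictionary above: `json.dumps` raises `TypeError` on tuple keys. As a stopgap on the consumer side, the sketch below joins each tuple key into a single string before dumping. The helper name `stringify_keys` and the `|` separator are assumptions made for illustration, not anything petrarch2 provides.

```python
import json

# Trimmed copy of the result above; the tuple keys are what break the dump.
result = {
    ('---COPLEG', '---GOV', '041'): [['CALLED'], ['HAS']],
    'actorroot': {('---COPLEG', '---GOV', '041'): ['', '']},
    'eventtext': {('---COPLEG', '---GOV', '041'): 'has called'},
}


def stringify_keys(obj, sep='|'):
    """Recursively join tuple keys into strings (hypothetical helper)."""
    if isinstance(obj, dict):
        return {
            (sep.join(key) if isinstance(key, tuple) else key):
                stringify_keys(value, sep)
            for key, value in obj.items()
        }
    if isinstance(obj, (list, tuple)):
        # Tuples used as values become lists, as JSON would encode them anyway.
        return [stringify_keys(item, sep) for item in obj]
    return obj


# Dumping the raw result fails because of the tuple keys.
try:
    json.dumps(result)
    dump_failed = False
except TypeError:
    dump_failed = True

# After converting the keys, the dump goes through.
encoded = json.dumps(stringify_keys(result), sort_keys=True)
print(encoded)
```

This only works around the symptom. A reader of the JSON has to split the joined key again to recover source, target, and event code, so emitting string keys (or a list of records) from the coder itself would be the cleaner fix.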

johnb30, Jun 14 '16 17:06

Re-upping this since I ran into it again. The tuples-as-keys problem needs to be fixed ASAP since it's breaking hypnos.

johnb30, Nov 14 '16 19:11