petrarch2
Strange output format for phrase extraction.
Trying to run some sample data to explore the phrase extraction pieces. I'm using the following data:
{'abc123': {'meta': {'date': '20010101'},
'sents': {0: {'content': u'At least 37 people are dead after Islamist radical group Boko Haram assaulted a town in northeastern Nigeria .',
'parsed': u'(ROOT (S (NP (QP (IN AT ) (JJS LEAST ) (CD 37 ) ) (NNS PEOPLE ) ) (VP (VBP ARE ) (ADJP (JJ DEAD ) ) (SBAR (IN AFTER ) (S (NP (JJ ISLAMIST ) (JJ RADICAL ) (NN GROUP ) (NNP BOKO ) (NNP HARAM ) ) (VP (VBD ASSAULTED ) (NP (NP (DT A ) (NN TOWN ) ) (PP (IN IN ) (NP (JJ NORTHEASTERN ) (NNP NIGERIA ) ) ) ) ) ) ) ) (. . ) ) )'}}}}
I then run it through the do_coding routine:
event_dict_updated = petrarch2.do_coding(event_dict, None)
Which yields the following updated dictionary:
{'abc123': {'meta': {'date': '20010101',
u'verbs': {u'nouns': [([u' PEOPLE'], [u'~PPL'], [[u'~']]),
([u' ISLAMIST', u' BOKO HARAM'],
[u'NGAREBMUS'],
[[u'~'], (u'NGAREB', [])]),
([u' NIGERIA'], [u'NGA'], [(u'NGA', [])])]}},
'sents': {0: {'content': u'At least 37 people are dead after Islamist radical group Boko Haram assaulted a town in northeastern Nigeria .',
'parsed': u'(ROOT (S (NP (QP (IN AT ) (JJS LEAST ) (CD 37 ) ) (NNS PEOPLE ) ) (VP (VBP ARE ) (ADJP (JJ DEAD ) ) (SBAR (IN AFTER ) (S (NP (JJ ISLAMIST ) (JJ RADICAL ) (NN GROUP ) (NNP BOKO ) (NNP HARAM ) ) (VP (VBD ASSAULTED ) (NP (NP (DT A ) (NN TOWN ) ) (PP (IN IN ) (NP (JJ NORTHEASTERN ) (NNP NIGERIA ) ) ) ) ) ) ) ) (. . ) ) )'}}}}
There are a couple of issues here:
- The nested meta, verbs, nouns construct is incorrect.
- It's unclear what, exactly, is associated with what. For example, it isn't clear what the [[u'~'], (u'NGAREB', [])] construct refers to in the sentence.
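To make the nesting problem in the first point concrete, here is a minimal sketch that walks a trimmed copy of the dictionary above and reports the key path at which nouns ends up. The helper name find_key_paths is hypothetical and is not part of petrarch2.

```python
def find_key_paths(obj, target, path=()):
    """Yield the key path to every occurrence of `target` in nested dicts."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            if key == target:
                yield path + (key,)
            # Recurse into the value to catch deeper occurrences.
            for found in find_key_paths(value, target, path + (key,)):
                yield found


# Trimmed copy of the updated dictionary shown above (one nouns entry kept).
event_dict_updated = {
    'abc123': {
        'meta': {
            'date': '20010101',
            u'verbs': {u'nouns': [([u' NIGERIA'], [u'NGA'], [(u'NGA', [])])]},
        },
        'sents': {0: {}},
    }
}

paths = list(find_key_paths(event_dict_updated, 'nouns'))
print(paths)
```

On this data the only path found runs through meta and then verbs, which is the nesting described in the first point.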
This isn't relevant to this issue, but it should also be noted that this sentence doesn't code an event even though PETR is clearly identifying potential source and target actors and "assaulted" should be a relevant verb.
cc @philip-schrodt @ahalterman
It should also probably be noted that things like:
{(u'---COPLEG', u'---GOV', u'041'): [[u'CALLED'], [u'HAS']],
u'actorroot': {(u'---COPLEG', u'---GOV', u'041'): [u'', u'']},
u'actortext': {(u'---COPLEG', u'---GOV', u'041'): [u'deputy ... Congress',
u'Governor']},
u'eventtext': {(u'---COPLEG', u'---GOV', u'041'): u'has called'},
u'nouns': [([u' CONGRESS'], [u'~LEG'], [[u'~']]),
([u' DEPUTY', u' CONGRESS'], [u'~COPLEG'], [[u'~'], [u'~']]),
([u' GOVERNOR'], [u'~GOV'], [[u'~']]),
([u' ADMINISTRATION'], [u'~GOV'], [[u'~']])]}
cause hell if you're trying to dump that result out to JSON since the keys, e.g., (u'---COPLEG', u'---GOV', u'041'), are tuples, and JSON object keys have to be strings.
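A minimal sketch of the failure and one possible stopgap on the consumer side, using a trimmed copy of the output above. Python's json.dumps raises TypeError when a dict key is a tuple. The helper stringify_keys and the "|" separator are assumptions made for illustration; they are not part of petrarch2 or hypnos, and this is not a proposal for how the output format itself should be fixed.

```python
import json


def stringify_keys(obj, sep="|"):
    """Return a copy of obj with tuple dict keys joined into strings."""
    if isinstance(obj, dict):
        out = {}
        for key, value in obj.items():
            if isinstance(key, tuple):
                # Assumes the tuple elements are plain strings, as in the output above.
                key = sep.join(str(part) for part in key)
            out[key] = stringify_keys(value, sep)
        return out
    if isinstance(obj, (list, tuple)):
        # JSON has no tuple type, so tuples become lists either way.
        return [stringify_keys(item, sep) for item in obj]
    return obj


# Trimmed copy of the output shown above.
result = {
    (u'---COPLEG', u'---GOV', u'041'): [[u'CALLED'], [u'HAS']],
    u'eventtext': {(u'---COPLEG', u'---GOV', u'041'): u'has called'},
    u'nouns': [([u' GOVERNOR'], [u'~GOV'], [[u'~']])],
}

try:
    json.dumps(result)
    dump_failed = False
except TypeError:
    # The encoder rejects the tuple key.
    dump_failed = True

safe = stringify_keys(result)
encoded = json.dumps(safe, sort_keys=True)
```

Joining the tuple into one string loses the source/target/code structure unless the reader splits on the same separator, so a nested dict or a list of records may be a better long-term shape.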
Re-upping this since I discovered it again. The tuples as keys thing needs to be fixed ASAP since it's breaking hypnos.