Extracting entities inside an entity
Does anyone know how to write a custom parser that extracts a named entity nested inside another entity?
For example, from the following sentence I want to extract 'boiling', which sits inside the prefix entity.
d = Sentence('Synthesis of 2,4,6-trinitrotoluene (3a).The procedure was followed to yield a pale yellow solid (boiling point 240 °C)')
This is my attempt to write the parser:
class BoilingPoint(BaseModel):
    value = StringType()
    units = StringType()
    prefix = StringType()
    name = StringType()

Compound.boiling_points = ListType(ModelType(BoilingPoint))
prefix = (R(u'^b\.?p\.?$', re.I) | I(u'boiling')(u'name') + I(u'point')).add_action(join)(u'prefix')
units = (W(u'°') + Optional(R(u'^[CFK]\.?$')))(u'units').add_action(merge)
value = R(u'^\d+(\.\d+)?$')(u'value')
bp = (prefix + value + units)(u'bp')
class BpParser(BaseParser):
    root = bp

    def interpret(self, result, start, end):
        compound = Compound(
            boiling_points=[
                BoilingPoint(
                    value=first(result.xpath('./value/text()')),
                    units=first(result.xpath('./units/text()')),
                    prefix=first(result.xpath('./prefix/text()')),
                    name=first(result.xpath('./name/text()')),
                )
            ]
        )
        yield compound
Sentence.parsers = [BpParser()]
However, d.records.serialize() produces the following, with no name field:
[{'boiling_points': [{'value': '240', 'units': '°C', 'prefix': 'boiling point'}]}]
All you have to do is tweak the XPath you use to read the name element. Parse results are returned as a tree: whatever you assign to root is the root node, the elements that make up root are its child nodes, and so on.
So you would write name = first(result.xpath('./prefix/name/text()')), since name is a child of prefix.
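To illustrate the point about paths being relative to the root, here is a minimal sketch using only the standard library. The tree is hand-built as a stand-in for what the parser might return; the actual result tree may differ, for instance if an action rewrites the nodes.

```python
import xml.etree.ElementTree as ET

# Hand-built stand-in for a parse result (an assumption, not real parser output).
result = ET.fromstring(
    "<bp>"
    "<prefix><name>boiling</name> point</prefix>"
    "<value>240</value>"
    "<units>°C</units>"
    "</bp>"
)

# 'name' is not a direct child of the root, so this lookup finds nothing.
direct = result.findtext("./name")

# Walking through 'prefix' reaches the nested element.
nested = result.findtext("./prefix/name")

print(direct)
print(nested)
```

In this toy tree the first lookup comes back empty and the second returns the nested text, which is the same reason './name/text()' yields nothing in interpret().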
I tried that, but I am still getting the same output as before.
It might be the .add_action(join), then. It seems to merge all of the tokens and put them in the same node. It may not be the best solution, but the first thing that comes to mind is to capture boiling and point as separate elements and then join them within interpret(). I'm actually curious, so I'm about to run my own tests.
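A rough sketch of that idea, again with a hand-built tree and the standard library rather than the real parser. It assumes that, without the join action, each token keeps its own child node; the 'point' tag and the interpret_sketch helper are hypothetical names used only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical result tree when no join action collapses the prefix:
# each token stays in its own child node ('point' is an assumed tag name).
result = ET.fromstring(
    "<bp>"
    "<prefix><name>boiling</name><point>point</point></prefix>"
    "<value>240</value>"
    "<units>°C</units>"
    "</bp>"
)

def interpret_sketch(result):
    """Build the record fields, joining the prefix tokens by hand."""
    prefix_node = result.find("./prefix")
    # Collect the text of every token under 'prefix', skipping blanks.
    tokens = [t.strip() for t in prefix_node.itertext() if t.strip()]
    return {
        "value": result.findtext("./value"),
        "units": result.findtext("./units"),
        "prefix": " ".join(tokens),
        "name": result.findtext("./prefix/name"),
    }

record = interpret_sketch(result)
print(record)
```

Joining at read time keeps the nested name node intact while still producing the full prefix string. Whether the real grammar yields a tree of this shape would need to be checked against the actual parse result.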
Thanks for the suggestion! I haven't worked with interpret(). I am going to start experimenting with it.