grobid-ner

Adding periods in PERIOD entity

Open lfoppiano opened this issue 8 years ago • 4 comments

We need to support additional types of dates, both atomic dates and intervals. For example:

File 1. Policy guidelines for "cultural" events. Reports on cultural activities (most of the documents in German). 2.09.1942-1.04.1944. 219 pages.

File 2. Article by Bykovich in "Commonwealth in-arms" newspaper for volunteers (No. 2 from 11.12.1944) entitled "The Destruction of the center of the bandit movement in Crimea. Release of the civilian population from prison." International information for February 1944 (in German). 1944. 108 pages.

The extracted entities of type PERIOD should be something like:

2.09.1942-1.04.1944
from 11.12.1944
February 1944
1944

Tasks

  • [x] Add training data with more examples of periods
  • [x] Call the date model to normalise the period and extract dates from the chunk

lfoppiano, Jan 05 '17 12:01

I'm trying to make some progress on this subject and I have a few questions/statements to be checked/answered:

  1. The NER will be responsible only for identifying the PERIOD as such, without any information about whether it's an interval, a single value, etc.

  2. The date parser has to be adapted to recognise periods (from/to). Creating an intermediate model seems a bit overkill.

  3. I have created a training file and encoded such information. I tried to follow the grobid-quantities and TEI guidelines and this is the result:

<?xml version="1.0" encoding="UTF-8"?>
<dates>
    <date from="1944-05-22">vom <day>22</day>.<month>05</month>.<year>1944</year></date>
    <date from="1942-09-02" to="1944-04-01"><day>2</day>.<month>09</month>.<year>1942</year>-<day>1</day>.<month>04</month>.<year>1944</year></date>.
    <date when="1944-12-11">from <day>11</day>.<month>12</month>.<year>1944</year>)</date>
    <date when="1944-02"><month>February</month> <year>1944</year></date>
    <date when="1944"><year>1944</year></date>.
    <date when="1942"><year>1942</year></date>.
    <date from="1941" to="1945"><year>1941</year>-<year>45</year>.</date>
    <date from="1897" to="1941">from <year>1897</year> to <year>1941</year></date>. <date><year>1941</year>.</date>
    <date from="1941" to="1942"><year>1941</year>-<year>1942</year>.</date>
    <date from="1942-04-15" to="1944-03-23"><day>15</day>.<month>04</month>.<year>1942</year>-<day>23</day>.<month>03</month>.<year>1944</year></date>
    <date from="1943-01-04" to="1943-10-21"><day>4</day>.<month>01</month>.<year>1943</year>-<day>21</day>.<month>10</month>.<year>1943</year></date>
</dates>
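A file like this is easy to get slightly wrong by hand, so a quick consistency check between the attributes and the tagged content can help. The following is only a minimal sketch in Python, not part of grobid-ner: the helper names (`to_iso`, `iso_dates`, `find_mismatches`) are hypothetical, the sample is embedded for illustration, and it assumes numeric days/months and four-digit years (month names and abbreviated years such as "45" are not handled).

```python
import xml.etree.ElementTree as ET

# Small illustrative sample in the same encoding as the training file.
# The last entry is deliberately inconsistent so that the check fires.
SAMPLE = """<dates>
    <date from="1942-09-02" to="1944-04-01"><day>2</day>.<month>09</month>.<year>1942</year>-<day>1</day>.<month>04</month>.<year>1944</year></date>
    <date when="1944-12-11">from <day>11</day>.<month>12</month>.<year>1944</year>)</date>
    <date when="1944"><year>1942</year></date>
</dates>"""


def to_iso(parts):
    """Build YYYY, YYYY-MM or YYYY-MM-DD from a dict of tagged fields."""
    iso = parts["year"]
    if "month" in parts:
        iso += "-" + parts["month"].zfill(2)
        if "day" in parts:
            iso += "-" + parts["day"].zfill(2)
    return iso


def iso_dates(date_el):
    """Group the child tags into dates; a repeated tag starts a new date."""
    groups, current = [], {}
    for child in date_el:
        if child.tag in current:
            groups.append(current)
            current = {}
        current[child.tag] = (child.text or "").strip()
    if current:
        groups.append(current)
    return [to_iso(group) for group in groups]


def find_mismatches(xml_text):
    """Return (index, attribute values, values rebuilt from content) per mismatch."""
    mismatches = []
    for index, date_el in enumerate(ET.fromstring(xml_text).findall("date")):
        attrs = date_el.attrib
        if "when" in attrs:
            expected = [attrs["when"]]
        else:
            expected = [attrs[key] for key in ("from", "to") if key in attrs]
        found = iso_dates(date_el)
        if expected != found:
            mismatches.append((index, expected, found))
    return mismatches


mismatches = find_mismatches(SAMPLE)
for index, expected, found in mismatches:
    print(f"<date> #{index}: attributes {expected} but content {found}")
```

Entries without any attribute would also be reported, which may or may not be desired depending on how strictly the file should be annotated.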

I wonder whether an encoding using an additional <date> tag would work better or would be just over the top, something like:

<date from="1941" to="1942"><date when="1941"><year>1941</year></date>-<date when="1942"><year>1942</year></date>.</date>

@kermitt2 do you have any comments in this regard?

Thank you in advance ;-)

lfoppiano, Aug 10 '17 08:08

Hello!

So the current Date model in GROBID parses only single dates, and I think it's nice like that because there are many cases where we only have individual dates recognized by a higher-level model (for example in GROBID, for the bibliographic information, there is never more than one date). But OK, it handles nothing about intervals, lists...

So as grobid-ner is extracting more complex date chunks, you need an intermediary model - it makes sense, it's not overkill :)

I would say there would be two possible things to do, that's the way I was seeing it with grobid-ner:

  • try to use the quantity model from grobid-quantities to recognize the atomic/interval/list of dates, then use the GROBID date model to parse the involved dates; this is easy to experiment with -> this might not work very well because 1) the quantity model is for normal text, not a restricted chunk, and 2) there are really not a lot of dates in the grobid-quantities training data -> as a variant, you could train an alternative quantity model with just the portion of text where the quantities occur (so that it is also trained on chunks; this would be useful too for parsing quantity queries expressed in natural language)

  • create an additional intermediary date model for that, which is not overkill, but most likely the clean way to do it with respect to the overall "cascading" approach. Here you could use the available training data in grobid-quantities with dates to bootstrap a model (and your training example above).

Regarding the annotation, I think you have to fall back to the GROBID parsing of individual dates, so the first group of annotations in 3. will not work in the cascading approach: you need to find explicit individual dates, which means the second approach is much better.

Basically for the intermediary model, you would need only this - just follow the same guidelines as grobid quantities for annotating the chunk to make everything reusable:

<measure type="interval"><date from-iso="1941">1941</date> to <date to-iso="1942">1942</date></measure>

What's under the <date> element above will go to the GROBID date parser for identifying years, months, etc., so you don't need to indicate year/month/date/hour/etc.
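To make the cascade concrete, here is a rough sketch in Python of how the two stages could fit together. It is an illustration under assumptions, not GROBID code: `parse_single_date` is a hypothetical regex stand-in for the GROBID date model (it only understands "YYYY" and "D.MM.YYYY"), and `cascade` is a made-up name for the glue between the intermediary annotation and the single-date parser.

```python
import re
import xml.etree.ElementTree as ET

# The intermediary annotation from the comment above.
ANNOTATION = (
    '<measure type="interval"><date from-iso="1941">1941</date>'
    ' to <date to-iso="1942">1942</date></measure>'
)


def parse_single_date(chunk):
    """Hypothetical stand-in for the single-date model.

    Only handles 'YYYY' and 'D.MM.YYYY'; returns None for anything else.
    """
    match = re.fullmatch(r"(?:(\d{1,2})\.(\d{1,2})\.)?(\d{4})", chunk.strip())
    if match is None:
        return None
    day, month, year = match.groups()
    return {"year": year, "month": month, "day": day}


def cascade(measure_xml):
    """Stage 1 output (the <measure> markup) feeds stage 2 (single-date parsing)."""
    measure = ET.fromstring(measure_xml)
    dates = []
    for date_el in measure.iter("date"):
        # The intermediary model only delimits each date chunk and its role.
        if "from-iso" in date_el.attrib:
            role = "from"
        elif "to-iso" in date_el.attrib:
            role = "to"
        else:
            role = "when"
        # The single-date parser then resolves the fields inside the chunk.
        dates.append((role, parse_single_date(date_el.text or "")))
    return {"type": measure.get("type"), "dates": dates}


interval = cascade(ANNOTATION)
print(interval)
```

The point of the split is that the intermediary model never needs to know about years, months or days: it only marks where each individual date sits and which end of the interval it is.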

kermitt2, Aug 10 '17 16:08

Thanks @kermitt2 :-)

lfoppiano, Aug 10 '17 22:08

OK, so if a new module is developed, where should it be placed?

In terms of genericity, it's a pretty standard/generic module (it could be applied before the date module), so my first thought was grobid-core. However, at the moment it's something used only in grobid-ner, so my second thought was to place it in this project.

Any advice?

lfoppiano, Aug 14 '17 15:08