grobid-ner
question about INSTALLATION
- What should we do with concentration camp names? We were thinking maybe INSTALLATION, but it seems odd, and the cases are diverse: sometimes a camp is designated only by its location name, for example. Here are some occurrences:
- <ENAMEX type="INSTALLATION">Lager Nordhausen</ENAMEX>
- <ENAMEX type="LOCATION">Mittelbau-Dora</ENAMEX>
- <ENAMEX type="LOCATION">Mauthausen-Gusen</ENAMEX> concentration camp
And what about ghetto names, for example the Warsaw Ghetto?
I think that, in context, they all refer to the facilities (not the cities), so INSTALLATION.
I would say a ghetto name is a LOCATION: it's a city district.
I'm reopening this issue to ask a question about the second-layer annotation, regarding mentions of concentration camps.
In the file Wikipedia_holocaust.1.en.training.2layers.xml, there is this sentence:
Some [physicians] carried out experiments at Auschwitz, Dachau, Buchenwald, Ravensbrück, Sachsenhausen, and Natzweiler concentration camps
Annotated in the first annotation as:
Some [physicians] carried out experiments at <ENAMEX type="INSTALLATION">Auschwitz,
Dachau, Buchenwald, Ravensbrück, Sachsenhausen, and Natzweiler concentration camps</ENAMEX>
In this sentence, I'm not sure how to annotate subtypes:
Some carried out experiments at <ENAMEX type="INSTALLATION"><ENAMEX subType="2"
type="LOCATION">Auschwitz</ENAMEX>, <ENAMEX subType="2" type="LOCATION">
Dachau</ENAMEX> (...), and <ENAMEX subType="2" type="INSTALLATION">Natzweiler
concentration camps</ENAMEX></ENAMEX>
or just
Some carried out experiments at <ENAMEX type="INSTALLATION">Auschwitz, Dachau, and
Natzweiler concentration camps</ENAMEX>
(since it would be wrong to annotate only LOCATION subtypes)?
Thank you :)
Reminder about LOCATION/INSTALLATION with concentration/extermination camps:
- Auschwitz, Dachau, etc. were primarily LOCATIONs, so when they appear alone and clearly refer to the locations, we annotate them as LOCATIONs, for example:
[extermination camps] were established at <ENAMEX type="LOCATION"><ENAMEX subType="2" type="LOCATION">Auschwitz</ENAMEX>, <ENAMEX subType="2" type="LOCATION">Belzec</ENAMEX> (...), and <ENAMEX subType="2" type="LOCATION">Treblinka</ENAMEX></ENAMEX>
- More often, we find mentions of the concentration/extermination camps themselves, with or without the words "camp"/"concentration camp", and these are INSTALLATIONs, for example (made-up example):
they were killed at <ENAMEX type="INSTALLATION">Auschwitz concentration camp</ENAMEX>
Given the subtype is quite clear, I would say:
Some carried out experiments at <ENAMEX type="INSTALLATION"><ENAMEX subType="2"
type="LOCATION">Auschwitz</ENAMEX>, <ENAMEX subType="2" type="LOCATION">
Dachau</ENAMEX> (...), and <ENAMEX subType="2"
type="LOCATION">Natzweiler</ENAMEX> concentration camps</ENAMEX>
OK, but we can't annotate the subtypes if none of them is the same as the general type (here the subtypes would all be LOCATION, whereas the general type is INSTALLATION). So it would be the second option (Some carried out experiments at <ENAMEX type="INSTALLATION">Auschwitz, Dachau, and Natzweiler concentration camps</ENAMEX>).
Is that OK?
Hmm, I don't understand why we can't annotate the subtypes if none of them is the same as the general type. That's the general case, no? Like president of the United States.
From what I understood about this subtype annotation, that case would be OK because in president of the United States, the general type is TITLE and the first subtype is also TITLE:
<ENAMEX type="TITLE"><ENAMEX subType="2" type="TITLE">president</ENAMEX>
of the <ENAMEX subType="2" type="LOCATION">United States</ENAMEX></ENAMEX>
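The rule being discussed here ("only annotate subtypes when at least one of them has the same type as the general type") can be written down as a small check. This is only a sketch of the rule as described in this thread, not something grobid-ner provides; the function names `inner_types` and `keeps_outer_type` are made up for the illustration.

```python
import xml.etree.ElementTree as ET

def inner_types(outer):
    """Types of the ENAMEX elements nested directly inside `outer`."""
    return [child.get("type") for child in outer.findall("ENAMEX")]

def keeps_outer_type(outer):
    """True when at least one nested entity has the same type as the outer one."""
    return outer.get("type") in inner_types(outer)

# The TITLE example from this thread.
title = ET.fromstring(
    '<ENAMEX type="TITLE"><ENAMEX subType="2" type="TITLE">president</ENAMEX> '
    'of the <ENAMEX subType="2" type="LOCATION">United States</ENAMEX></ENAMEX>'
)

# The concentration camp example, with only LOCATION subtypes.
camps = ET.fromstring(
    '<ENAMEX type="INSTALLATION">'
    '<ENAMEX subType="2" type="LOCATION">Auschwitz</ENAMEX>, '
    '<ENAMEX subType="2" type="LOCATION">Dachau</ENAMEX>, and '
    '<ENAMEX subType="2" type="LOCATION">Natzweiler</ENAMEX> '
    'concentration camps</ENAMEX>'
)

print(keeps_outer_type(title))  # True: TITLE appears among the subtypes
print(keeps_outer_type(camps))  # False: all subtypes are LOCATION
```

Under that rule the TITLE example keeps its subtypes and the camp example would not, which is exactly the asymmetry questioned in the following comments.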
Well, if we follow this logic, then concentration camps is the INSTALLATION subtype:
Some carried out experiments at <ENAMEX type="INSTALLATION"><ENAMEX subType="2"
type="LOCATION">Auschwitz</ENAMEX>, <ENAMEX subType="2" type="LOCATION">
Dachau</ENAMEX> (...), and <ENAMEX subType="2" type="LOCATION">Natzweiler</ENAMEX> <ENAMEX subType="2" type="INSTALLATION">concentration camps</ENAMEX></ENAMEX>
However, I think it does not make much sense, as concentration camps is not a named entity on its own. Maybe that is not the case for President of the United States, because there we do have a title, but it's a very common case imho: the university of Washington, alpha B2 proteins, the United Nations headquarters, and so on... For nested entities, the actual head of the NE is often not a NE, it's a common noun.
"For nested entities, the actual head of the NE is often not a NE, it's a common noun"
Absolutely, and in those cases we didn't annotate subtypes. Is that wrong?
Well, I would not say wrong :D but I don't see why not: the subtype is easy and useful to annotate:
<ENAMEX type="INSTALLATION"><ENAMEX type="ORGANISATION">United Nations</ENAMEX> headquarters</ENAMEX>
Same for the concentration camps: it's nice to have the actual locations in the nested NE.
But did I misunderstand or overlook something? What was the reason for ignoring these nested NEs?
The goal, as I see it, is to have two layers: the largest entity matches (type: "lsm") and the "normal" annotations (no type).
For our purposes (grobid-ner) they cannot be used at the same time. In combination with the Idillia corpus we should use the normal annotations, though the LSM layer could be used as well for other tasks (even if there is not much data available).
The first idea is to split the longest entity matches that are separated by commas, then try to split composed annotations (e.g. The President of the United States Barack Obama).
On the other hand, if we go into too much detail, we end up getting a lot of LOCATION, PERSON and ORGANISATION and losing a lot of EVENT, CONCEPT, TITLE and so on. Therefore, in an effort to find a balance, the rule "do not split when the outer annotation is lost" was consistent enough.
For example, we would lose a lot of EVENTs if we split them into their inner annotations, as they often contain a LOCATION, a PERIOD, etc.
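To make the concern concrete, here is a small sketch that counts entity types before and after replacing each outer annotation with its inner ones. It reuses two examples from this thread; the helper name `type_counts` and the wrapping `<s>` element are assumptions made for the illustration, not part of the corpus format.

```python
from collections import Counter
import xml.etree.ElementTree as ET

def type_counts(xml_fragment):
    """Count types of the outermost entities and of the innermost entities."""
    # Wrap the fragment so that it parses as a single XML element.
    root = ET.fromstring(f"<s>{xml_fragment}</s>")
    # Outer layer: ENAMEX elements directly under the sentence.
    outer = Counter(e.get("type") for e in root.findall("ENAMEX"))
    # Flattened layer: ENAMEX elements that contain no other ENAMEX.
    flat = Counter(
        e.get("type") for e in root.iter("ENAMEX") if e.find("ENAMEX") is None
    )
    return outer, flat

fragment = (
    '<ENAMEX type="INSTALLATION"><ENAMEX type="ORGANISATION">United Nations'
    '</ENAMEX> headquarters</ENAMEX> and the '
    '<ENAMEX type="TITLE"><ENAMEX type="TITLE">president</ENAMEX> of the '
    '<ENAMEX type="LOCATION">United States</ENAMEX></ENAMEX>'
)

outer, flat = type_counts(fragment)
print(outer)  # INSTALLATION and TITLE, once each
print(flat)   # ORGANISATION, TITLE and LOCATION, once each
```

In the flattened count INSTALLATION has disappeared while TITLE survives, because one of its inner annotations is itself a TITLE. That is the situation the "do not split when the outer annotation is lost" rule was meant to avoid.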
Hmm, I have to say I don't understand the reasoning at all.
We have two layers of annotations: the "largest" NE and the nested NEs within the largest NE.
This is well known, and we could have 3 levels, but let's say we limit ourselves to 2. What's the point of NOT annotating the nested NEs uniformly? The two levels are independent of each other (a nested NE does not override the global NE). What does "losing a lot of event, concept, title" mean?
I also don't understand what is being split: we have a nested structure. Do you mean we could have different hierarchical phrases under the same "largest" NE? There is only one valid dependency tree for a text in context.
The two levels of annotations can be understood as a hierarchical structure, so a cascade in the CRF. It is exactly like a syntactic tree with NP at the first level and, for instance, D Adj N at the second level: each level is expected to introduce a different set of components, with hierarchical relations, and these should not be considered in a flattened manner.
One concrete problem with not annotating the nested NEs consistently is that we cannot simply train two different models, one for the "largest" NE and one in cascade for the nested NE, or one that works at the lowest level and one at the largest level. For instance, if we have in one case:
<ENAMEX type="INSTALLATION">Auschwitz concentration camps</ENAMEX>
and in another
<ENAMEX type="LOCATION"><ENAMEX type="LOCATION">Auschwitz</ENAMEX> <ENAMEX type="LOCATION">Marktplatz</ENAMEX></ENAMEX>
how can an annotator learn consistently to sometimes annotate a location and sometimes not (while in both cases it's a location, and the same location) without creating ad hoc models for each largest NE type?
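As a sketch of what "two models in cascade" would need as input, the nested annotation can be read as two parallel label columns, one per level. This is a simplified illustration, not how grobid-ner builds its training data: the function name `layers` is made up, tokens are split on whitespace only, every element is assumed to be an ENAMEX, and no BIO prefixes are produced.

```python
import xml.etree.ElementTree as ET

def layers(sentence_xml):
    """Return (token, outer_type, inner_type) triples for one sentence."""
    # Wrap the sentence so that it parses as a single XML element.
    root = ET.fromstring(f"<s>{sentence_xml}</s>")
    rows = []

    def walk(node, outer, inner):
        # Text that appears directly inside this node, before any child.
        for tok in (node.text or "").split():
            rows.append((tok, outer, inner))
        for child in node:
            if outer == "O":
                # First ENAMEX met on the way down: this is the largest NE.
                walk(child, child.get("type"), "O")
            else:
                # Already inside a largest NE: this is a nested NE.
                walk(child, outer, child.get("type"))
            # Text following the child still belongs to the current node.
            for tok in (child.tail or "").split():
                rows.append((tok, outer, inner))

    walk(root, "O", "O")
    return rows

example = (
    'at <ENAMEX type="INSTALLATION"><ENAMEX type="LOCATION">Auschwitz'
    '</ENAMEX> concentration camp</ENAMEX>'
)
for row in layers(example):
    print(row)
```

With consistent nested annotation, the second column can train one model for the "largest" level and the third column a second model for the nested level. If Auschwitz is labelled as a nested LOCATION in some sentences and left unlabelled in others, the third column becomes contradictory for the same token in similar contexts.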
I see your point. Given the time left, the attempt was more to "amend" the first version done in the first three months (many annotations were simply matching too much) so that it is compatible with the Idillia corpus.
The subType still has to be reassigned only to the Largest Entity Matching entities.