Schema: NER restriction

Open matyaskopp opened this issue 4 years ago • 3 comments

Current schema allows this situation:

<name type="LOC">
  <kinesic type="applause">
    <desc>Oklaski</desc>
  </kinesic>
</name>

https://github.com/clarin-eric/ParlaMint/blob/92ba447bf720cf48d038ec3044257534332f18a7/Schema/ParlaMint-TEI.ana.rng#L112-L121

The schema should be restricted in this way:

  • Every named entity should contain oneOrMore named entities or words,
  • and zeroOrMore comments.
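To make the intended restriction concrete, here is a minimal sketch of the same rule as a Python check rather than as a RelaxNG pattern. It is only an illustration under stated assumptions: the element names `w` for words and the set of "comment" elements (`note`, `kinesic`, `vocal`, `incident`, `gap`) are guesses, not taken from the schema, and the function names are hypothetical. Anything else inside `name` (punctuation elements, for example) is rejected here simply because the rule above does not mention it.

```python
import xml.etree.ElementTree as ET

# Assumed element names; the real schema may use different or additional ones.
ENTITY_TAG = "name"
WORD_TAG = "w"
COMMENT_TAGS = {"note", "kinesic", "vocal", "incident", "gap"}  # assumption


def local(tag):
    """Strip a '{namespace}' prefix, if any, from an element tag."""
    return tag.rsplit("}", 1)[-1]


def entity_is_valid(elem):
    """Check one <name>: at least one nested <name> or <w>, plus optional comments."""
    children = [local(child.tag) for child in elem]
    has_content = any(t in (ENTITY_TAG, WORD_TAG) for t in children)
    only_allowed = all(
        t in (ENTITY_TAG, WORD_TAG) or t in COMMENT_TAGS for t in children
    )
    return has_content and only_allowed


def invalid_entities(xml_text):
    """Return every <name> element (at any depth) that violates the rule."""
    root = ET.fromstring(xml_text)
    return [
        e for e in root.iter()
        if local(e.tag) == ENTITY_TAG and not entity_is_valid(e)
    ]


# The situation from this issue: a named entity containing only a comment.
bad = (
    '<name type="LOC"><kinesic type="applause">'
    "<desc>Oklaski</desc></kinesic></name>"
)
# A named entity with a word, optionally accompanied by a comment.
good = (
    '<name type="LOC"><w>Warszawa</w><kinesic type="applause">'
    "<desc>Oklaski</desc></kinesic></name>"
)

print(len(invalid_entities(bad)))   # number of violations in the bad example
print(len(invalid_entities(good)))  # number of violations in the good example
```

Nested entities are covered because every `name` element is checked on its own, so an inner entity without any word is reported even when the outer one looks fine.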

Related issue: #84

matyaskopp avatar May 25 '21 11:05 matyaskopp

I agree that this should be restricted, but:

  • Did you actually find such cases in the corpora? At least for the example that you gave, as far as I can see, it doesn't exist in the PL corpus. I would be surprised if it did, as incidents were excluded from annotation, so the system would in fact be annotating an empty string as NER.
  • It will make the content model more complicated. In fact, I'm not really sure how to implement such a restriction; I would have to study RelaxNG first.

Not saying I won't do it, just maybe not straight away.

TomazErjavec avatar May 25 '21 11:05 TomazErjavec

Did you actually find such cases in the corpora?

No, I constructed it based on a misunderstood example from #84.

It will make the content model more complicated. In fact, I'm not really sure how to implement such a restriction; I would have to study RelaxNG first.

I don't know either. (CZ NER has already made the schema quite complicated...)

Not saying I won't do it, just maybe not straight away.

OK, let's keep this issue for a future release.

matyaskopp avatar May 25 '21 11:05 matyaskopp

This is obviously "future"....

TomazErjavec avatar Sep 21 '23 18:09 TomazErjavec