text-anonymization-benchmark TAB Pre Annotation Phase

TAB Pre Annotation Phase

Open golankai opened this issue 1 year ago • 2 comments

Dear researcher, thank you very much for this amazing work!

I would like to extend the dataset to other domains while preserving the annotation guidelines. To minimize differences and follow your work precisely, I would like to run the same pre-annotation procedure you used. Would you be willing to share more detail, or the code, for this part?

I am doing this as part of a research project at the TrustHLT Group, and it would be extremely beneficial to us!

Thank you very much, Kai.

Feb 08 '24 15:02 golankai

Hi Kai,

Sure, here is the Python code we used to pre-annotate the documents from the ECHR. As you can see, it essentially boils down to:

1. Running spaCy to get named entities, plus a few simple regular expressions to detect codes and dates
2. Correcting those entities with a few heuristics, and mapping the 18 OntoNotes categories to the privacy-oriented categories we had defined
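
To give a rough idea of what the regular-expression step and a correction heuristic can look like, here is a minimal sketch. It is not the contents of annotate.py: the two patterns, the function names `find_regex_spans` and `resolve_overlaps`, and the "longest span wins" rule are all assumptions made for illustration.

```python
import re

# Hypothetical patterns, not taken from annotate.py.
# Code-like strings such as application numbers, e.g. "12345/67".
CODE_RE = re.compile(r"\b\d{3,6}/\d{2}\b")

# Dates written as "3 March 1998".
MONTHS = ("January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December")
DATE_RE = re.compile(r"\b\d{1,2} (?:" + "|".join(MONTHS) + r") \d{4}\b")


def find_regex_spans(text):
    """Return (start, end, label) spans for code-like and date-like strings."""
    spans = []
    for label, pattern in (("CODE", CODE_RE), ("DATETIME", DATE_RE)):
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)


def resolve_overlaps(spans):
    """Assumed heuristic: when spans overlap, keep only the longest one."""
    kept = []
    # Visit the longest spans first so that they take priority.
    for span in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        if all(span[1] <= k[0] or span[0] >= k[1] for k in kept):
            kept.append(span)
    return sorted(kept)
```

In a pipeline like the one described, the regex spans would be merged with the named-entity spans before resolving overlaps, so that a date found by both sources is only annotated once.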

The code is really tailored to ECHR documents and their formatting though, so I’m not sure how useful it would be for other domains, apart perhaps from the mapping between the OntoNotes named-entity labels and the more privacy-oriented categories from TAB.
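
For another domain, the reusable part could look roughly like the sketch below. The 18 keys are the OntoNotes labels produced by spaCy's English models and the values are TAB categories, but the specific assignments are an illustrative guess, not the mapping in annotate.py, so please check them against the script. The names `ONTONOTES_TO_TAB`, `to_tab_labels` and `spacy_entities` are hypothetical, and the `en_core_web_sm` model is only an example choice.

```python
# Illustrative guess at a mapping; the actual one in annotate.py may differ.
ONTONOTES_TO_TAB = {
    "PERSON": "PERSON",
    "NORP": "DEM",          # nationalities, religious or political groups
    "FAC": "LOC",
    "ORG": "ORG",
    "GPE": "LOC",
    "LOC": "LOC",
    "PRODUCT": "MISC",
    "EVENT": "MISC",
    "WORK_OF_ART": "MISC",
    "LAW": "MISC",
    "LANGUAGE": "DEM",
    "DATE": "DATETIME",
    "TIME": "DATETIME",
    "PERCENT": "QUANTITY",
    "MONEY": "QUANTITY",
    "QUANTITY": "QUANTITY",
    "ORDINAL": "MISC",
    "CARDINAL": "QUANTITY",
}


def to_tab_labels(entities):
    """Map (text, ontonotes_label) pairs to (text, tab_label) pairs.

    Labels missing from the table fall back to MISC (an assumption).
    """
    return [(text, ONTONOTES_TO_TAB.get(label, "MISC")) for text, label in entities]


def spacy_entities(text, model="en_core_web_sm"):
    """Run spaCy NER and return (text, label) pairs.

    Requires spaCy and the chosen model to be installed.
    """
    import spacy  # imported lazily so the mapping is usable without spaCy

    nlp = spacy.load(model)
    return [(ent.text, ent.label_) for ent in nlp(text).ents]
```

A call such as `to_tab_labels(spacy_entities(document_text))` would then give pre-annotations to hand to the annotators for correction.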

Pierre

Feb 09 '24 10:02 plison

The file is here: https://github.com/NorskRegnesentral/text-anonymization-benchmark/blob/master/scripts/annotate.py

Feb 09 '24 10:02 plison
