text-anonymization-benchmark
TAB Pre Annotation Phase
Dear researcher, thank you very much for this amazing work!
I would like to extend the dataset to other domains while preserving the annotation guidelines. To minimize differences and follow your work precisely, I would like to run the same pre-annotation procedure you used. Would you be willing to share this part in more detail, ideally with the code?
I am doing this as part of a research project of the TrustHLT Group, and it would be extremely beneficial to us!
Thank you very much, Kai.
Hi Kai,
Sure, here is the Python code we used to pre-annotate the documents from the ECHR. As you can see, it essentially boils down to:
- Running spaCy to get named entities, plus a few simple regular expressions to detect codes and dates
- Correcting those entities with a few heuristics, and mapping the 18 OntoNotes categories to the privacy-oriented categories we had defined
The code is really tailored to ECHR documents and their formatting though, so I'm not sure how useful it would be for other domains, apart perhaps from the mapping between the OntoNotes named-entity categories and the more privacy-oriented categories from TAB.
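As a rough illustration of the two steps above (this is not the original TAB script), here is a minimal sketch using only the standard library. The label mapping, the two regular expressions, and the "regex wins on overlap" heuristic are all assumptions made for the example; the function names `regex_spans` and `pre_annotate` are hypothetical. The target labels are the TAB categories (PERSON, CODE, LOC, ORG, DEM, DATETIME, QUANTITY, MISC).

```python
import re

# ASSUMPTION: an illustrative mapping from the 18 OntoNotes labels to TAB
# categories. The mapping actually used for TAB may differ.
ONTONOTES_TO_TAB = {
    "PERSON": "PERSON", "ORG": "ORG",
    "GPE": "LOC", "LOC": "LOC", "FAC": "LOC",
    "NORP": "DEM", "LANGUAGE": "DEM",
    "DATE": "DATETIME", "TIME": "DATETIME",
    "MONEY": "QUANTITY", "PERCENT": "QUANTITY", "QUANTITY": "QUANTITY",
    "PRODUCT": "MISC", "EVENT": "MISC", "WORK_OF_ART": "MISC", "LAW": "MISC",
    "ORDINAL": None, "CARDINAL": None,  # dropped in this sketch
}

# ASSUMPTION: simplified patterns for application numbers such as "12345/06"
# and for dates such as "3 May 2006".
CODE_RE = re.compile(r"\b\d{1,6}/\d{2}\b")
DATE_RE = re.compile(
    r"\b\d{1,2} (?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December) \d{4}\b"
)


def regex_spans(text):
    """Return (start, end, label) spans found by the regular expressions."""
    spans = [(m.start(), m.end(), "CODE") for m in CODE_RE.finditer(text)]
    spans += [(m.start(), m.end(), "DATETIME") for m in DATE_RE.finditer(text)]
    return spans


def pre_annotate(text, ner_spans):
    """Merge regex spans with NER spans given as (start, end, ontonotes_label)."""
    from_regex = regex_spans(text)
    merged = list(from_regex)
    for start, end, label in ner_spans:
        tab_label = ONTONOTES_TO_TAB.get(label)
        if tab_label is None:
            continue  # unmapped or deliberately dropped label
        # Heuristic (assumption): a regex match takes priority over any
        # NER span that overlaps it.
        if any(start < e and s < end for s, e, _ in from_regex):
            continue
        merged.append((start, end, tab_label))
    return sorted(merged)
```

With spaCy, the `ner_spans` argument could be built from a processed document as `[(e.start_char, e.end_char, e.label_) for e in doc.ents]`, assuming a pipeline whose NER component was trained on OntoNotes labels.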
Pierre