NeMo-Curator

Extend support to non-English languages for PII Deidentifier

Open hamsarajan opened this issue 9 months ago • 1 comment

Is your feature request related to a problem? Please describe.

My team is currently working on removing PII from text data in South East Asian languages. When we use the PIIDeIdentifier on these languages, it throws the following error: ValueError: No matching recognizers were found to serve the request. It seems to support only English.

Describe the solution you'd like

It would be helpful if PII could be detected in South East Asian languages (e.g. Bahasa Indonesia, Thai, Vietnamese).

Describe alternatives you've considered

The underlying package is Presidio, which uses spaCy and Stanza NER models as part of its detection. Models that support some of the South East Asian languages are available in spaCy and Stanza, and they could be adapted for this use case.
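To make the alternative concrete, here is a minimal sketch of how Presidio itself can be pointed at non-English models, independent of NeMo-Curator. It relies on Presidio's `NlpEngineProvider` and `AnalyzerEngine`. The helper names (`build_nlp_configuration`, `build_analyzer`, `SEA_MODELS`) are hypothetical, and the Stanza package names `id`, `th`, and `vi` are assumptions: whether each package ships an NER model should be checked against the Stanza model list. How this would be wired into the PII deidentifier is not covered here.

```python
from typing import Dict


def build_nlp_configuration(models: Dict[str, str], engine_name: str = "stanza") -> dict:
    """Return a config dict in the shape Presidio's NlpEngineProvider accepts.

    `models` maps a language code to a model/package name (assumed values).
    """
    return {
        "nlp_engine_name": engine_name,
        "models": [
            {"lang_code": lang, "model_name": name} for lang, name in models.items()
        ],
    }


def build_analyzer(models: Dict[str, str], engine_name: str = "stanza"):
    """Create a Presidio AnalyzerEngine that accepts the given languages."""
    # Imported lazily so the config helper above works without Presidio installed.
    from presidio_analyzer import AnalyzerEngine
    from presidio_analyzer.nlp_engine import NlpEngineProvider

    provider = NlpEngineProvider(
        nlp_configuration=build_nlp_configuration(models, engine_name)
    )
    nlp_engine = provider.create_engine()
    # supported_languages must list every language code the analyzer will be asked for.
    return AnalyzerEngine(nlp_engine=nlp_engine, supported_languages=list(models))


# Assumed Stanza package names for Indonesian, Thai, and Vietnamese.
SEA_MODELS = {"id": "id", "th": "th", "vi": "vi"}
```

With Presidio and the Stanza extra installed, `build_analyzer(SEA_MODELS).analyze(text=..., language="id")` would then run detection with the Indonesian pipeline. Note that most of Presidio's predefined pattern recognizers target English, so language-specific recognizers may still need to be added to the registry.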

hamsarajan · Feb 18 '25 10:02