NeMo-Curator
Extend support to non-English languages for PII Deidentifier
Is your feature request related to a problem? Please describe.
My team is removing PII from text data in Southeast Asian languages. When we run the PIIDeIdentifier on text in these languages, it throws the following error: ValueError: No matching recognizers were found to serve the request. It appears to support only English.
Describe the solution you'd like
It would be helpful if PII could be detected in Southeast Asian languages (e.g., Bahasa Indonesia, Thai, Vietnamese).
Describe alternatives you've considered
The underlying package is Presidio, which uses spaCy and Stanza NER models as part of its detection. Models that support some Southeast Asian languages are available for spaCy and Stanza, and they could be adapted for this use case.
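A minimal sketch of what this alternative might look like when calling Presidio directly, not through NeMo Curator's PIIDeIdentifier. The helper names `build_nlp_configuration` and `build_analyzer` are hypothetical, and the Stanza language codes below are assumptions; which languages actually ship NER models should be checked against the Stanza model list before relying on this.

```python
def build_nlp_configuration(engine_name, models):
    """Build the configuration dict passed to Presidio's NlpEngineProvider.

    `models` maps a language code to a model name (hypothetical helper).
    """
    return {
        "nlp_engine_name": engine_name,
        "models": [
            {"lang_code": lang, "model_name": name}
            for lang, name in models.items()
        ],
    }


def build_analyzer(configuration):
    """Create an AnalyzerEngine that serves every language in the config."""
    # Imported lazily so the config helper works without Presidio installed.
    from presidio_analyzer import AnalyzerEngine
    from presidio_analyzer.nlp_engine import NlpEngineProvider

    provider = NlpEngineProvider(nlp_configuration=configuration)
    nlp_engine = provider.create_engine()
    languages = [m["lang_code"] for m in configuration["models"]]
    return AnalyzerEngine(nlp_engine=nlp_engine, supported_languages=languages)


# Assumption: Stanza provides NER-capable models under these codes.
STANZA_MODELS = {"en": "en", "vi": "vi"}
config = build_nlp_configuration("stanza", STANZA_MODELS)

# With Presidio and the Stanza models installed, usage would be roughly:
#   analyzer = build_analyzer(config)
#   results = analyzer.analyze(text="...", language="vi")
```

Even with a non-English NLP engine loaded, many of Presidio's pattern-based recognizers are registered for English only, so language-specific recognizers would likely still need to be added for full coverage.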