presidio
Unexpected silent exits of presidio application
First of all, thanks for the great work on this project.
I am encountering the following problem: the Python app silently exits, nondeterministically, during a call to anonymize_text(). Setting the logging level to DEBUG shows the following:
DEBUG:presidio-analyzer:Returning a total of 10 recognizers
INFO:presidio-analyzer:Fetching all recognizers for language de
DEBUG:presidio-analyzer:Returning a total of 10 recognizers
And that is the last output before the application just returns to the command line. Other texts passed before are anonymized correctly.
- We do not have a custom analyzer, so this is out of the box
- Running with Python 3.12.3
- No error messages / stack trace shown
Any pointers / hints on what might cause this problem?
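Since no stack trace is shown, the exit status of the process may be the first useful clue. The following is a minimal sketch, not part of Presidio: it runs a command as a child process and describes how it ended. The helper names `describe_exit` and `run_and_report` are made up for illustration. On POSIX systems, `subprocess` reports a child killed by a signal (for example a segmentation fault in native code) as a negative return code.

```python
import signal
import subprocess
import sys


def describe_exit(returncode: int) -> str:
    """Turn a subprocess return code into a human-readable description."""
    if returncode == 0:
        return "exited normally"
    if returncode < 0:
        # On POSIX, subprocess encodes "terminated by signal N" as -N.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"killed by {name}"
    return f"exited with status {returncode}"


def run_and_report(argv: list[str]) -> str:
    """Run a command to completion and describe how it ended."""
    completed = subprocess.run(argv, check=False)
    return describe_exit(completed.returncode)


# Hypothetical usage, assuming the loop lives in a script called anonymize_job.py:
# print(run_and_report([sys.executable, "anonymize_job.py"]))
```

A result such as "killed by SIGSEGV" or "killed by SIGKILL" would point away from Python-level exceptions and towards a native crash or the operating system terminating the process.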
Hi, thanks for raising this. Would it be possible to create a slightly more detailed reproducible example? Is this running on pure Python, in Docker, or in pyspark?
Hi, it is really difficult / impossible to create a concise reproducible example, since it seems non-deterministic and I cannot share the data set. A bit more information:
- We are running pure Python (in a VS Code terminal)
- Base setup:
configuration = {
    "nlp_engine_name": "spacy",
    "models": [{"lang_code": "de", "model_name": "de_core_news_lg"}],
}

# Create NLP engine based on configuration
provider = NlpEngineProvider(nlp_configuration=configuration)
nlp_engine = provider.create_engine()

# the languages are needed to load country-specific recognizers
# for finding phones, passport numbers, etc.
analyzer = AnalyzerEngine(nlp_engine=nlp_engine,
                          supported_languages=["de"])


def anonymize_text(text: str) -> str:
    logger.info(f"Anonymizing text: {text}")
    analyzer_results = analyzer.analyze(text=text,
                                        language='de')
    logger.info(f"Anonymizer results: {analyzer_results}")
    engine = presidio_anonymizer.AnonymizerEngine()
    result = engine.anonymize(text=text, analyzer_results=analyzer_results)
    logger.info(result)

    # Restructuring anonymizer results
    anonymization_results = {
        "anonymized": result.text,
        "found": [entity.to_dict() for entity in analyzer_results],
    }
    return anonymization_results["anonymized"]
anonymize_text() is then basically called in a loop that fetches data from a SQL (MariaDB) table and writes the anonymized data into another table. Are there maybe any other trace options to get further output?
I also tried to check whether the problem is with one of the registered recognizers, excluding some of them with variations of
analyzer.registry.recognizers = analyzer.registry.recognizers[0:1]
to no avail.
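Regarding further trace options: if the process is dying inside native code rather than raising a Python exception, the standard-library `faulthandler` module can dump the Python traceback at the moment of a fatal signal. This is only a sketch under that assumption; the helper name `enable_crash_tracebacks` and the log file name are made up for illustration.

```python
import faulthandler


def enable_crash_tracebacks(log_path: str):
    """Write Python tracebacks of all threads to log_path on a fatal signal.

    Returns the open file object; it must stay open for as long as
    faulthandler is enabled, so keep a reference to it.
    """
    crash_log = open(log_path, "w")
    # Covers fatal signals such as SIGSEGV, SIGFPE, SIGABRT, SIGBUS and SIGILL.
    faulthandler.enable(file=crash_log, all_threads=True)
    return crash_log


# Hypothetical usage at the very top of the script, before the loop starts:
# crash_log = enable_crash_tracebacks("presidio_crash.log")
```

The same effect is available without code changes by starting the script with `python -X faulthandler` or by setting the `PYTHONFAULTHANDLER` environment variable, in which case the traceback goes to stderr. Note that this does not help if the process is killed with SIGKILL (for example by an out-of-memory killer), since that signal cannot be handled.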
Hi, I have the same issue: I'm using pure Python.
Below is the function that I'm using. It has worked once; the other times it fails at some point with no errors. The loop basically tries to run the scrubber on all message bodies inside a transcript object.
def scrub_transcript_messages(transcript, analyzer, anonymizer, entities=None):
    if "transcript" not in transcript or "messages" not in transcript["transcript"]:
        raise ValueError("Invalid transcript format. Expected 'transcript' key with 'messages' list.")

    if entities is None:
        entities = ["PHONE_NUMBER", "PERSON"]

    scrubbed_transcript = {"transcript": {"messages": []}}
    messages_list = transcript["transcript"]["messages"]

    for message in messages_list:
        scrubbed_message = message.copy()
        try:
            log.logger.info("Processing")
            print(message["body"])
            results = analyzer.analyze(
                text=message["body"],
                entities=entities,
                language='en'
            )
            anonymized_text = anonymizer.anonymize(
                text=message["body"],
                analyzer_results=results
            )
            scrubbed_message["body"] = anonymized_text.text
            log.logger.info("Scrubbed message")
            print(anonymized_text.text)
        except Exception as e:
            scrubbed_message["body"] = f"Error anonymizing message: {e}"
        scrubbed_transcript["transcript"]["messages"].append(scrubbed_message)

    return scrubbed_transcript
Thanks, we're trying to reproduce this. @janorivera in your case, I see that you're collecting exceptions into the body of the scrubbed message. Do you have instances where the scrubbed message contains an error and not the scrubbed text?
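One quick way to check this is to scan the output for bodies that start with the error prefix used in the except branch. The following is a minimal sketch, assuming the transcript structure from the function above; the helper name `find_failed_messages` is made up for illustration.

```python
ERROR_PREFIX = "Error anonymizing message:"


def find_failed_messages(scrubbed_transcript: dict) -> list[int]:
    """Return the indices of messages whose body holds a recorded exception."""
    messages = scrubbed_transcript["transcript"]["messages"]
    return [
        index
        for index, message in enumerate(messages)
        if str(message.get("body", "")).startswith(ERROR_PREFIX)
    ]
```

Keep in mind that `except Exception` only sees ordinary Python exceptions. If the interpreter is terminated by a signal or a crash in native code, nothing is recorded, so an empty result here does not rule out a failure.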
Also, it could be more scalable to use the BatchAnalyzerEngine and BatchAnonymizerEngine to run presidio on a list of texts. https://microsoft.github.io/presidio/samples/python/batch_processing/ and https://microsoft.github.io/presidio/api/analyzer_python/#presidio_analyzer.batch_analyzer_engine.BatchAnalyzerEngine.analyze_iterator
Could you please check if this happens with batch mode too?
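For reference, a batch run could look roughly like the sketch below. It assumes a recent Presidio release in which `BatchAnalyzerEngine` is importable from `presidio_analyzer` and `BatchAnonymizerEngine` from `presidio_anonymizer` (older versions may differ). The helper names `build_batch_engines` and `anonymize_texts` are made up for illustration, and the default language "de" simply mirrors the setup reported above.

```python
from typing import Any, Iterable, List


def build_batch_engines(analyzer: Any):
    """Wrap an existing AnalyzerEngine in Presidio's batch engines."""
    # Imported lazily so this module can be loaded without Presidio installed.
    from presidio_analyzer import BatchAnalyzerEngine
    from presidio_anonymizer import BatchAnonymizerEngine

    return BatchAnalyzerEngine(analyzer_engine=analyzer), BatchAnonymizerEngine()


def anonymize_texts(
    texts: Iterable[str],
    batch_analyzer: Any,
    batch_anonymizer: Any,
    language: str = "de",
) -> List[Any]:
    """Analyze and anonymize a collection of texts in one batch."""
    texts = list(texts)
    # One list of recognizer results per input text, in the same order.
    results = list(batch_analyzer.analyze_iterator(texts=texts, language=language))
    return batch_anonymizer.anonymize_list(
        texts=texts, recognizer_results_list=results
    )


# Hypothetical usage, with `analyzer` built as in the snippets in this thread:
# batch_analyzer, batch_anonymizer = build_batch_engines(analyzer)
# anonymized = anonymize_texts(rows, batch_analyzer, batch_anonymizer)
```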
@grafandreas I'm trying to reproduce your case. I'm using this code. Is it different in any way from yours?
from logging import getLogger

from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import NlpEngineProvider
import presidio_anonymizer

logger = getLogger()

configuration = {
    "nlp_engine_name": "spacy",
    "models": [{"lang_code": "de", "model_name": "de_core_news_lg"}],
}

# Create NLP engine based on configuration
provider = NlpEngineProvider(nlp_configuration=configuration)
nlp_engine = provider.create_engine()

# the languages are needed to load country-specific recognizers
# for finding phones, passport numbers, etc.
analyzer = AnalyzerEngine(nlp_engine=nlp_engine,
                          supported_languages=["de"])


def anonymize_text(text: str) -> str:
    logger.info(f"Anonymizing text: {text}")
    analyzer_results = analyzer.analyze(text=text,
                                        language='de')
    logger.info(f"Anonymizer results: {analyzer_results}")
    engine = presidio_anonymizer.AnonymizerEngine()
    result = engine.anonymize(text=text, analyzer_results=analyzer_results)
    logger.info(result)

    # Restructuring anonymizer results
    anonymization_results = {
        "anonymized": result.text,
        "found": [entity.to_dict() for entity in analyzer_results],
    }
    return anonymization_results["anonymized"]

text = """
Hier sind ein paar Beispielsätze, die wir derzeit unterstützen:
Hallo, mein Name ist David Johnson, und ich komme ursprünglich aus Liverpool.
Meine Kreditkartennummer ist 4095-2609-9393-4932, und meine Krypto-Wallet-ID ist 16Yeky6GMjeNkAiNcBY7ZhrLoMSgg1BoyZ.
Am 11.10.2024 habe ich www.microsoft.com besucht und eine E-Mail an [email protected] von der IP-Adresse 192.168.0.1 gesendet.
Mein Reisepass: 191280342 und meine Telefonnummer: (212) 555-1234.
Dies ist eine gültige internationale Bankkontonummer: IL150120690000003111111. Können Sie bitte den Status des Bankkontos 954567876544 überprüfen?
Kates Sozialversicherungsnummer ist 078-05-1126. Ihr Führerschein? Er lautet 1234567A.
"""

for i in range(100000):
    if i % 100 == 0:
        print(i)
    anonymize_text(text)
@omri374 Yes, that looks very much like the code I use, with the obvious exception of me using different texts.
Are the texts much longer? Do they contain non-Unicode values? Is there anything else that could be special about them? Are you running this in a certain compute environment?
@grafandreas are you able to provide additional information about the length of the texts and any special values in them?
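If the data itself cannot be shared, summary statistics about each text might still be shareable. The following is a minimal sketch using only the standard library; the helper name `profile_text` and the choice of properties are assumptions about what could be relevant here, not a known cause of the problem.

```python
import unicodedata


def profile_text(text: str) -> dict:
    """Summarize properties of a text without revealing its content."""
    return {
        "length": len(text),
        # surrogatepass lets lone surrogates be encoded instead of raising.
        "utf8_bytes": len(text.encode("utf-8", errors="surrogatepass")),
        "has_null": "\x00" in text,
        # Control characters other than the usual whitespace ones.
        "control_chars": sum(
            1
            for ch in text
            if unicodedata.category(ch) == "Cc" and ch not in "\n\r\t"
        ),
        # Lone surrogates often indicate a decoding problem upstream.
        "surrogates": sum(1 for ch in text if unicodedata.category(ch) == "Cs"),
    }
```

Logging this profile right before each analyze() call would show whether the last text processed before the exit is unusually long or contains unusual characters.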