Using Presidio with Huggingface support
Hi, currently I am using Presidio with spaCy and Stanza by creating an nlp_engine using NlpEngineProvider and passing it the correct model in the config. I was planning on adding support for HuggingFace transformer models, but I was a bit confused by the fact that there are two ways of doing this:
- Using a TransformersRecognizer
- Using a TransformersNlpEngine

As far as I understand, if you use the recognizer, you apply it on top of the usual pipeline (e.g. the spaCy NER pipeline), so you get results from both spaCy and the HuggingFace model. Using the TransformersNlpEngine, on the other hand, substitutes the transformers model for the spaCy NER component in the pipeline.

This example, https://microsoft.github.io/presidio/samples/python/transformers_recognizer/, shows how to use the TransformersRecognizer with a specific configuration (given as an example in configuration.py), where you define the MODEL_TO_PRESIDIO_MAPPING. If you use the TransformersNlpEngine instead, how are you supposed to map model entity types to Presidio types, similar to what is done in the TransformersRecognizer?

Is my understanding above right, and if so, is there a way to create an AnalyzerEngine with a TransformersNlpEngine using the same configuration as a TransformersRecognizer?
Thanks for the help!
Actually, after checking the source code more, it's not clear to me how one is supposed to use the TransformersNlpEngine. What is the TransformersComponent class used for in this case?

Using the TransformersRecognizer seems easier as there are more code examples, but is it advisable to use it over the TransformersNlpEngine?
Hi @Matei9721, thanks for your feedback! I can understand why this causes confusion. We initially wanted to support Huggingface the same way we support Stanza, but bumped into some issues. In the future, the plan is to integrate the new spacy-huggingface-pipelines package for a more seamless integration.
The easiest path forward, IMHO, is to use the TransformersRecognizer in parallel to the default SpacyNlpEngine. In our demo website's code, you'll find a method which does this. It uses the small spaCy model to reduce overhead (while maintaining capabilities like lemmas), and removes the SpacyRecognizer to avoid getting results from both spaCy and the transformers model. I'll paste it here too:
```python
from typing import Tuple

import spacy
from presidio_analyzer import RecognizerRegistry
from presidio_analyzer.nlp_engine import NlpEngine, NlpEngineProvider


def create_nlp_engine_with_transformers(
    model_path: str,
) -> Tuple[NlpEngine, RecognizerRegistry]:
    """
    Instantiate an NlpEngine with a TransformersRecognizer and a small spaCy model.

    The TransformersRecognizer would return results from Transformers models, the spaCy
    model would return NlpArtifacts such as POS and lemmas.
    :param model_path: HuggingFace model path.
    """
    from transformers_rec import (
        STANFORD_COFIGURATION,
        BERT_DEID_CONFIGURATION,
        TransformersRecognizer,
    )

    registry = RecognizerRegistry()
    registry.load_predefined_recognizers()

    if not spacy.util.is_package("en_core_web_sm"):
        spacy.cli.download("en_core_web_sm")

    # Using a small spaCy model + a HF NER model
    transformers_recognizer = TransformersRecognizer(model_path=model_path)
    if model_path == "StanfordAIMI/stanford-deidentifier-base":
        transformers_recognizer.load_transformer(**STANFORD_COFIGURATION)
    elif model_path == "obi/deid_roberta_i2b2":
        transformers_recognizer.load_transformer(**BERT_DEID_CONFIGURATION)
    else:
        print("Warning: Model has no configuration, loading default.")
        transformers_recognizer.load_transformer(**BERT_DEID_CONFIGURATION)

    # Use the small spaCy model, no need for both spaCy and HF NER models.
    # The transformers model is used here as a recognizer, not as an NlpEngine.
    nlp_configuration = {
        "nlp_engine_name": "spacy",
        "models": [{"lang_code": "en", "model_name": "en_core_web_sm"}],
    }

    registry.add_recognizer(transformers_recognizer)
    registry.remove_recognizer("SpacyRecognizer")

    nlp_engine = NlpEngineProvider(nlp_configuration=nlp_configuration).create_engine()

    return nlp_engine, registry
```
Hope this helps. We'll work on making this easier going forward.
Thank you for your swift reply @omri374, that's exactly what I ended up following! I just wanted to make sure that I am doing it in the "best" way possible and not reinventing the wheel. :) Looking forward to the spacy-huggingface-pipelines addition, as it seems to indeed streamline the process more.
I will close the issue as my questions were answered and it's clear how to approach the task now!
Dear @omri374 & @Matei9721 ,
Sorry to re-open this issue. The answers are really helpful.
After reviewing the demo website's code, I have the feeling that the TransformersRecognizer used there (coming from docs/samples/python/streamlit/transformers_rec/transformers_recognizer.py) is different from the one included in the package (in the predefined recognizers, presidio-analyzer/presidio_analyzer/predefined_recognizers/transformers_recognizer.py).

Am I wrong, or can I use the TransformersRecognizer from the predefined recognizers in the package in a very similar workflow to the one presented in the demo website's code?

Thanks in advance!
Hi @LSD-98, you are correct. There are essentially two flows here, and we're also about to improve the experience in the upcoming weeks, but in essence, the flows are:
- Use a NER model as part of the NlpEngine. This is how spaCy models are used by default. Entities are extracted during the NlpEngine phase and passed to the recognizers. The SpacyRecognizer collects those and returns a list of RecognizerResult. We extended this capability to support Huggingface/transformers models as well, which are used as part of a spaCy pipeline (see #887). This is where the TransformersRecognizer in the package gets into the picture: all it does is collect the entities already extracted by the model during the NlpEngine phase.
- In parallel, it is always possible to create new recognizers calling any model. The transformers recognizer sample on the demo site and in docs/samples follows this approach. During the call to the .analyze method, it calls the model to get the predictions. This allows the flexibility of calling 5 different models, or having models serving different languages.
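To make the second flow concrete, the core of such a recognizer is translating the model's raw label set into Presidio entity names, which is what the sample's MODEL_TO_PRESIDIO_MAPPING dictionary is for. Here is a minimal, self-contained sketch of that mapping step; the label names, scores, and the `to_presidio_entities` helper are illustrative, not Presidio's actual code:

```python
# Illustrative sketch: mapping raw NER labels from a transformers model to
# Presidio entity names, mirroring the MODEL_TO_PRESIDIO_MAPPING idea.
MODEL_TO_PRESIDIO_MAPPING = {
    "PER": "PERSON",
    "LOC": "LOCATION",
    "ORG": "ORGANIZATION",
    "DATE": "DATE_TIME",
    "MISC": "NRP",
}


def to_presidio_entities(model_predictions):
    """Translate (label, start, end, score) tuples into Presidio-style results."""
    results = []
    for label, start, end, score in model_predictions:
        entity = MODEL_TO_PRESIDIO_MAPPING.get(label)
        if entity is None:
            continue  # skip labels that have no Presidio mapping
        results.append(
            {"entity_type": entity, "start": start, "end": end, "score": score}
        )
    return results


# "PER" and "LOC" map to PERSON and LOCATION; the unmapped "FOO" label is dropped.
predictions = [("PER", 0, 4, 0.99), ("LOC", 10, 18, 0.95), ("FOO", 20, 23, 0.5)]
print(to_presidio_entities(predictions))
```

A real recognizer's .analyze method would run the model, apply a mapping like this, and wrap each item in a RecognizerResult.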
In essence:

Flow 1:

```mermaid
sequenceDiagram
    AnalyzerEngine->>SpacyNlpEngine: Call engine.process_text(text) <br>to get model results
    SpacyNlpEngine->>NamedEntityRecognitionModel: call spaCy NER model
    NamedEntityRecognitionModel->>SpacyNlpEngine: return PII entities
    SpacyNlpEngine->>AnalyzerEngine: Pass NlpArtifacts<BR>(Entities, lemmas, tokens etc.)
    Note over AnalyzerEngine: Call all recognizers
    AnalyzerEngine->>SpacyRecognizer: Pass NlpArtifacts
    Note over SpacyRecognizer: Extract PII entities out of NlpArtifacts
    SpacyRecognizer->>AnalyzerEngine: Return List[RecognizerResult]<BR>based on entities
```
Flow 2:

```mermaid
sequenceDiagram
    Note over AnalyzerEngine: Call all recognizers, <br>including <br>MyNerModelRecognizer
    AnalyzerEngine->>MyNerModelRecognizer: call .analyze
    MyNerModelRecognizer->>transformers_model: Call transformers model
    transformers_model->>MyNerModelRecognizer: get NER/PII entities
    MyNerModelRecognizer->>AnalyzerEngine: Return List[RecognizerResult] <br>of PII entities
```
Where MyNerModelRecognizer is a wrapper over an NLP library, similar to the transformers example and the flair example.
Reopening to improve logic and docs. Will be fixed in #1159
```python
nlp_configuration = {
    "nlp_engine_name": "spacy",
    "models": [{"lang_code": "en", "model_name": "en_core_web_sm"}],
}
```
@omri374 This seems to work for the current model and for English. However, when I want to use French HF models, I get the following error:

```
ValueError: No matching recognizers were found to serve the request.
```

These are the changes I made:
```python
from presidio_analyzer import AnalyzerEngine, RecognizerRegistry
from presidio_analyzer.nlp_engine import NlpEngineProvider
from transformers_recognizer import TransformersRecognizer
import spacy

FR_MODEL_CONF = {
    "PRESIDIO_SUPPORTED_ENTITIES": ["LOCATION", "PERSON", "ORGANIZATION", "DATE_TIME", "NRP"],
    "DEFAULT_MODEL_PATH": "Jean-Baptiste/camembert-ner-with-dates",
    "DATASET_TO_PRESIDIO_MAPPING": {"DATE": "DATE_TIME", "MISC": "NRP", "PER": "PERSON", "ORG": "ORGANIZATION", "LOC": "LOCATION"},
    "MODEL_TO_PRESIDIO_MAPPING": {"DATE": "DATE_TIME", "MISC": "NRP", "PER": "PERSON", "ORG": "ORGANIZATION", "LOC": "LOCATION"},
    "CHUNK_OVERLAP_SIZE": 40,
    "CHUNK_SIZE": 600,
    "ID_SCORE_MULTIPLIER": 0.4,
    "ID_ENTITY_NAME": "ID",
}

registry = RecognizerRegistry()
registry.load_predefined_recognizers()

if not spacy.util.is_package("fr_core_news_sm"):
    spacy.cli.download("fr_core_news_sm")

supported_entities = FR_MODEL_CONF.get("PRESIDIO_SUPPORTED_ENTITIES")
model = "Jean-Baptiste/camembert-ner-with-dates"
transformers_recognizer = TransformersRecognizer(
    model_path=model, supported_entities=supported_entities
)
transformers_recognizer.load_transformer(**FR_MODEL_CONF)

registry.add_recognizer(transformers_recognizer)
registry.remove_recognizer("SpacyRecognizer")

nlp_configuration = {
    "nlp_engine_name": "spacy",
    "models": [{"lang_code": "fr", "model_name": "fr_core_news_sm"}],
}
nlp_engine = NlpEngineProvider(nlp_configuration=nlp_configuration).create_engine()

analyzer = AnalyzerEngine(registry=registry, nlp_engine=nlp_engine)
results = analyzer.analyze(
    text="Je m'appelle jean-baptiste et j'habite à montréal depuis fevr 2012",
    language="fr",
    entities=["LOCATION", "PERSON", "ORGANIZATION", "DATE_TIME", "NRP"],
    return_decision_process=True,
)
for result in results:
    print(result)
    print(result.analysis_explanation)
```
Many thanks @omri374 for the reply, very clear.
I tried the same thing last week and had the exact same issue. I did not manage to solve it and moved to another project. I assume there will be an easier way to use HF models when #1159 is pushed!
Make sure you pass the language argument to the TransformersRecognizer: it defaults to supported_language="en", so a recognizer created without it will not be matched when you call analyze with language="fr".
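To illustrate why that error appears, here is a deliberately simplified stand-in for the registry logic (not Presidio's actual code; class and function names are made up): the analyzer only hands a request to recognizers whose supported language matches the request's language, and raises the "No matching recognizers" error when none do.

```python
# Simplified illustration of why analyzing with language="fr" fails when
# every registered recognizer defaults to language "en".
class FakeRecognizer:
    def __init__(self, name, supported_language="en"):
        self.name = name
        self.supported_language = supported_language


def get_recognizers(recognizers, language):
    """Return recognizers matching the request language, or raise like Presidio does."""
    matching = [r for r in recognizers if r.supported_language == language]
    if not matching:
        raise ValueError("No matching recognizers were found to serve the request.")
    return matching


registry = [FakeRecognizer("TransformersRecognizer")]  # language defaults to "en"
try:
    get_recognizers(registry, "fr")
except ValueError as e:
    print(e)  # reproduces the reported error

# Registering the recognizer for the right language fixes it:
registry = [FakeRecognizer("TransformersRecognizer", supported_language="fr")]
matching = get_recognizers(registry, "fr")
```

In the real code above, that corresponds to constructing the recognizer with the French language set (e.g. a supported-language argument of "fr") before adding it to the registry.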