llm-guard
How to Use Custom SpaCy Model (beki/en_spacy_pii_distilbert) with Anonymize and Sensitive Scanners
Hello llm_guard Team,
I've been exploring the use of custom models with the Anonymize and Sensitive scanners within the llm_guard library, as mentioned in the changelog for the latest release. Specifically, I'm interested in integrating the SpaCy model beki/en_spacy_pii_distilbert for PII detection tasks.
Objective
My goal is to leverage the beki/en_spacy_pii_distilbert model, which is not a traditional Hugging Face Transformer model but rather a SpaCy model, for enhanced PII detection accuracy and reduced latency as highlighted in your changelog.
Issue
I encountered difficulties when attempting to load and use this SpaCy model with the Anonymize scanner. Typically, the process for integrating models relies on specifying a model path or configuration that is compatible with Hugging Face's Transformer models. However, given that beki/en_spacy_pii_distilbert is a SpaCy model, the standard approach doesn't seem to apply.
Attempts
Here's an outline of my approach so far, based on the available documentation and examples:
Model Specification: Attempted to specify beki/en_spacy_pii_distilbert directly as a model path or through a configuration dictionary.
Custom Recognizer: Explored creating a custom recognizer to wrap the SpaCy model loading and analysis logic.
Adapter Pattern: Considered using an adapter to bridge the gap between the input/output formats expected by the llm_guard scanners and those of the SpaCy model.
The last approach is partly working, but I would like to know the best practice for using this model inside llm_guard:
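from llm_guard.input_scanners import Anonymize
from llm_guard.vault import Vault

# CustomSpacyRecognizer and CustomRecognizerAdapter are my own helper classes
# (not part of llm_guard); they wrap the SpaCy model and adapt its output.
custom_recognizer = CustomSpacyRecognizer()
adapter = CustomRecognizerAdapter(custom_recognizer=custom_recognizer)

vault = Vault()
scanner = Anonymize(
    vault=vault,
    language="en",
    use_faker=True,
    custom_recognizer=adapter,  # Passing the adapter as the custom recognizer
)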
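The two helper classes are not shown above. As a rough illustration of the adapter idea only, here is a minimal sketch of how entity spans from a SpaCy pipeline could be mapped to the entity names a scanner expects. Everything in it is an assumption made for illustration: the class name SpacyNerAdapter, the Detection record, the label mapping, and the fixed score (SpaCy entity spans carry no confidence score by default). It is not llm_guard's API and makes no claim about how llm_guard registers recognizers.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional

# Hypothetical mapping from the model's NER labels to the entity names
# used by the scanner. The labels the model actually emits should be checked.
DEFAULT_LABEL_MAP: Dict[str, str] = {
    "PER": "PERSON",
    "LOC": "LOCATION",
    "ORG": "ORGANIZATION",
    "DATE_TIME": "DATE_TIME",
}


@dataclass
class Detection:
    """One detected entity, expressed as character offsets into the text."""
    entity_type: str
    start: int
    end: int
    score: float


class SpacyNerAdapter:
    """Wraps any callable that returns a SpaCy-like Doc (an object with .ents)."""

    def __init__(
        self,
        nlp: Callable[[str], Any],
        label_map: Optional[Dict[str, str]] = None,
        default_score: float = 0.85,  # assumed constant; not produced by the model
    ) -> None:
        self.nlp = nlp
        self.label_map = label_map if label_map is not None else DEFAULT_LABEL_MAP
        self.default_score = default_score

    def analyze(self, text: str, entities: Optional[List[str]] = None) -> List[Detection]:
        """Run the pipeline and return detections, optionally filtered by entity type."""
        doc = self.nlp(text)
        results: List[Detection] = []
        for ent in doc.ents:
            # Unmapped labels are passed through unchanged.
            entity_type = self.label_map.get(ent.label_, ent.label_)
            if entities is not None and entity_type not in entities:
                continue
            results.append(
                Detection(entity_type, ent.start_char, ent.end_char, self.default_score)
            )
        return results
```

With SpaCy and the model package installed, the adapter would be constructed along the lines of SpacyNerAdapter(spacy.load("en_spacy_pii_distilbert")), where the installed package name is also an assumption. Keeping the label mapping outside the model wrapper means a replacement model with different labels only needs a new mapping.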
Could you provide guidance or examples on how to correctly integrate a SpaCy model like beki/en_spacy_pii_distilbert with the Anonymize and Sensitive scanners?
Thank you for developing llm_guard and for your support in enhancing its capabilities. I look forward to your advice on integrating SpaCy models for improved PII detection.
Best regards, Rakend
Hey @rakendd, thanks for reaching out. We used to have this model, but then realized that it blocked updates to the latest transformers because of its dependency on "spacy-transformers>=1.1.8,<1.2.0".
https://llm-guard.com/changelog/#030-2023-10-14
I think if this model can be updated, then we could make another custom recognizer or just use the spacy one like we did before.