presidio Docker support for Stanza NLP Engine

    I'm trying to start the analyzer api with the stanza engine.

conf/default.yaml:

nlp_engine_name: stanza
models:
  -
    lang_code: en
    model_name: en

Output:

root@27ed8ca2f545:/usr/bin/presidio-analyzer# pipenv run python app.py --host 0.0.0.0
2023-02-09 20:41:38,211 - presidio-analyzer - INFO - Starting analyzer engine
2023-02-09 20:41:38,214 - presidio-analyzer - INFO - nlp_engine not provided, creating default.
Traceback (most recent call last):
  File "/usr/bin/presidio-analyzer/app.py", line 130, in <module>
    server = Server()
  File "/usr/bin/presidio-analyzer/app.py", line 40, in __init__
    self.engine = AnalyzerEngine()
  File "/usr/bin/presidio-analyzer/presidio_analyzer/analyzer_engine.py", line 58, in __init__
    nlp_engine = provider.create_engine()
  File "/usr/bin/presidio-analyzer/presidio_analyzer/nlp_engine/nlp_engine_provider.py", line 81, in create_engine
    raise ValueError(
ValueError: NLP engine 'stanza' is not available. Make sure you have all required packages installed

I've tried manually installing stanza, both on a container level, and within an activated virtual environment:

# install for container
pipenv install stanza

# install for virtual env
source .venv/bin/activate
pipenv install stanza

But I get the same error. What am I missing here?

Originally posted by @edeak in https://github.com/microsoft/presidio/discussions/1027

Feb 12 '23 07:02 omri374

Hi @edeak, this is not yet supported, but it should be a simple extension. If you look at the Dockerfile.transformers file, adapting it to stanza should be relatively easy:

FROM python:3.9-slim

ARG NAME
ARG NLP_CONF_FILE=conf/stanza.yaml
ENV PIPENV_VENV_IN_PROJECT=1
ENV PIP_NO_CACHE_DIR=1
WORKDIR /usr/bin/${NAME}

COPY ./Pipfile* /usr/bin/${NAME}/
RUN pip install pipenv \
  && pipenv sync
RUN pipenv install spacy-stanza --skip-lock

# install nlp models specified in NLP_CONF_FILE (conf/stanza.yaml by default)
COPY ./install_nlp_models.py /usr/bin/${NAME}/
COPY ${NLP_CONF_FILE} /usr/bin/${NAME}/${NLP_CONF_FILE}

RUN pipenv run python install_nlp_models.py --conf_file ${NLP_CONF_FILE}

COPY . /usr/bin/${NAME}/
EXPOSE ${PORT}
CMD pipenv run python app.py --host 0.0.0.0

I haven't tested this, so if you give it a try, please reply to say whether it worked. A PR would be an absolutely fantastic addition.
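
To confirm from inside the container that the packages are visible to the interpreter pipenv uses, something like the following could help. It is only a sketch: it assumes the Stanza engine needs both the `stanza` and `spacy_stanza` modules to be importable (which the `spacy-stanza` install line above suggests), and `missing_packages` is a hypothetical helper name, not part of Presidio.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumption: these are the modules the Stanza NLP engine depends on.
    missing = missing_packages(["stanza", "spacy_stanza"])
    print("missing:", missing or "none")
```

Saved under any file name and run with `pipenv run python <file>`, it checks the same virtual environment that app.py starts in.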

Feb 12 '23 07:02 omri374

Hi @omri374

I was able to start my container with Stanza, but it takes a bit more than editing the Dockerfile. As I understand it, the appropriate NLP engine has to be passed to the app; otherwise the default spaCy engine is used. So in app.py line 41, I do

self.engine = AnalyzerEngine(nlp_engine={ENGINE})

where the engine could be StanzaNlpEngine for stanza and TransformersNlpEngine for transformers.
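
Instead of hard-coding the engine class, one option might be to let `NlpEngineProvider` build it from the conf file. This is a rough sketch, not tested against the Docker image: the `NLP_CONF_FILE` environment variable and the `resolve_conf_file` and `build_analyzer` helpers are hypothetical names, and it assumes the conf file and the required packages are already present in the container.

```python
import os
from pathlib import Path


def resolve_conf_file(default="conf/default.yaml"):
    """Pick the NLP conf path from the (hypothetical) NLP_CONF_FILE env var."""
    return Path(os.environ.get("NLP_CONF_FILE", default))


def build_analyzer(conf_file, supported_languages=("en",)):
    """Create an AnalyzerEngine whose NLP engine is defined by conf_file."""
    # Imported lazily so the helper above works without presidio installed.
    from presidio_analyzer import AnalyzerEngine
    from presidio_analyzer.nlp_engine import NlpEngineProvider

    provider = NlpEngineProvider(conf_file=str(conf_file))
    nlp_engine = provider.create_engine()
    return AnalyzerEngine(
        nlp_engine=nlp_engine,
        supported_languages=list(supported_languages),
    )
```

In app.py this would replace the plain `AnalyzerEngine()` call with `build_analyzer(resolve_conf_file())`, so the same code could serve spaCy, Stanza, or transformers depending on the conf file.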

Feb 12 '23 20:02 edeak

I will play with it, and maybe I can put together a PR that injects the right NLP engine based on the conf that is passed in.

Feb 12 '23 20:02 edeak

Yes, app.py has to be adapted for this too. This should be straightforward if you inject the conf file as you mentioned.

Feb 13 '23 17:02 omri374