
[BUG] Machinelearning crashing on k8s deployment v1.57.1 - v1.62.0

Open gcarrarom opened this issue 1 year ago • 22 comments

The bug

There's a bug when using version 1.56.1 on Kubernetes with the official Helm chart:

zsh ⌁ klf immich-machine-learning-54b5766488-kx4b4
Traceback (most recent call last):
  File "/opt/venv/lib/python3.10/site-packages/insightface/__init__.py", line 8, in <module>
    import onnxruntime
  File "/opt/venv/lib/python3.10/site-packages/onnxruntime/__init__.py", line 55, in <module>
    raise import_capi_exception
  File "/opt/venv/lib/python3.10/site-packages/onnxruntime/__init__.py", line 23, in <module>
    from onnxruntime.capi._pybind_state import (
  File "/opt/venv/lib/python3.10/site-packages/onnxruntime/capi/_pybind_state.py", line 33, in <module>
    from .onnxruntime_pybind11_state import *  # noqa
ImportError: /opt/venv/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_pybind11_state.cpython-310-x86_64-linux-gnu.so: cannot enable executable stack as shared object requires: Permission denied

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/src/app/src/main.py", line 6, in <module>
    from insightface.app import FaceAnalysis
  File "/opt/venv/lib/python3.10/site-packages/insightface/__init__.py", line 10, in <module>
    raise ImportError(
ImportError: Unable to import dependency onnxruntime.

The OS that Immich Server is running on

Kubernetes - k3s - MicroOS

Version of Immich Server

v1.56.1

Version of Immich Mobile App

N/A

Platform with the issue

  • [X] Server
  • [ ] Web
  • [ ] Mobile

Your docker-compose.yml content

postgresql:
      enabled: true
redis:
  enabled: true

typesense:
  enabled: true
  persistence:
    tsdata:
      enabled: true
      existingClaim: typesense-data

machine-learning:
  persistence:
    cache:
      enabled: true
      existingClaim: machinelearning-data
proxy:
  ingress:
    main:
      enabled: true
      ingressClassName: nginx
      annotations:
        nginx.ingress.kubernetes.io/proxy-body-size: "0"
        cert-manager.io/cluster-issuer: letsencrypt
      hosts:
        - host: my.domain.com
          paths:
            - path: "/"
      tls:
        - hosts:
            - my.domain.com
          secretName: my-domain-com
image:
  tag: v1.56.1
immich:
  persistence:
    library:
      existingClaim: photos

Your .env content

postgresql:
      enabled: true
redis:
  enabled: true

typesense:
  enabled: true
  persistence:
    tsdata:
      enabled: true
      existingClaim: typesense-data

machine-learning:
  persistence:
    cache:
      enabled: true
      existingClaim: machinelearning-data
proxy:
  ingress:
    main:
      enabled: true
      ingressClassName: nginx
      annotations:
        nginx.ingress.kubernetes.io/proxy-body-size: "0"
        cert-manager.io/cluster-issuer: letsencrypt
      hosts:
        - host: my.domain.com
          paths:
            - path: "/"
      tls:
        - hosts:
            - my.domain.com
          secretName: my-domain-com
image:
  tag: v1.56.1
immich:
  persistence:
    library:
      existingClaim: photos

Reproduction steps

1. Deploy the helm chart on that version
2. Wait for all pods to come up and machinelearning to crash.

Additional information

No response

gcarrarom avatar May 19 '23 15:05 gcarrarom

Just to update: I've now rolled back to 1.56.0 and it's working flawlessly. It's probably a bug introduced on 1.56.1.

gcarrarom avatar May 19 '23 15:05 gcarrarom

Hmm, from 1.56.0 to 1.56.1 we only changed the server and web-related code 🤔

alextran1502 avatar May 19 '23 15:05 alextran1502

There have been a few reports of related onnxruntime errors that were fixed by deleting the machine learning cache volume. Rolling back versions might have done that in your situation.

jrasm91 avatar May 19 '23 17:05 jrasm91

Great to know, I'll try pushing 1.56.1 again and clear the cache. I should report back in a few hours.

gcarrarom avatar May 19 '23 17:05 gcarrarom

Odd, Just upgraded to 1.56.1 and still the same error. I've removed the emptyDir cache folder and the error persists. Tried creating it using another storage class and same issue. Could it be something else in another directory? Same error here:

zsh ⌁ klf immich-machine-learning-5d4b859887-zc27z 
Traceback (most recent call last):
  File "/opt/venv/lib/python3.10/site-packages/insightface/__init__.py", line 8, in <module>
    import onnxruntime
  File "/opt/venv/lib/python3.10/site-packages/onnxruntime/__init__.py", line 55, in <module>
    raise import_capi_exception
  File "/opt/venv/lib/python3.10/site-packages/onnxruntime/__init__.py", line 23, in <module>
    from onnxruntime.capi._pybind_state import (
  File "/opt/venv/lib/python3.10/site-packages/onnxruntime/capi/_pybind_state.py", line 33, in <module>
    from .onnxruntime_pybind11_state import *  # noqa
ImportError: /opt/venv/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_pybind11_state.cpython-310-x86_64-linux-gnu.so: cannot enable executable stack as shared object requires: Permission denied

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/src/app/src/main.py", line 6, in <module>
    from insightface.app import FaceAnalysis
  File "/opt/venv/lib/python3.10/site-packages/insightface/__init__.py", line 10, in <module>
    raise ImportError(
ImportError: Unable to import dependency onnxruntime

gcarrarom avatar May 19 '23 17:05 gcarrarom

Do you have selinux enabled? From a bit of googling it seems like that could cause the error you're getting.
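For anyone who wants to check this quickly, here is a minimal sketch that reads the SELinux mode from selinuxfs. It assumes the standard `/sys/fs/selinux/enforce` location and should be run on the host node rather than inside the container, where selinuxfs is often not mounted. The function name `selinux_mode` is just an illustrative helper, not part of Immich.

```python
from pathlib import Path


def selinux_mode(enforce_path: str = "/sys/fs/selinux/enforce") -> str:
    """Return 'enforcing', 'permissive', or 'disabled'.

    The enforce file contains "1" when SELinux is enforcing and "0" when it
    is permissive. If the file is missing, selinuxfs is not mounted, which
    is treated here as SELinux being disabled (an assumption of this sketch).
    """
    try:
        value = Path(enforce_path).read_text().strip()
    except FileNotFoundError:
        return "disabled"
    return "enforcing" if value == "1" else "permissive"


if __name__ == "__main__":
    print(selinux_mode())
```

If this reports `enforcing`, the audit log on the host is the next place to look for a matching denial.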

bo0tzz avatar May 19 '23 18:05 bo0tzz

I've also had problems after updating to 1.56.1

DrSpaldo avatar May 20 '23 03:05 DrSpaldo

#2487 should fix this I believe

alextran1502 avatar May 20 '23 03:05 alextran1502

Amazing! 1.56.2 fixed it! Thank you!

gcarrarom avatar May 21 '23 01:05 gcarrarom

Sadly I need to reopen this bug for 1.57.1. Same error. Any ideas?

gcarrarom avatar May 23 '23 19:05 gcarrarom

Can you try removing the model cache, starting the pod, and letting it finish downloading the model before using it?

alextran1502 avatar May 24 '23 20:05 alextran1502

So, removed the files from the cache portion of the k8s deployment and the same error is happening with the ephemeral storage. It seems odd to run into such errors even though there is no cache whatsoever...

Traceback (most recent call last):
  File "/opt/venv/lib/python3.10/site-packages/insightface/__init__.py", line 8, in <module>
    import onnxruntime
  File "/opt/venv/lib/python3.10/site-packages/onnxruntime/__init__.py", line 55, in <module>
    raise import_capi_exception
  File "/opt/venv/lib/python3.10/site-packages/onnxruntime/__init__.py", line 23, in <module>
    from onnxruntime.capi._pybind_state import (
  File "/opt/venv/lib/python3.10/site-packages/onnxruntime/capi/_pybind_state.py", line 33, in <module>
    from .onnxruntime_pybind11_state import *  # noqa
ImportError: /opt/venv/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_pybind11_state.cpython-310-x86_64-linux-gnu.so: cannot enable executable stack as shared object requires: Permission denied

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/src/app/src/main.py", line 6, in <module>
    from insightface.app import FaceAnalysis
  File "/opt/venv/lib/python3.10/site-packages/insightface/__init__.py", line 10, in <module>
    raise ImportError(
ImportError: Unable to import dependency onnxruntime.

gcarrarom avatar May 24 '23 20:05 gcarrarom

Just did the update to 1.60.0 and it's still running into the same issue.

~~This permission denied issue makes me think it might be the permissions of the downloaded files. I'll look into the user that downloads the models and see if there's something going on there.~~ Never mind. The user seems to have all the permissions it needs. I'll try to debug more tonight.

gcarrarom avatar Jun 05 '23 13:06 gcarrarom

I have been getting this issue for the past few weeks as well. My server is still on 1.55.1, the last working version for me. I think @bo0tzz may be correct about SELinux permissions, as I do have SELinux enabled on my machines. What changed in between 1.55.1 and future versions that could cause this? Unfortunately disabling SELinux is not an option for me just to solve this one issue.

geraldwuhoo avatar Jun 07 '23 03:06 geraldwuhoo

Same happening with v1.61.0:

zsh ⌁ kgp                                          
NAME                                       READY   STATUS             RESTARTS         AGE
immich-machine-learning-6978bfdbdf-z8www   0/1     CrashLoopBackOff   2 (25s ago)      70s
Traceback (most recent call last):
  File "/opt/venv/lib/python3.11/site-packages/insightface/__init__.py", line 8, in <module>
    import onnxruntime
  File "/opt/venv/lib/python3.11/site-packages/onnxruntime/__init__.py", line 55, in <module>
    raise import_capi_exception
  File "/opt/venv/lib/python3.11/site-packages/onnxruntime/__init__.py", line 23, in <module>
    from onnxruntime.capi._pybind_state import ExecutionMode  # noqa: F401
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.11/site-packages/onnxruntime/capi/_pybind_state.py", line 33, in <module>
    from .onnxruntime_pybind11_state import *  # noqa
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ImportError: /opt/venv/lib/python3.11/site-packages/onnxruntime/capi/onnxruntime_pybind11_state.cpython-311-x86_64-linux-gnu.so: cannot enable executable stack as shared object requires: Permission denied

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/src/app/main.py", line 4, in <module>
    from cache import ModelCache
  File "/usr/src/app/cache.py", line 5, in <module>
    from models import get_model
  File "/usr/src/app/models.py", line 2, in <module>
    from insightface.app import FaceAnalysis
  File "/opt/venv/lib/python3.11/site-packages/insightface/__init__.py", line 10, in <module>
    raise ImportError(
ImportError: Unable to import dependency onnxruntime. 

gcarrarom avatar Jun 16 '23 15:06 gcarrarom

> What changed in between 1.55.1 and future versions that could cause this?

v1.56.0 introduced face recognition, which I believe is what added the onnxruntime dependency.

bo0tzz avatar Jun 16 '23 15:06 bo0tzz

I am seeing a different error message when starting immich-machine-learning container (v1.61.0):

python: can't open file '/usr/src/app/src/main.py': [Errno 2] No such file or directory?

Is this a new issue?

nohitme avatar Jun 16 '23 15:06 nohitme

@nohitme that is an unrelated issue. Please make sure you're using the latest image and docker-compose.yml, and open a support thread in Discord or the GitHub Discussions if you still have trouble.

bo0tzz avatar Jun 16 '23 16:06 bo0tzz

Understood, it could be a separate issue. I will verify it separately on the latest image (I'm fairly sure I was already using it, though) and report it if it persists.

Thanks for the reply!

nohitme avatar Jun 16 '23 18:06 nohitme

Interesting.. Freshly built container image for machine learning from the main branch:

zsh ⌁ docker run -it test                    
Traceback (most recent call last):
  File "/opt/venv/lib/python3.11/site-packages/insightface/__init__.py", line 8, in <module>
    import onnxruntime
  File "/opt/venv/lib/python3.11/site-packages/onnxruntime/__init__.py", line 55, in <module>
    raise import_capi_exception
  File "/opt/venv/lib/python3.11/site-packages/onnxruntime/__init__.py", line 23, in <module>
    from onnxruntime.capi._pybind_state import ExecutionMode  # noqa: F401
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.11/site-packages/onnxruntime/capi/_pybind_state.py", line 33, in <module>
    from .onnxruntime_pybind11_state import *  # noqa
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ImportError: /opt/venv/lib/python3.11/site-packages/onnxruntime/capi/onnxruntime_pybind11_state.cpython-311-x86_64-linux-gnu.so: cannot enable executable stack as shared object requires: Permission denied

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/src/app/main.py", line 5, in <module>
    from cache import ModelCache
  File "/usr/src/app/cache.py", line 5, in <module>
    from models import get_model
  File "/usr/src/app/models.py", line 2, in <module>
    from insightface.app import FaceAnalysis
  File "/opt/venv/lib/python3.11/site-packages/insightface/__init__.py", line 10, in <module>
    raise ImportError(
ImportError: Unable to import dependency onnxruntime. 

So it's not k8s-specific. I'll remove the multi-stage build to check whether something is missing or a permission mismatch is introduced during the container build.

gcarrarom avatar Jun 19 '23 17:06 gcarrarom

Same error building with a simple pip install of the requirements. This is the container image I'm using:

FROM python:3.11.4-bullseye@sha256:5b401676aff858495a5c9c726c60b8b73fe52833e9e16eccdb59e93d52741727

ENV NODE_ENV=production \
  TRANSFORMERS_CACHE=/cache \
  PYTHONDONTWRITEBYTECODE=1 \
  PYTHONUNBUFFERED=1 \
  PATH="/opt/venv/bin:$PATH" \
  PYTHONPATH=`pwd` \
  PIP_NO_CACHE_DIR=true

WORKDIR /usr/src/app

COPY ./requirements.txt .
COPY app .
RUN pip install -r requirements.txt

ENTRYPOINT ["python", "main.py"]

It runs into the same problem. Here's the directory containing the library it's trying to load; it's owned by the root user:

zsh ⌁ docker run --entrypoint "" -it test bash     
root@41d3bfede331:/usr/src/app# cd /usr/local/lib/python3.11/site-packages/onnxruntime/capi/
root@41d3bfede331:/usr/local/lib/python3.11/site-packages/onnxruntime/capi# ls -al
total 14136
drwxr-xr-x. 1 root root      490 Jun 19 18:07 .
drwxr-xr-x. 1 root root      216 Jun 19 18:07 ..
-rw-r--r--. 1 root root      247 Jun 19 18:07 __init__.py
drwxr-xr-x. 1 root root      424 Jun 19 18:07 __pycache__
-rw-r--r--. 1 root root      406 Jun 19 18:07 _ld_preload.py
-rw-r--r--. 1 root root     1510 Jun 19 18:07 _pybind_state.py
-rwxr-xr-x. 1 root root    14216 Jun 19 18:07 libonnxruntime_providers_shared.so
-rw-r--r--. 1 root root     3965 Jun 19 18:07 onnxruntime_collect_build_info.py
-rw-r--r--. 1 root root    38714 Jun 19 18:07 onnxruntime_inference_collection.py
-rw-r--r--. 1 root root 14392120 Jun 19 18:07 onnxruntime_pybind11_state.cpython-311-x86_64-linux-gnu.so
-rw-r--r--. 1 root root     6237 Jun 19 18:07 onnxruntime_validation.py
drwxr-xr-x. 1 root root       44 Jun 19 18:07 training
root@41d3bfede331:/usr/local/lib/python3.11/site-packages/onnxruntime/capi# whoami
root

The onnxruntime_pybind11_state.cpython-311-x86_64-linux-gnu.so file is not executable though, that might be the problem.
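As a side note, the error text talks about an executable stack rather than the file mode, so it may be worth checking whether the shared object itself asks for one. The sketch below is only an illustration under that assumption: it parses the ELF program headers and looks at the `PT_GNU_STACK` entry. The helper name `requests_exec_stack` is made up for this example.

```python
import struct

PT_GNU_STACK = 0x6474E551  # program header type that carries stack permissions
PF_X = 0x1                 # "executable" bit in p_flags


def requests_exec_stack(data: bytes):
    """Inspect raw ELF bytes.

    Returns True if PT_GNU_STACK is present with the executable bit set,
    False if it is present without it, and None if the header is absent
    (how a missing header is treated depends on the loader and platform).
    """
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    is_64bit = data[4] == 2                 # EI_CLASS: 1 = 32-bit, 2 = 64-bit
    endian = "<" if data[5] == 1 else ">"   # EI_DATA: 1 = little, 2 = big

    if is_64bit:
        (phoff,) = struct.unpack_from(endian + "Q", data, 0x20)
        phentsize, phnum = struct.unpack_from(endian + "HH", data, 0x36)
        flags_offset = 4    # p_flags follows p_type in 64-bit headers
    else:
        (phoff,) = struct.unpack_from(endian + "I", data, 0x1C)
        phentsize, phnum = struct.unpack_from(endian + "HH", data, 0x2A)
        flags_offset = 24   # p_flags is the seventh field in 32-bit headers

    for index in range(phnum):
        base = phoff + index * phentsize
        (p_type,) = struct.unpack_from(endian + "I", data, base)
        if p_type == PT_GNU_STACK:
            (p_flags,) = struct.unpack_from(endian + "I", data, base + flags_offset)
            return bool(p_flags & PF_X)
    return None
```

It could be used as `requests_exec_stack(open(path, "rb").read())` with `path` pointing at the `onnxruntime_pybind11_state` `.so` file. A `True` result would be consistent with the loader trying to make the stack executable and being refused.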

gcarrarom avatar Jun 19 '23 18:06 gcarrarom

The only thing I can see affecting this now is SELinux on the host running the container runtime for k3s. It makes me wonder what exactly this package is trying to access.

EDIT: I mean the package from onnxruntime. I'm trying to build it using their base image to account for that portion before building the python packages of this immich machine-learning image. Oddly enough their process to build is not working as intended. I will try to continue troubleshooting tomorrow.

gcarrarom avatar Jun 19 '23 18:06 gcarrarom

Error is slightly different now from the new version thanks to the update from #2951

zsh ⌁ klf immich-machine-learning-668757f9b6-jgmsq 
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/usr/src/app/main.py", line 12, in <module>
    from .models.base import InferenceModel
  File "/usr/src/app/models/__init__.py", line 1, in <module>
    from .clip import CLIPSTEncoder
  File "/usr/src/app/models/clip.py", line 8, in <module>
    from .base import InferenceModel
  File "/usr/src/app/models/base.py", line 8, in <module>
    from onnxruntime.capi.onnxruntime_pybind11_state import InvalidProtobuf  # type: ignore
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.11/site-packages/onnxruntime/__init__.py", line 55, in <module>
    raise import_capi_exception
  File "/opt/venv/lib/python3.11/site-packages/onnxruntime/__init__.py", line 23, in <module>
    from onnxruntime.capi._pybind_state import ExecutionMode  # noqa: F401
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.11/site-packages/onnxruntime/capi/_pybind_state.py", line 33, in <module>
    from .onnxruntime_pybind11_state import *  # noqa
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ImportError: /opt/venv/lib/python3.11/site-packages/onnxruntime/capi/onnxruntime_pybind11_state.cpython-311-x86_64-linux-gnu.so: cannot enable executable stack as shared object requires: Permission denied

I will try to make a few tweaks on the fsgroup in k8s and see if it helps.

gcarrarom avatar Jun 30 '23 15:06 gcarrarom

Since this is SELinux not liking a dependency that I think we can't really do without (cc @mertalev?), I don't believe there is much we can do about this from the Immich side.

bo0tzz avatar Jun 30 '23 16:06 bo0tzz

Kinda? I mean, the files are labeled as such in the container by default:

root@code-658d97b879-j2f6g:/usr/src/app# ls -Z /opt/venv/lib/python3.11/site-packages/onnxruntime/capi/
system_u:object_r:var_lib_t:s0 __init__.py                         system_u:object_r:var_lib_t:s0 onnxruntime_inference_collection.py
system_u:object_r:var_lib_t:s0 _ld_preload.py                      system_u:object_r:var_lib_t:s0 onnxruntime_pybind11_state.cpython-311-x86_64-linux-gnu.so
system_u:object_r:var_lib_t:s0 _pybind_state.py                    system_u:object_r:var_lib_t:s0 onnxruntime_validation.py
system_u:object_r:var_lib_t:s0 libonnxruntime_providers_shared.so  system_u:object_r:var_lib_t:s0 training
system_u:object_r:var_lib_t:s0 onnxruntime_collect_build_info.py

Sorry, now that I think about it, those labels are probably coming from the installation of the onnx dotnet runtime. Problem is how it gets flagged on the selinux level at the host:

type=AVC msg=audit(1688149775.122:26827): avc:  denied  { execstack } for  pid=13404 comm="python" scontext=system_u:system_r:unconfined_service_t:s0 tcontext=system_u:system_r:unconfined_service_t:s0 tclass=process permissive=0
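To make denials like this easier to scan, here is a small sketch that pulls the interesting fields out of an AVC audit line. It only handles the `key=value` layout seen in the line above, and `parse_avc` is a hypothetical helper written for this thread, not a function of any audit library.

```python
import re


def parse_avc(line: str) -> dict:
    """Extract the denied permissions and key=value fields from an AVC line."""
    denied = re.search(r"avc:\s+denied\s+\{\s*(?P<perms>[^}]*?)\s*\}", line)
    if denied is None:
        raise ValueError("not an AVC denial line")

    # Collect key=value pairs; values are either quoted or run up to whitespace.
    fields = {
        key: value.strip('"')
        for key, value in re.findall(r'(\w+)=("[^"]*"|\S+)', line)
    }
    fields["permissions"] = denied.group("perms").split()
    return fields


if __name__ == "__main__":
    sample = (
        'type=AVC msg=audit(1688149775.122:26827): avc:  denied  { execstack } '
        'for  pid=13404 comm="python" '
        "scontext=system_u:system_r:unconfined_service_t:s0 "
        "tcontext=system_u:system_r:unconfined_service_t:s0 "
        "tclass=process permissive=0"
    )
    info = parse_avc(sample)
    # The third colon-separated part of an SELinux context is the type (domain).
    print(info["permissions"], info["comm"], info["scontext"].split(":")[2])
```

In this case the denied permission is `execstack` and the source domain is `unconfined_service_t`, which matches the ImportError seen in the pod logs.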

I guess we could get the entrypoint of the container to change it? But that would mean running some sort of init container that could re-label those. I haven't had much time to look into it, sorry, but maybe I could play around with those labels and get a workaround for us.

gcarrarom avatar Jun 30 '23 18:06 gcarrarom

Aha! It seems k3s doesn't enable SELinux integration by default:

system_u:system_r:unconfined_service_t:s0 30095 ? 00:01:23 longhorn

All pods are coming up as unconfined_service_t. This seems to be fixed by enabling the configuration at the node level: https://github.com/k3s-io/k3s/issues/533
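A rough way to spot affected processes on a node is to filter process labels by domain. The sketch below assumes the column layout of `ps -eZ` output (label, PID, TTY, time, command), as in the line above; `unconfined_processes` is an illustrative name only.

```python
def unconfined_processes(ps_output: str, domain: str = "unconfined_service_t") -> list[str]:
    """Return the commands whose SELinux label has the given type (domain).

    Expects text in the column layout of `ps -eZ`: LABEL PID TTY TIME CMD.
    Lines that do not fit that layout, such as the header, are skipped.
    """
    found = []
    for line in ps_output.splitlines():
        parts = line.split(None, 4)
        if len(parts) < 5:
            continue
        label, _pid, _tty, _time, command = parts
        label_fields = label.split(":")
        # An SELinux label is user:role:type:level; the type is the third field.
        if len(label_fields) >= 3 and label_fields[2] == domain:
            found.append(command)
    return found


if __name__ == "__main__":
    sample = "system_u:system_r:unconfined_service_t:s0 30095 ? 00:01:23 longhorn"
    print(unconfined_processes(sample))
```

Container processes showing up in this list would suggest the runtime is not applying container labels on that node.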

It's odd that this isn't done by default. I'll look into it and report back here as a reference for anyone else running into this.

gcarrarom avatar Jun 30 '23 18:06 gcarrarom

I can confirm: adding a proper label to the Kubernetes containers allowed the execution to work properly. My instance is now running just fine for all machine learning tasks.

Thanks very much for the amazing project!

gcarrarom avatar Jun 30 '23 20:06 gcarrarom

I am running the stack via docker-compose and I am using the latest docker-compose.yml. I am experiencing the same issue as described above.

The immich-machine-learning container runs into this issue at startup:

Traceback (most recent call last):
  File "/opt/venv/lib/python3.10/site-packages/insightface/__init__.py", line 8, in <module>
    import onnxruntime
  File "/opt/venv/lib/python3.10/site-packages/onnxruntime/__init__.py", line 55, in <module>
    raise import_capi_exception
  File "/opt/venv/lib/python3.10/site-packages/onnxruntime/__init__.py", line 23, in <module>
    from onnxruntime.capi._pybind_state import (
  File "/opt/venv/lib/python3.10/site-packages/onnxruntime/capi/_pybind_state.py", line 33, in <module>
    from .onnxruntime_pybind11_state import *  # noqa
ImportError: /opt/venv/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_pybind11_state.cpython-310-x86_64-linux-gnu.so: cannot enable executable stack as shared object requires: Permission denied

I am not really sure how I can solve this issue.

nkay08 avatar Dec 27 '23 13:12 nkay08