immich
[BUG] Machinelearning crashing on k8s deployment v1.57.1 - v1.62.0
The bug
There's a bug when using version 1.56.1 on kubernetes using the official helm chart:
zsh ➜ klf immich-machine-learning-54b5766488-kx4b4
Traceback (most recent call last):
File "/opt/venv/lib/python3.10/site-packages/insightface/__init__.py", line 8, in <module>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/src/app/src/main.py", line 6, in <module>
The OS that Immich Server is running on
Kubernetes - k3s - MicroOS
Version of Immich Server
v1.56.1
Version of Immich Mobile App
N/A
Platform with the issue
- [X] Server
- [ ] Web
- [ ] Mobile
Your docker-compose.yml content
postgresql:
  enabled: true
redis:
  enabled: true
typesense:
  enabled: true
  persistence:
    tsdata:
      enabled: true
      existingClaim: typesense-data
machine-learning:
  persistence:
    cache:
      enabled: true
      existingClaim: machinelearning-data
proxy:
  ingress:
    main:
      enabled: true
      ingressClassName: nginx
      annotations:
        nginx.ingress.kubernetes.io/proxy-body-size: "0"
        cert-manager.io/cluster-issuer: letsencrypt
      hosts:
        - host: my.domain.com
          paths:
            - path: "/"
      tls:
        - hosts:
            - my.domain.com
          secretName: my-domain-com
image:
  tag: v1.56.1
immich:
  persistence:
    library:
      existingClaim: photos
Your .env content
(Same Helm values as above.)
Reproduction steps
1. Deploy the helm chart on that version
2. Wait for all pods to come up and machinelearning to crash.
Additional information
No response
Just to update: I've now rolled back to 1.56.0 and it's working flawlessly. It's probably a bug introduced on 1.56.1.
Hmm, from 1.56.0 to 1.56.1 we only changed the server and web-related code.
There have been a few reports of related ONNX Runtime errors that were fixed by deleting the machine learning cache volume. Rolling back versions might have done that in your situation.
Great to know, I'll try pushing 1.56.1 again and clear the cache. I should report back in a few hours.
Odd. I just upgraded to 1.56.1 and still get the same error. I removed the emptyDir cache folder and the error persists. I also tried creating it with another storage class, with the same result. Could it be something else in another directory? Same error here:
zsh ➜ klf immich-machine-learning-5d4b859887-zc27z
Traceback (most recent call last):
File "/opt/venv/lib/python3.10/site-packages/insightface/__init__.py", line 8, in <module>
import onnxruntime
File "/opt/venv/lib/python3.10/site-packages/onnxruntime/__init__.py", line 55, in <module>
raise import_capi_exception
File "/opt/venv/lib/python3.10/site-packages/onnxruntime/__init__.py", line 23, in <module>
from onnxruntime.capi._pybind_state import (
File "/opt/venv/lib/python3.10/site-packages/onnxruntime/capi/_pybind_state.py", line 33, in <module>
from .onnxruntime_pybind11_state import * # noqa
ImportError: /opt/venv/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_pybind11_state.cpython-310-x86_64-linux-gnu.so: cannot enable executable stack as shared object requires: Permission denied
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/src/app/src/main.py", line 6, in <module>
from insightface.app import FaceAnalysis
File "/opt/venv/lib/python3.10/site-packages/insightface/__init__.py", line 10, in <module>
raise ImportError(
ImportError: Unable to import dependency onnxruntime
Do you have SELinux enabled? From a bit of googling it seems like that could cause the error you're getting.
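For anyone who wants to answer that question quickly, here is a minimal sketch that reads the SELinux mode from selinuxfs. It assumes the standard mount point `/sys/fs/selinux` and is meant to be run on the host (node), not inside the container; the function name `selinux_mode` is made up for illustration.

```python
from pathlib import Path


def selinux_mode(selinuxfs: Path = Path("/sys/fs/selinux")) -> str:
    """Return 'enforcing', 'permissive', or 'disabled' based on selinuxfs.

    Assumption: selinuxfs is mounted at the usual location when SELinux is
    enabled, and its 'enforce' file contains "1" (enforcing) or "0" (permissive).
    """
    enforce_file = selinuxfs / "enforce"
    try:
        value = enforce_file.read_text().strip()
    except (FileNotFoundError, NotADirectoryError):
        # No selinuxfs here: SELinux is disabled or not built into this kernel.
        return "disabled"
    return "enforcing" if value == "1" else "permissive"
```

Calling `selinux_mode()` with no arguments checks the default location. The `getenforce` command gives the same answer where the SELinux userland tools are installed.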
I've also had problems after updating to 1.56.1
#2487 should fix this I believe
Amazing! 1.56.2 fixed it! Thank you!
Sadly I need to reopen this bug for 1.57.1. Same error. Any ideas?
Can you try removing the model cache, starting up the pod, and letting it finish downloading the model before use?
So, I removed the files from the cache portion of the k8s deployment, and the same error happens with the ephemeral storage. It seems odd to run into such errors even though there is no cache whatsoever...
Traceback (most recent call last):
File "/opt/venv/lib/python3.10/site-packages/insightface/__init__.py", line 8, in <module>
import onnxruntime
File "/opt/venv/lib/python3.10/site-packages/onnxruntime/__init__.py", line 55, in <module>
raise import_capi_exception
File "/opt/venv/lib/python3.10/site-packages/onnxruntime/__init__.py", line 23, in <module>
from onnxruntime.capi._pybind_state import (
File "/opt/venv/lib/python3.10/site-packages/onnxruntime/capi/_pybind_state.py", line 33, in <module>
from .onnxruntime_pybind11_state import * # noqa
ImportError: /opt/venv/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_pybind11_state.cpython-310-x86_64-linux-gnu.so: cannot enable executable stack as shared object requires: Permission denied
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/src/app/src/main.py", line 6, in <module>
from insightface.app import FaceAnalysis
File "/opt/venv/lib/python3.10/site-packages/insightface/__init__.py", line 10, in <module>
raise ImportError(
ImportError: Unable to import dependency onnxruntime.
Just did the update to 1.60.0 and it's still running into the same issue.
~~This permission denied issue makes me think it might be the permissions of the downloaded files. I'll look into the user that downloads the modules and see if there's something going on there.~~ Never mind. The user seems to have all the permissions it needs. I'll try to debug more tonight.
I have been getting this issue for the past few weeks as well. My server is still on 1.55.1, the last working version for me. I think @bo0tzz may be correct about SELinux permissions, as I do have SELinux enabled on my machines. What changed in between 1.55.1 and future versions that could cause this? Unfortunately disabling SELinux is not an option for me just to solve this one issue.
Same happening with v1.61.0:
zsh ➜ kgp
NAME READY STATUS RESTARTS AGE
immich-machine-learning-6978bfdbdf-z8www 0/1 CrashLoopBackOff 2 (25s ago) 70s
Traceback (most recent call last):
File "/opt/venv/lib/python3.11/site-packages/insightface/__init__.py", line 8, in <module>
import onnxruntime
File "/opt/venv/lib/python3.11/site-packages/onnxruntime/__init__.py", line 55, in <module>
raise import_capi_exception
File "/opt/venv/lib/python3.11/site-packages/onnxruntime/__init__.py", line 23, in <module>
from onnxruntime.capi._pybind_state import ExecutionMode # noqa: F401
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.11/site-packages/onnxruntime/capi/_pybind_state.py", line 33, in <module>
from .onnxruntime_pybind11_state import * # noqa
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ImportError: /opt/venv/lib/python3.11/site-packages/onnxruntime/capi/onnxruntime_pybind11_state.cpython-311-x86_64-linux-gnu.so: cannot enable executable stack as shared object requires: Permission denied
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/src/app/main.py", line 4, in <module>
from cache import ModelCache
File "/usr/src/app/cache.py", line 5, in <module>
from models import get_model
File "/usr/src/app/models.py", line 2, in <module>
from insightface.app import FaceAnalysis
File "/opt/venv/lib/python3.11/site-packages/insightface/__init__.py", line 10, in <module>
raise ImportError(
ImportError: Unable to import dependency onnxruntime.
> What changed in between 1.55.1 and future versions that could cause this?
v1.56.0 introduced face recognition, which I believe is what added the onnxruntime dependency.
I am seeing a different error message when starting immich-machine-learning container (v1.61.0):
python: can't open file '/usr/src/app/src/main.py': [Errno 2] No such file or directory
?
Is this a new issue?
@nohitme that is an unrelated issue. Please make sure you're using the latest image and docker-compose.yml, and open a support thread in Discord or the GitHub Discussions if you still have trouble.
Understand it could be a separate issue. I will verify it separately on the latest image (I am sure it was tho) and report it if it persists.
Thanks for the reply!
Interesting.. Freshly built container image for machine learning from the main branch:
zsh ➜ docker run -it test
Traceback (most recent call last):
File "/opt/venv/lib/python3.11/site-packages/insightface/__init__.py", line 8, in <module>
import onnxruntime
File "/opt/venv/lib/python3.11/site-packages/onnxruntime/__init__.py", line 55, in <module>
raise import_capi_exception
File "/opt/venv/lib/python3.11/site-packages/onnxruntime/__init__.py", line 23, in <module>
from onnxruntime.capi._pybind_state import ExecutionMode # noqa: F401
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.11/site-packages/onnxruntime/capi/_pybind_state.py", line 33, in <module>
from .onnxruntime_pybind11_state import * # noqa
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ImportError: /opt/venv/lib/python3.11/site-packages/onnxruntime/capi/onnxruntime_pybind11_state.cpython-311-x86_64-linux-gnu.so: cannot enable executable stack as shared object requires: Permission denied
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/src/app/main.py", line 5, in <module>
from cache import ModelCache
File "/usr/src/app/cache.py", line 5, in <module>
from models import get_model
File "/usr/src/app/models.py", line 2, in <module>
from insightface.app import FaceAnalysis
File "/opt/venv/lib/python3.11/site-packages/insightface/__init__.py", line 10, in <module>
raise ImportError(
ImportError: Unable to import dependency onnxruntime.
It's not k8s-specific then. I'll remove the multi-stage build to check whether something is missing or whether a permission mismatch is happening during the container build.
Same error building with a simple pip install of the requirements. This is the container image I'm using:
FROM python:3.11.4-bullseye@sha256:5b401676aff858495a5c9c726c60b8b73fe52833e9e16eccdb59e93d52741727
ENV NODE_ENV=production \
    TRANSFORMERS_CACHE=/cache \
    PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1 \
    PATH="/opt/venv/bin:$PATH" \
    PYTHONPATH=`pwd` \
    PIP_NO_CACHE_DIR=true
WORKDIR /usr/src/app
COPY ./requirements.txt .
COPY app .
RUN pip install -r requirements.txt
ENTRYPOINT ["python", "main.py"]
It runs into the same problem. Here's the directory it's trying to load from, which is owned by the root user:
zsh ➜ docker run --entrypoint "" -it test bash
root@41d3bfede331:/usr/src/app# cd /usr/local/lib/python3.11/site-packages/onnxruntime/capi/
root@41d3bfede331:/usr/local/lib/python3.11/site-packages/onnxruntime/capi# ls -al
total 14136
drwxr-xr-x. 1 root root 490 Jun 19 18:07 .
drwxr-xr-x. 1 root root 216 Jun 19 18:07 ..
-rw-r--r--. 1 root root 247 Jun 19 18:07 __init__.py
drwxr-xr-x. 1 root root 424 Jun 19 18:07 __pycache__
-rw-r--r--. 1 root root 406 Jun 19 18:07 _ld_preload.py
-rw-r--r--. 1 root root 1510 Jun 19 18:07 _pybind_state.py
-rwxr-xr-x. 1 root root 14216 Jun 19 18:07 libonnxruntime_providers_shared.so
-rw-r--r--. 1 root root 3965 Jun 19 18:07 onnxruntime_collect_build_info.py
-rw-r--r--. 1 root root 38714 Jun 19 18:07 onnxruntime_inference_collection.py
-rw-r--r--. 1 root root 14392120 Jun 19 18:07 onnxruntime_pybind11_state.cpython-311-x86_64-linux-gnu.so
-rw-r--r--. 1 root root 6237 Jun 19 18:07 onnxruntime_validation.py
drwxr-xr-x. 1 root root 44 Jun 19 18:07 training
root@41d3bfede331:/usr/local/lib/python3.11/site-packages/onnxruntime/capi# whoami
root
The onnxruntime_pybind11_state.cpython-311-x86_64-linux-gnu.so file is not executable though, so that might be the problem.
The only thing that I can see affecting this now is SELinux on the host running the container runtime for k3s. It makes me wonder what exactly this package is trying to access.
EDIT: I mean the package from onnxruntime. I'm trying to build it using their base image to account for that portion before building the python packages of this immich machine-learning image. Oddly enough their process to build is not working as intended. I will try to continue troubleshooting tomorrow.
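On the question of what the package is trying to do: as far as I understand, the wording "cannot enable executable stack as shared object requires" comes from the dynamic loader, which raises it when a shared object's `PT_GNU_STACK` program header asks for an executable stack and the loader is not allowed to grant it. The missing `x` bit on the file is probably a red herring, since loading a shared object normally only needs read permission. Below is a rough sketch for checking that header. It handles only 64-bit little-endian ELF files (which matches the x86_64 `.so` in the traceback), and the function name is made up for illustration.

```python
import struct
from typing import Optional

PT_GNU_STACK = 0x6474E551  # program header type that carries the stack permissions
PF_X = 0x1                 # "executable" bit in a program header's p_flags


def requests_executable_stack(elf: bytes) -> Optional[bool]:
    """Return True if the ELF image asks for an executable stack, False if it
    asks for a non-executable one, and None if it has no PT_GNU_STACK header.

    Sketch only: supports 64-bit little-endian ELF and nothing else.
    """
    if elf[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    if elf[4] != 2 or elf[5] != 1:
        raise ValueError("only 64-bit little-endian ELF is handled here")

    # ELF64 header layout: e_phoff at 0x20, e_phentsize at 0x36, e_phnum at 0x38.
    (e_phoff,) = struct.unpack_from("<Q", elf, 0x20)
    e_phentsize, e_phnum = struct.unpack_from("<HH", elf, 0x36)

    for i in range(e_phnum):
        # In an ELF64 program header, p_type and p_flags are the first two words.
        p_type, p_flags = struct.unpack_from("<II", elf, e_phoff + i * e_phentsize)
        if p_type == PT_GNU_STACK:
            return bool(p_flags & PF_X)
    return None
```

To try it, read the bytes of the `.so` named in the traceback with `open(path, "rb").read()` and pass them in. `readelf -lW` on the same file should show the same information in its `GNU_STACK` row.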
Error is slightly different now from the new version thanks to the update from #2951
zsh ➜ klf immich-machine-learning-668757f9b6-jgmsq
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/usr/src/app/main.py", line 12, in <module>
from .models.base import InferenceModel
File "/usr/src/app/models/__init__.py", line 1, in <module>
from .clip import CLIPSTEncoder
File "/usr/src/app/models/clip.py", line 8, in <module>
from .base import InferenceModel
File "/usr/src/app/models/base.py", line 8, in <module>
from onnxruntime.capi.onnxruntime_pybind11_state import InvalidProtobuf # type: ignore
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.11/site-packages/onnxruntime/__init__.py", line 55, in <module>
raise import_capi_exception
File "/opt/venv/lib/python3.11/site-packages/onnxruntime/__init__.py", line 23, in <module>
from onnxruntime.capi._pybind_state import ExecutionMode # noqa: F401
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.11/site-packages/onnxruntime/capi/_pybind_state.py", line 33, in <module>
from .onnxruntime_pybind11_state import * # noqa
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ImportError: /opt/venv/lib/python3.11/site-packages/onnxruntime/capi/onnxruntime_pybind11_state.cpython-311-x86_64-linux-gnu.so: cannot enable executable stack as shared object requires: Permission denied
I will try to make a few tweaks to the fsGroup in k8s and see if it helps.
Since this is SELinux not liking a dependency that I think we can't really do without (cc @mertalev?), I don't believe there is much we can do about this from the Immich side.
Kinda? I mean, the files are labeled as such in the container by default:
root@code-658d97b879-j2f6g:/usr/src/app# ls -Z /opt/venv/lib/python3.11/site-packages/onnxruntime/capi/
system_u:object_r:var_lib_t:s0 __init__.py system_u:object_r:var_lib_t:s0 onnxruntime_inference_collection.py
system_u:object_r:var_lib_t:s0 _ld_preload.py system_u:object_r:var_lib_t:s0 onnxruntime_pybind11_state.cpython-311-x86_64-linux-gnu.so
system_u:object_r:var_lib_t:s0 _pybind_state.py system_u:object_r:var_lib_t:s0 onnxruntime_validation.py
system_u:object_r:var_lib_t:s0 libonnxruntime_providers_shared.so system_u:object_r:var_lib_t:s0 training
system_u:object_r:var_lib_t:s0 onnxruntime_collect_build_info.py
Sorry, now that I think about it, those labels are probably coming from the installation of the onnx dotnet runtime. The problem is how it gets flagged at the SELinux level on the host:
type=AVC msg=audit(1688149775.122:26827): avc: denied { execstack } for pid=13404 comm="python" scontext=system_u:system_r:unconfined_service_t:s0 tcontext=system_u:system_r:unconfined_service_t:s0 tclass=process permissive=0
I guess we could get the entrypoint of the container to change it? But that would mean running some sort of init container that could re-label those. I haven't had much time to look into it, sorry, but maybe I could play around with those labels and get a workaround for us.
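For anyone digging through their own audit log for the same denial, here is a small sketch that pulls the interesting fields out of an AVC line like the one above. The regular expression only covers the fields that appear in that line, and the names `AVC_PATTERN` and `parse_avc_denial` are made up for illustration.

```python
import re
from typing import Dict, Optional

# Matches the denied permission set plus the comm, scontext, tcontext, and
# tclass fields, in the order they appear in the audit line quoted above.
AVC_PATTERN = re.compile(
    r'avc:\s+denied\s+\{\s*(?P<perms>[^}]*?)\s*\}'
    r'.*?\bcomm="(?P<comm>[^"]*)"'
    r'.*?\bscontext=(?P<scontext>\S+)'
    r'.*?\btcontext=(?P<tcontext>\S+)'
    r'.*?\btclass=(?P<tclass>\S+)'
)


def parse_avc_denial(line: str) -> Optional[Dict[str, str]]:
    """Return the parsed fields of an AVC denial, or None if the line does not match."""
    match = AVC_PATTERN.search(line)
    return match.groupdict() if match else None
```

Run against the line above, this reports `execstack` as the denied permission for the `python` command with class `process`, with both source and target contexts of type `unconfined_service_t`.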
Aha! It seems that k3s didn't enable SELinux integration by default:
system_u:system_r:unconfined_service_t:s0 30095 ? 00:01:23 longhorn
All pods are coming up as unconfined_service_t
Seems to be fixed by enabling the configuration on the node level: https://github.com/k3s-io/k3s/issues/533
It's weird, since this should have been done by default. I'll look into it and report back here as a reference for anyone else who runs into this.
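To check whether your own pods are in the same situation (everything running as `unconfined_service_t`), here is a minimal sketch that splits an SELinux context string into its parts. It assumes the usual `user:role:type:level` layout; the helper names are made up for illustration.

```python
from typing import NamedTuple


class SELinuxContext(NamedTuple):
    user: str
    role: str
    type: str
    level: str


def parse_context(label: str) -> SELinuxContext:
    """Split 'user:role:type[:level]' into its fields."""
    # The level can itself contain colons (e.g. "s0:c1,c2"), so split at most 3 times.
    parts = label.strip("\x00 \n").split(":", 3)
    if len(parts) < 3:
        raise ValueError(f"not an SELinux context: {label!r}")
    level = parts[3] if len(parts) > 3 else ""
    return SELinuxContext(parts[0], parts[1], parts[2], level)


def is_unconfined_service(label: str) -> bool:
    """True if the process runs with the unconfined_service_t type seen above."""
    return parse_context(label).type == "unconfined_service_t"
```

The label of a running process can be read on the host from `/proc/<pid>/attr/current`, or from the first column of `ps -eZ` as in the output above. With the container runtime's SELinux integration enabled, I would expect container processes to show a container-specific type (typically `container_t`) instead.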
I can confirm: adding a proper label to the Kubernetes containers allowed the execution to work properly. My instance is now running just fine for all machine learning tasks.
Thanks very much for the amazing project!
I am running the stack via docker-compose and I am using the latest docker-compose.yml. I am experiencing the same issue as described above.
The immich-machine-learning container runs into this issue at startup:
Traceback (most recent call last):
File "/opt/venv/lib/python3.10/site-packages/insightface/__init__.py", line 8, in <module>
import onnxruntime
File "/opt/venv/lib/python3.10/site-packages/onnxruntime/__init__.py", line 55, in <module>
raise import_capi_exception
File "/opt/venv/lib/python3.10/site-packages/onnxruntime/__init__.py", line 23, in <module>
from onnxruntime.capi._pybind_state import (
File "/opt/venv/lib/python3.10/site-packages/onnxruntime/capi/_pybind_state.py", line 33, in <module>
from .onnxruntime_pybind11_state import * # noqa
ImportError: /opt/venv/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_pybind11_state.cpython-310-x86_64-linux-gnu.so: cannot enable executable stack as shared object requires: Permission denied
I am not really sure how I can solve this issue.