OpenVINO ML pod crashes on jobs triggered by uploads but works fine on manually triggered jobs
I have searched the existing issues, both open and closed, to make sure this is not a duplicate report.
- [x] Yes
The bug
When I upload an image to Immich and the server automatically dispatches a face-detection job, the OpenVINO ML pod crashes and the server is unable to complete the job. But if I go to the admin panel and manually trigger a face detection job, it works flawlessly.
I am using OpenVINO with my CPU's integrated graphics, and I know that can lead to more issues than an external GPU. But this issue feels stranger than that: the functionality does work, just not when it's triggered automatically on upload by the server itself.
The OS that Immich Server is running on
Arch / Kubernetes
Version of Immich Server
v1.134.0
Version of Immich Mobile App
v1.134.0
Platform with the issue
- [x] Server
- [ ] Web
- [ ] Mobile
Your docker-compose.yml content
# This isn't a docker-compose.yml file because I don't use docker-compose; I use Helm. These are the Helm values.
env:
  DB_HOSTNAME: cnpg-rw.immich.svc.cluster.local
  DB_USERNAME: postgres
  DB_PASSWORD:
    valueFrom:
      secretKeyRef:
        name: cnpg-auth
        key: password
  REDIS_PASSWORD:
    valueFrom:
      secretKeyRef:
        name: redis-auth
        key: redis-password
image:
  tag: v1.134.0
immich:
  persistence:
    library:
      existingClaim: media
redis:
  enabled: true
  architecture: standalone
  auth:
    enabled: true
    existingSecret: redis-auth
    existingSecretPasswordKey: redis-password
server:
  enabled: true
  resources:
    limits:
      gpu.intel.com/i915: 1
  ingress:
    <redacted>
machine-learning:
  enabled: true
  image:
    repository: ghcr.io/immich-app/immich-machine-learning
    tag: v1.134.0-openvino
  resources:
    limits:
      gpu.intel.com/i915: 1
  persistence:
    cache:
      enabled: true
      type: pvc
      existingClaim: cache
Your .env content
n/a, see above.
Reproduction steps
- Upload an image
- The ML pod crashes/restarts and the job fails
- Go to the admin panel and trigger the job manually; it works
Relevant log output
The ML pod's logs up until it crashes (nothing interesting in them):
[05/28/25 07:42:58] INFO Starting gunicorn 23.0.0
[05/28/25 07:42:58] INFO Listening at: http://[::]:3003 (8)
[05/28/25 07:42:58] INFO Using worker: immich_ml.config.CustomUvicornWorker
[05/28/25 07:42:58] INFO Booting worker with pid: 9
[05/28/25 07:43:00] INFO Started server process [9]
[05/28/25 07:43:00] INFO Waiting for application startup.
[05/28/25 07:43:00] INFO Created in-memory cache with unloading after 300s
of inactivity.
[05/28/25 07:43:00] INFO Initialized request thread pool with 20 threads.
[05/28/25 07:43:00] INFO Application startup complete.
[05/28/25 10:04:01] INFO Loading detection model 'buffalo_l' to memory
[05/28/25 10:04:01] INFO Setting execution providers to
['OpenVINOExecutionProvider',
'CPUExecutionProvider'], in descending order of
preference
[05/28/25 10:04:01] INFO Loading visual model 'ViT-B-32__openai' to memory
[05/28/25 10:04:01] INFO Setting execution providers to
['OpenVINOExecutionProvider',
'CPUExecutionProvider'], in descending order of
preference
the logs of it restarting:
[05/28/25 10:04:30] INFO Starting gunicorn 23.0.0
[05/28/25 10:04:30] INFO Listening at: http://[::]:3003 (8)
[05/28/25 10:04:30] INFO Using worker: immich_ml.config.CustomUvicornWorker
[05/28/25 10:04:30] INFO Booting worker with pid: 9
[05/28/25 10:04:31] INFO Started server process [9]
[05/28/25 10:04:31] INFO Waiting for application startup.
[05/28/25 10:04:31] INFO Created in-memory cache with unloading after 300s
of inactivity.
[05/28/25 10:04:31] INFO Initialized request thread pool with 20 threads.
[05/28/25 10:04:31] INFO Application startup complete.
and then from the server:
[Nest] 7 - 05/28/2025, 10:04:24 AM ERROR [Microservices:{"source":"upload","id":"7e6aeae2-7c34-4ca7-b783-f43491dbe0d6"}] Unable to run job handler (face-detection): Error: Machine learning request '{"facial-recognition":{"detection":{"modelName":"buffalo_l","options":{"minScore":0.7}},"recognition":{"modelName":"buffalo_l"}}}' failed for all URLs
Error: Machine learning request '{"facial-recognition":{"detection":{"modelName":"buffalo_l","options":{"minScore":0.7}},"recognition":{"modelName":"buffalo_l"}}}' failed for all URLs
at MachineLearningRepository.predict (/usr/src/app/dist/repositories/machine-learning.repository.js:98:15)
at async MachineLearningRepository.detectFaces (/usr/src/app/dist/repositories/machine-learning.repository.js:107:26)
at async PersonService.handleDetectFaces (/usr/src/app/dist/services/person.service.js:232:52)
at async JobService.onJobStart (/usr/src/app/dist/services/job.service.js:166:28)
at async EventRepository.onEvent (/usr/src/app/dist/repositories/event.repository.js:126:13)
at async /usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:526:32
at async Worker.retryIfFailed (/usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:751:24)
[Nest] 7 - 05/28/2025, 10:04:24 AM ERROR [Microservices:{"source":"upload","id":"7e6aeae2-7c34-4ca7-b783-f43491dbe0d6"}] Unable to run job handler (smart-search): Error: Machine learning request '{"clip":{"visual":{"modelName":"ViT-B-32__openai"}}}' failed for all URLs
Error: Machine learning request '{"clip":{"visual":{"modelName":"ViT-B-32__openai"}}}' failed for all URLs
at MachineLearningRepository.predict (/usr/src/app/dist/repositories/machine-learning.repository.js:98:15)
at async MachineLearningRepository.encodeImage (/usr/src/app/dist/repositories/machine-learning.repository.js:116:26)
at async SmartInfoService.handleEncodeClip (/usr/src/app/dist/services/smart-info.service.js:91:27)
at async JobService.onJobStart (/usr/src/app/dist/services/job.service.js:166:28)
at async EventRepository.onEvent (/usr/src/app/dist/repositories/event.repository.js:126:13)
at async /usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:526:32
at async Worker.retryIfFailed (/usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:751:24)
[Nest] 7 - 05/28/2025, 10:04:24 AM ERROR [Microservices:{"source":"upload","id":"87420a98-ed41-4afa-892e-6a16d96b43cc"}] Unable to run job handler (smart-search): Error: Machine learning request '{"clip":{"visual":{"modelName":"ViT-B-32__openai"}}}' failed for all URLs
Error: Machine learning request '{"clip":{"visual":{"modelName":"ViT-B-32__openai"}}}' failed for all URLs
at MachineLearningRepository.predict (/usr/src/app/dist/repositories/machine-learning.repository.js:98:15)
at async MachineLearningRepository.encodeImage (/usr/src/app/dist/repositories/machine-learning.repository.js:116:26)
at async SmartInfoService.handleEncodeClip (/usr/src/app/dist/services/smart-info.service.js:91:27)
at async JobService.onJobStart (/usr/src/app/dist/services/job.service.js:166:28)
at async EventRepository.onEvent (/usr/src/app/dist/repositories/event.repository.js:126:13)
at async /usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:526:32
at async Worker.retryIfFailed (/usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:751:24)
[Nest] 7 - 05/28/2025, 10:04:24 AM ERROR [Microservices:{"source":"upload","id":"87420a98-ed41-4afa-892e-6a16d96b43cc"}] Unable to run job handler (face-detection): Error: Machine learning request '{"facial-recognition":{"detection":{"modelName":"buffalo_l","options":{"minScore":0.7}},"recognition":{"modelName":"buffalo_l"}}}' failed for all URLs
Error: Machine learning request '{"facial-recognition":{"detection":{"modelName":"buffalo_l","options":{"minScore":0.7}},"recognition":{"modelName":"buffalo_l"}}}' failed for all URLs
at MachineLearningRepository.predict (/usr/src/app/dist/repositories/machine-learning.repository.js:98:15)
at async MachineLearningRepository.detectFaces (/usr/src/app/dist/repositories/machine-learning.repository.js:107:26)
at async PersonService.handleDetectFaces (/usr/src/app/dist/services/person.service.js:232:52)
at async JobService.onJobStart (/usr/src/app/dist/services/job.service.js:166:28)
at async EventRepository.onEvent (/usr/src/app/dist/repositories/event.repository.js:126:13)
at async /usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:526:32
at async Worker.retryIfFailed (/usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:751:24)
Additional information
No response
I am using OpenVINO with my CPU's integrated graphics and I know that can lead to more issues than an external GPU
I hit the same issue with a dedicated GPU (an Arc A380), so it shouldn't be related to integrated graphics.
I'm running Immich via Podman Quadlets, but I'm 99% sure that shouldn't be the issue (at least it never was before).
Same issue here. Running Docker Desktop v4.42.0 on Windows with a dedicated RTX 3090.
Are you sure this isn't a case of the container reaching resource limits / running out of memory?
Also @redaws00, it's unlikely that an issue with your GPU has the same root cause as the OP using OpenVINO.
@mertalev I'm new to this application... and to Docker on Windows in general. Is there a good way to get logs to look at? My current workaround is to run machine learning in a separate container that I disable and enable when needed. When they run together, both the web and ML containers crash.
You can post the machine learning logs with docker logs immich_machine_learning.
In general, Docker on Windows can be convenient, but it's also much less stable than Docker on Linux. Issues can often be fixed by upgrading to the latest Docker Desktop version and latest NVIDIA driver, or even just restarting the PC (a classic).
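If the default logs don't show anything useful before the lockup, it may also help to raise the log level. This is just a sketch, assuming the documented IMMICH_LOG_LEVEL variable is honored by the machine-learning container on your version and that you're using the default docker-compose service name; check the environment-variables docs to confirm:
services:
  immich-machine-learning:
    environment:
      # Most detailed level; revert once the crash is diagnosed.
      IMMICH_LOG_LEVEL: verbose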
I tried that already. Unfortunately, my 3090 machine runs Windows and I have to keep it that way (being a .NET developer). Docker logs are no good, because once the container locks up, all I get is the following: request returned 500 Internal Server Error for API route and version http://%2F%2F.%2Fpipe%2FdockerDesktopLinuxEngine/v1.49/containers/7fa6423c94830c0969a92a92bebf2dbe2e3647261523a99c596350e979d20327/json, check if the server supports the requested API version
Are there any logs stored in the container?
Are you sure this isn't a case of the container reaching resource limits / running out of memory?
I just tried uploading something while watching the memory usage side by side. It doesn't really spike, and there's nothing in the kubectl events about it getting OOM-killed or anything; it just immediately starts failing probes:
0s Warning Unhealthy pod/immich-machine-learning-f6c9d559b-p55tx Liveness probe failed: Get "http://10.0.0.100:3003/ping": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
0s Warning Unhealthy pod/immich-machine-learning-f6c9d559b-p55tx Readiness probe failed: Get "http://10.0.0.100:3003/ping": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
and the pod eventually gets killed. I also don't get why it would hit resource limits, given that it's fine when triggered manually.
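For reference, my values above only set a GPU limit on the ML container, so there is no container memory limit it could even hit. If I wanted to rule memory out completely I could pin explicit requests/limits and watch whether the restarts change character. A sketch against the same values layout as above, with numbers that are just guesses, not recommendations:
machine-learning:
  resources:
    requests:
      memory: 2Gi
    limits:
      gpu.intel.com/i915: 1
      memory: 6Gi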
That's pretty weird. There should be no difference between what the job does on manual queue vs on upload, and no difference in what it sends to the machine learning service. I would have to guess it's a timing issue of some kind, but I'm not sure where that would be either.
I'm seeing exactly the same: the /ping endpoint stops functioning as soon as I trigger e.g. a smart search. Kubernetes then kills the pod after a while.
The likely reason is that the model has to be compiled to OpenVINO format the first time it's loaded, which (unfortunately) blocks the ML server from responding to any requests until it's done.
About how long should it take to compile? I realize that times differ system to system.
Typically a few minutes, at most maybe 15 in a very constrained environment with a large model.
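If Kubernetes is killing the pod purely because /ping blocks during that first compile, making the probes more tolerant should let it survive long enough to finish. A rough sketch only: the probes override path here is an assumption based on charts built on the bjw-s common library, so check your chart's values schema (the probe fields themselves are standard Kubernetes, and the /ping path and port 3003 come from the probe events above):
machine-learning:
  probes:
    liveness:
      custom: true
      spec:
        httpGet:
          path: /ping
          port: 3003
        # ~15 minutes of tolerated unresponsiveness (30s period x 30 failures)
        # to cover first-time OpenVINO model compilation.
        periodSeconds: 30
        timeoutSeconds: 10
        failureThreshold: 30
    readiness:
      custom: true
      spec:
        httpGet:
          path: /ping
          port: 3003
        periodSeconds: 30
        timeoutSeconds: 10
        failureThreshold: 30
Once the compiled model is cached on the cache PVC, subsequent loads shouldn't need anywhere near that long.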
I don't think that's the problem I'm having. My current solution is to separate the ML from the main app and only run the ML once a week while manually triggering the jobs. It still occasionally locks up all of Docker (on Windows), but I can get the jobs completed.