
OpenVINO ML pod crashes on jobs triggered by uploads but works fine on manually triggered jobs

Open • againstpetra opened this issue 7 months ago • 8 comments

I have searched the existing issues, both open and closed, to make sure this is not a duplicate report.

  • [x] Yes

The bug

When I upload an image to Immich and the server automatically dispatches a face detection job, the OpenVINO ML pod crashes and the server is unable to complete the job. But if I go to the admin panel and manually trigger a face detection job, it works flawlessly.

I am using OpenVINO with my CPU's integrated graphics and I know that can lead to more issues than an external GPU. But this issue feels stranger than that: the functionality does work, just not when triggered automatically on upload by the server itself.

The OS that Immich Server is running on

Arch / Kubernetes

Version of Immich Server

v1.134.0

Version of Immich Mobile App

v1.134.0

Platform with the issue

  • [x] Server
  • [ ] Web
  • [ ] Mobile

Your docker-compose.yml content

# this isn't a docker-compose.yml file because I don't use docker-compose. I use helm. these are the helm values.
env:
  DB_HOSTNAME: cnpg-rw.immich.svc.cluster.local
  DB_USERNAME: postgres
  DB_PASSWORD:
    valueFrom:
      secretKeyRef:
        name: cnpg-auth
        key: password
  REDIS_PASSWORD:
    valueFrom:
      secretKeyRef:
        name: redis-auth
        key: redis-password
image:
  tag: v1.134.0
immich:
  persistence:
    library:
      existingClaim: media
redis:
  enabled: true
  architecture: standalone
  auth:
    enabled: true
    existingSecret: redis-auth
    existingSecretPasswordKey: redis-password
server:
  enabled: true
  resources:
    limits:
      gpu.intel.com/i915: 1
  ingress:
    <redacted>
machine-learning:
  enabled: true
  image:
    repository: ghcr.io/immich-app/immich-machine-learning
    tag: v1.134.0-openvino
  resources:
    limits:
      gpu.intel.com/i915: 1
  persistence:
    cache:
      enabled: true
      type: pvc
      existingClaim: cache

Your .env content

n/a, see above.

Reproduction steps

  1. Upload an image.
  2. The pod crashes / restarts and the job fails.
  3. Go to the admin panel and trigger the job manually; it works.

Relevant log output

the ML pod's logs until it crashes (nothing interesting in them):

[05/28/25 07:42:58] INFO     Starting gunicorn 23.0.0                           
[05/28/25 07:42:58] INFO     Listening at: http://[::]:3003 (8)                 
[05/28/25 07:42:58] INFO     Using worker: immich_ml.config.CustomUvicornWorker 
[05/28/25 07:42:58] INFO     Booting worker with pid: 9                         
[05/28/25 07:43:00] INFO     Started server process [9]                         
[05/28/25 07:43:00] INFO     Waiting for application startup.                   
[05/28/25 07:43:00] INFO     Created in-memory cache with unloading after 300s  
                             of inactivity.                                     
[05/28/25 07:43:00] INFO     Initialized request thread pool with 20 threads.   
[05/28/25 07:43:00] INFO     Application startup complete.                      
[05/28/25 10:04:01] INFO     Loading detection model 'buffalo_l' to memory      
[05/28/25 10:04:01] INFO     Setting execution providers to                     
                             ['OpenVINOExecutionProvider',                      
                             'CPUExecutionProvider'], in descending order of    
                             preference                                         
[05/28/25 10:04:01] INFO     Loading visual model 'ViT-B-32__openai' to memory  
[05/28/25 10:04:01] INFO     Setting execution providers to                     
                             ['OpenVINOExecutionProvider',                      
                             'CPUExecutionProvider'], in descending order of    
                             preference    

the logs of it restarting:

[05/28/25 10:04:30] INFO     Starting gunicorn 23.0.0                           
[05/28/25 10:04:30] INFO     Listening at: http://[::]:3003 (8)                 
[05/28/25 10:04:30] INFO     Using worker: immich_ml.config.CustomUvicornWorker 
[05/28/25 10:04:30] INFO     Booting worker with pid: 9                         
[05/28/25 10:04:31] INFO     Started server process [9]                         
[05/28/25 10:04:31] INFO     Waiting for application startup.                   
[05/28/25 10:04:31] INFO     Created in-memory cache with unloading after 300s  
                             of inactivity.                                     
[05/28/25 10:04:31] INFO     Initialized request thread pool with 20 threads.   
[05/28/25 10:04:31] INFO     Application startup complete. 

and then from the server:

[Nest] 7  - 05/28/2025, 10:04:24 AM   ERROR [Microservices:{"source":"upload","id":"7e6aeae2-7c34-4ca7-b783-f43491dbe0d6"}] Unable to run job handler (face-detection): Error: Machine learning request '{"facial-recognition":{"detection":{"modelName":"buffalo_l","options":{"minScore":0.7}},"recognition":{"modelName":"buffalo_l"}}}' failed for all URLs
Error: Machine learning request '{"facial-recognition":{"detection":{"modelName":"buffalo_l","options":{"minScore":0.7}},"recognition":{"modelName":"buffalo_l"}}}' failed for all URLs
    at MachineLearningRepository.predict (/usr/src/app/dist/repositories/machine-learning.repository.js:98:15)
    at async MachineLearningRepository.detectFaces (/usr/src/app/dist/repositories/machine-learning.repository.js:107:26)
    at async PersonService.handleDetectFaces (/usr/src/app/dist/services/person.service.js:232:52)
    at async JobService.onJobStart (/usr/src/app/dist/services/job.service.js:166:28)
    at async EventRepository.onEvent (/usr/src/app/dist/repositories/event.repository.js:126:13)
    at async /usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:526:32
    at async Worker.retryIfFailed (/usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:751:24)
[Nest] 7  - 05/28/2025, 10:04:24 AM   ERROR [Microservices:{"source":"upload","id":"7e6aeae2-7c34-4ca7-b783-f43491dbe0d6"}] Unable to run job handler (smart-search): Error: Machine learning request '{"clip":{"visual":{"modelName":"ViT-B-32__openai"}}}' failed for all URLs
Error: Machine learning request '{"clip":{"visual":{"modelName":"ViT-B-32__openai"}}}' failed for all URLs
    at MachineLearningRepository.predict (/usr/src/app/dist/repositories/machine-learning.repository.js:98:15)
    at async MachineLearningRepository.encodeImage (/usr/src/app/dist/repositories/machine-learning.repository.js:116:26)
    at async SmartInfoService.handleEncodeClip (/usr/src/app/dist/services/smart-info.service.js:91:27)
    at async JobService.onJobStart (/usr/src/app/dist/services/job.service.js:166:28)
    at async EventRepository.onEvent (/usr/src/app/dist/repositories/event.repository.js:126:13)
    at async /usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:526:32
    at async Worker.retryIfFailed (/usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:751:24)
[Nest] 7  - 05/28/2025, 10:04:24 AM   ERROR [Microservices:{"source":"upload","id":"87420a98-ed41-4afa-892e-6a16d96b43cc"}] Unable to run job handler (smart-search): Error: Machine learning request '{"clip":{"visual":{"modelName":"ViT-B-32__openai"}}}' failed for all URLs
Error: Machine learning request '{"clip":{"visual":{"modelName":"ViT-B-32__openai"}}}' failed for all URLs
    at MachineLearningRepository.predict (/usr/src/app/dist/repositories/machine-learning.repository.js:98:15)
    at async MachineLearningRepository.encodeImage (/usr/src/app/dist/repositories/machine-learning.repository.js:116:26)
    at async SmartInfoService.handleEncodeClip (/usr/src/app/dist/services/smart-info.service.js:91:27)
    at async JobService.onJobStart (/usr/src/app/dist/services/job.service.js:166:28)
    at async EventRepository.onEvent (/usr/src/app/dist/repositories/event.repository.js:126:13)
    at async /usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:526:32
    at async Worker.retryIfFailed (/usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:751:24)
[Nest] 7  - 05/28/2025, 10:04:24 AM   ERROR [Microservices:{"source":"upload","id":"87420a98-ed41-4afa-892e-6a16d96b43cc"}] Unable to run job handler (face-detection): Error: Machine learning request '{"facial-recognition":{"detection":{"modelName":"buffalo_l","options":{"minScore":0.7}},"recognition":{"modelName":"buffalo_l"}}}' failed for all URLs
Error: Machine learning request '{"facial-recognition":{"detection":{"modelName":"buffalo_l","options":{"minScore":0.7}},"recognition":{"modelName":"buffalo_l"}}}' failed for all URLs
    at MachineLearningRepository.predict (/usr/src/app/dist/repositories/machine-learning.repository.js:98:15)
    at async MachineLearningRepository.detectFaces (/usr/src/app/dist/repositories/machine-learning.repository.js:107:26)
    at async PersonService.handleDetectFaces (/usr/src/app/dist/services/person.service.js:232:52)
    at async JobService.onJobStart (/usr/src/app/dist/services/job.service.js:166:28)
    at async EventRepository.onEvent (/usr/src/app/dist/repositories/event.repository.js:126:13)
    at async /usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:526:32
    at async Worker.retryIfFailed (/usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:751:24)

Additional information

No response

againstpetra • May 29 '25

I am using OpenVINO with my CPU's integrated graphics and I know that can lead to more issues than an external GPU

I hit the same issue with a dedicated GPU (Arc A380), so it shouldn't be related to integrated graphics.

I'm running Immich via Podman Quadlets but I'm 99% sure that shouldn't be the issue (at least it never was an issue before)

xarblu • Jun 02 '25

Same issue here, running Docker Desktop v4.42.0 on Windows with a dedicated RTX 3090.

redaws00 • Jun 07 '25

Are you sure this isn't a case of the container reaching resource limits / running out of memory?

Also @redaws00, it's unlikely that an issue with your GPU has the same root cause as the OP's issue with OpenVINO.
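
If you want to rule that out on the Kubernetes side, one option is to pin explicit memory requests and limits on the ML pod so an out-of-memory kill shows up unambiguously as OOMKilled in the pod status. A sketch against the helm values above; the 2Gi/4Gi figures are placeholders, not a recommendation:

machine-learning:
  resources:
    requests:
      memory: 2Gi               # placeholder; size to what the loaded models actually need
    limits:
      memory: 4Gi               # placeholder ceiling; exceeding it is reported as OOMKilled
      gpu.intel.com/i915: 1     # keep the existing GPU limit from the values above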

mertalev • Jun 07 '25

@mertalev I'm new to this application... and Docker on Windows in general. Is there a good way to get logs to look at? My current workaround is to have machine learning in a separate container that I disable and enable when needed. When they run together, both the web and ML containers crash.

redaws00 • Jun 07 '25

You can post the machine learning logs with docker logs immich_machine_learning.

In general, Docker on Windows can be convenient, but it's also much less stable than Docker on Linux. Issues can often be fixed by upgrading to the latest Docker Desktop version and latest NVIDIA driver, or even just restarting the PC (a classic).

mertalev • Jun 07 '25

I tried that already. Unfortunately, my 3090 machine runs Windows and I have to keep it that way (being a .NET developer). Docker logs are no good because once the container locks up, all I get is the following: request returned 500 Internal Server Error for API route and version http://%2F%2F.%2Fpipe%2FdockerDesktopLinuxEngine/v1.49/containers/7fa6423c94830c0969a92a92bebf2dbe2e3647261523a99c596350e979d20327/json, check if the server supports the requested API version

Are there any logs stored in the container?

redaws00 • Jun 07 '25

Are you sure this isn't a case of the container reaching resource limits / running out of memory?

I just tried uploading something with the memory usage open side by side; it doesn't really spike, and there's nothing in the kubectl events about it getting OOM-killed. It just immediately starts failing probes:

0s          Warning   Unhealthy          pod/immich-machine-learning-f6c9d559b-p55tx   Liveness probe failed: Get "http://10.0.0.100:3003/ping": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
0s          Warning   Unhealthy          pod/immich-machine-learning-f6c9d559b-p55tx   Readiness probe failed: Get "http://10.0.0.100:3003/ping": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

and eventually gets killed. I also don't get why it would reach resource limits given it's fine when triggered manually.

againstpetra • Jun 07 '25

That's pretty weird. There should be no difference between what the job does on manual queue vs on upload, and no difference in what it sends to the machine learning service. I would have to guess it's a timing issue of some kind, but I'm not sure where that would be either.

mertalev • Jun 07 '25

it just immediately starts failing probes ... and eventually gets killed

I'm seeing exactly the same: the /ping endpoint stops functioning as soon as I trigger e.g. a smart search. Kubernetes then kills the pod after a while.

croneter • Oct 08 '25

The likely reason is that the model has to be compiled to OpenVINO format the first time it's loaded, which (unfortunately) blocks the ML server from responding to any requests until it's done.
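
If that's what's happening here, one possible mitigation is to loosen the liveness/readiness probe thresholds on the ML pod so Kubernetes tolerates a few minutes of unresponsiveness while a model compiles. A sketch only, assuming the chart exposes common-library-style probes values for the machine-learning component (verify against the chart version you deploy):

machine-learning:
  probes:
    liveness:
      spec:
        periodSeconds: 30
        timeoutSeconds: 10
        failureThreshold: 12    # tolerates roughly 6 minutes of failed /ping checks
    readiness:
      spec:
        periodSeconds: 30
        timeoutSeconds: 10
        failureThreshold: 12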

mertalev • Oct 08 '25

About how long should it take to compile? I realize that times differ from system to system.

redaws00 • Oct 08 '25

Typically a few minutes, at most maybe 15 in a very constrained environment with a large model.
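
Another option is to move that cost to container startup instead of the first upload by preloading the models. A sketch against the helm values above, assuming the deployed immich-machine-learning version supports the MACHINE_LEARNING_PRELOAD__* settings and that the chart passes env through to the ML container (check both against the docs for your version):

machine-learning:
  env:
    # model names taken from the logs above; confirm the variable names
    # against the immich-machine-learning documentation for v1.134.0
    MACHINE_LEARNING_PRELOAD__CLIP__VISUAL: ViT-B-32__openai
    MACHINE_LEARNING_PRELOAD__FACIAL_RECOGNITION__DETECTION: buffalo_l
    MACHINE_LEARNING_PRELOAD__FACIAL_RECOGNITION__RECOGNITION: buffalo_l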

mertalev • Oct 08 '25

I don't think that's the problem I'm having. My current solution is to separate the ML from the main app and only run the ML once a week while manually triggering the jobs. It still occasionally locks up all of Docker (on Windows), but I can get the jobs completed.

redaws00 • Oct 08 '25