running tasks are not power-loss / kernelpanic safe.

Open GottZ opened this issue 6 months ago • 7 comments

I have searched the existing issues, both open and closed, to make sure this is not a duplicate report.

  • [x] Yes

The bug

sadly my hetzner server sometimes force-reboots due to too high cpu temperature. I'm on it, but I noticed immich is unable to keep transition-safe states. tasks should have an "is generating" state in the db, so unfinished tasks can be re-initiated on reboot before faulty data is actually inserted into the live environment.
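
A minimal sketch of what such an "is generating" state could look like. The table name, state names and functions are hypothetical and not Immich's actual job model, and SQLite only stands in for the real database here. The idea is that the state is committed before the work starts, so a crash leaves a visible marker that a startup routine can reset.

```python
import sqlite3

# Hypothetical state names, chosen only for this illustration.
PENDING, GENERATING, DONE = "pending", "generating", "done"


def open_db(path=":memory:"):
    """Open the database and create the (hypothetical) job table if needed."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS thumbnail_jobs ("
        "asset_id TEXT PRIMARY KEY, state TEXT NOT NULL)"
    )
    return db


def enqueue(db, asset_id):
    with db:  # commits on success, rolls back on exception
        db.execute(
            "INSERT OR IGNORE INTO thumbnail_jobs VALUES (?, ?)",
            (asset_id, PENDING),
        )


def run_job(db, asset_id, generate):
    # Commit the "generating" marker in its own transaction before any work.
    with db:
        db.execute(
            "UPDATE thumbnail_jobs SET state = ? WHERE asset_id = ?",
            (GENERATING, asset_id),
        )
    generate(asset_id)  # a crash here leaves the row in "generating"
    with db:
        db.execute(
            "UPDATE thumbnail_jobs SET state = ? WHERE asset_id = ?",
            (DONE, asset_id),
        )


def recover_unfinished(db):
    """On startup, reset interrupted jobs to pending and return their ids."""
    with db:
        rows = db.execute(
            "SELECT asset_id FROM thumbnail_jobs WHERE state = ? ORDER BY asset_id",
            (GENERATING,),
        ).fetchall()
        db.execute(
            "UPDATE thumbnail_jobs SET state = ? WHERE state = ?",
            (PENDING, GENERATING),
        )
    return [row[0] for row in rows]
```

Calling `recover_unfinished(db)` once at startup would hand back every asset whose job was cut off mid-run, so it can be queued again instead of leaving a broken preview behind.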

The OS that Immich Server is running on

Arch via docker compose

Version of Immich Server

v1.134.0

Version of Immich Mobile App

irrelevant

Platform with the issue

  • [x] Server
  • [ ] Web
  • [ ] Mobile

Your docker-compose.yml content

#
# WARNING: To install Immich, follow our guide: https://immich.app/docs/install/docker-compose
#
# Make sure to use the docker-compose.yml of the current release:
#
# https://github.com/immich-app/immich/releases/latest/download/docker-compose.yml
#
# The compose file on main may not be compatible with the latest release.

name: immich

services:
  immich-server:
    container_name: immich_server
    image: ghcr.io/immich-app/immich-server:${IMMICH_VERSION:-release}
    # extends:
    #   file: hwaccel.transcoding.yml
    #   service: cpu # set to one of [nvenc, quicksync, rkmpp, vaapi, vaapi-wsl] for accelerated transcoding
    volumes:
      # Do not edit the next line. If you want to change the media storage location on your system, edit the value of UPLOAD_LOCATION in the .env file
      - ${UPLOAD_LOCATION}:/usr/src/app/upload
      - /etc/localtime:/etc/localtime:ro
    env_file:
      - .env
    # ports:
    #  - '2283:2283'
    networks:
      - immich
      - reverseproxy
    depends_on:
      - redis
      - database
    restart: unless-stopped
    healthcheck:
      disable: false

  immich-machine-learning:
    container_name: immich_machine_learning
    # For hardware acceleration, add one of -[armnn, cuda, rocm, openvino, rknn] to the image tag.
    # Example tag: ${IMMICH_VERSION:-release}-cuda
    image: ghcr.io/immich-app/immich-machine-learning:${IMMICH_VERSION:-release}
    # extends: # uncomment this section for hardware acceleration - see https://immich.app/docs/features/ml-hardware-acceleration
    #   file: hwaccel.ml.yml
    #   service: cpu # set to one of [armnn, cuda, rocm, openvino, openvino-wsl, rknn] for accelerated inference - use the `-wsl` version for WSL2 where applicable
    networks:
      - immich
    volumes:
      - model-cache:/cache
    env_file:
      - .env
    restart: unless-stopped
    healthcheck:
      disable: false

  redis:
    container_name: immich_redis
    image: docker.io/valkey/valkey:8-bookworm@sha256:ff21bc0f8194dc9c105b769aeabf9585fea6a8ed649c0781caeac5cb3c247884
    networks:
      - immich
    healthcheck:
      test: redis-cli ping || exit 1
    restart: unless-stopped

  database:
    container_name: immich_postgres
    image: ghcr.io/immich-app/postgres:14-vectorchord0.3.0-pgvectors0.2.0@sha256:fa4f6e0971f454cd95fec5a9aaed2ed93d8f46725cc6bc61e0698e97dba96da1
    networks:
      - immich
    environment:
      POSTGRES_PASSWORD: ${DB_PASSWORD}
      POSTGRES_USER: ${DB_USERNAME}
      POSTGRES_DB: ${DB_DATABASE_NAME}
      POSTGRES_INITDB_ARGS: '--data-checksums'
      # Uncomment the DB_STORAGE_TYPE: 'HDD' var if your database isn't stored on SSDs
      # DB_STORAGE_TYPE: 'HDD'
    volumes:
      # Do not edit the next line. If you want to change the database storage location on your system, edit the value of DB_DATA_LOCATION in the .env file
      - ${DB_DATA_LOCATION}:/var/lib/postgresql/data
    restart: unless-stopped

volumes:
  model-cache:

networks:
  reverseproxy:
    name: reverseproxy
    external: true
  immich:
    name: immich

Your .env content

# You can find documentation for all the supported env variables at https://immich.app/docs/install/environment-variables

# The location where your uploaded files are stored
UPLOAD_LOCATION=/mnt/storage/immich/library

# The location where your database files are stored. Network shares are not supported for the database
DB_DATA_LOCATION=./db

# To set a timezone, uncomment the next line and change Etc/UTC to a TZ identifier from this list: https://en.wikipedia.org/wiki/List_of_tz_database_time_zones#List
TZ=Europe/Berlin

# The Immich version to use. You can pin this to a specific version like "v1.71.0"
IMMICH_VERSION=release

# Connection secret for postgres. You should change it to a random password
# Please use only the characters `A-Za-z0-9`, without special characters or spaces
DB_PASSWORD=redactedForObviousReason

# The values below this line do not need to be changed
###################################################################################
DB_USERNAME=postgres
DB_DATABASE_NAME=immich

Reproduction steps

  1. start file import via mobile app or api with a couple hundred pictures
  2. wait until tasks start generating previews etc.
  3. cut power / initiate kernel panic
  4. reboot server
  5. browse immich web. -> broken thumbnails and previews, only replaceable by deletion and re-upload

Relevant log output

cutting server power produces no logs.

Additional information

No response

GottZ avatar Jun 05 '25 13:06 GottZ

Is the redis container state being lost on reboot? If so you might want to mount a volume for it.

bo0tzz avatar Jun 05 '25 13:06 bo0tzz

only replaceable by deletion and re-upload

This is probably not the case btw. There are buttons in the top-right menu to refresh jobs for an asset, or you can run jobs in bulk from the admin panel.

bo0tzz avatar Jun 05 '25 13:06 bo0tzz

only replaceable by deletion and re-upload

This is probably not the case btw. There are buttons in the top-right menu to refresh jobs for an asset, or you can run jobs in bulk from the admin panel.

na. the button doesn't do anything. preview stayed broken. does it need browser cache invalidation? I didn't check for that.

GottZ avatar Jun 05 '25 13:06 GottZ

Is the redis container state being lost on reboot? If so you might want to mount a volume for it.

well.. it's the recommended conf with docker.io/valkey/valkey:8-bookworm so.. I suppose it would be wise to consider adding support for persistence in the default config then, by either appendonly or another mechanism of redis

I'm wondering if a race condition could come up.. is the "in progress" state synchronized to storage, waiting for completion? otherwise the power loss would cause the same issue. it has to be blocking IO or otherwise it's not doing its job.

it's a docker volume btw so persistence exists across reboots. Image
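
On the blocking IO point, a rough sketch of what a write that waits for storage could look like. `durable_write` is a made-up helper for illustration, not something from Immich, and it assumes a POSIX system where the filesystem and the drive actually honor fsync.

```python
import os
import tempfile


def durable_write(path, data):
    """Write bytes to path and return only after they were flushed to storage."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temp file in the same directory so the final rename is atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # block until the file contents are flushed
        os.replace(tmp_path, path)  # atomically swap the finished file in
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # fsync the directory as well, so the rename itself survives a power cut.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Any state change in the database would then only be written after `durable_write` has returned, so the database never gets ahead of the slower disk.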

GottZ avatar Jun 05 '25 14:06 GottZ

update:

the previews don't generate, cause the file uploads caused a weird state of brokenness:

Image

in essence, the raw files have 0 bytes, while the database has metadata and checksums about them.

GottZ avatar Jun 05 '25 14:06 GottZ

I don't see any way that could happen from Immich's end - I suspect it's your filesystem causing problems.

bo0tzz avatar Jun 05 '25 14:06 bo0tzz

I don't see any way that could happen from Immich's end - I suspect it's your filesystem causing problems.

sorry I've been flat-lined for the last few days with an intense cold so.. here we go:

the database is hosted on a nvme raid 0, so obviously, anything written to the database pretty much persists instantly.

the photo library is stored on a raid 0 with two HDDs. in a sense, slow as f*ck to persist anything.

now, considering that people detection etc. ran through before the original image was successfully persisted, I'd suggest adding a mechanism that, upon immich startup, checks whether the latest images were actually persisted, rather than assuming they already exist when trying to re-upload.

dead images don't allow override as long as they exist in database.
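
The startup check suggested above could look roughly like this. It is only a sketch: `find_unpersisted` and the shape of `records` are invented for illustration, and SHA-1 is an assumption, since all that is known here is that the database holds checksums for the files.

```python
import hashlib
import os


def sha1_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large originals don't need to fit in memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_unpersisted(records, limit=100):
    """Check the newest `limit` records against the files on disk.

    `records` is a list of (path, expected_sha1_hex) tuples, oldest first,
    as they would be read from the database (hypothetical shape).
    Returns (path, reason) for every file that is missing, empty or corrupt.
    """
    broken = []
    for path, expected in records[max(0, len(records) - limit):]:
        if not os.path.isfile(path):
            broken.append((path, "missing"))
        elif os.path.getsize(path) == 0:
            broken.append((path, "empty"))
        elif sha1_of(path) != expected:
            broken.append((path, "checksum mismatch"))
    return broken
```

Entries returned by this check could then be flagged so that a re-upload is allowed to overwrite them, instead of being rejected as duplicates.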

GottZ avatar Jun 09 '25 05:06 GottZ

I don't think this is something that we're going to investigate or fix. We have some things in place to help the user manage such situations though:

  • Immich doesn't create a record in the database until the entire file is uploaded and the checksum validated (to be unique)
  • Jobs are indeed not cleared or automatically restarted/reset on boot, this is by design
  • There is an entire administration screen on web which enables the user to resolve these types of issues on their own
  • Jobs are automatically scheduled to run at night which will generate thumbnails for assets that are missing them, and other database related clean-up tasks
  • #12293 tracks general "integrity" features, with the goal of being able to detect when the contents of a file on disk are changed/missing, including information about generated thumbnails.

jrasm91 avatar Sep 15 '25 17:09 jrasm91