feat: Use Helm hooks for applying database migrations (MPT-12683)
Description
Use Helm Hooks for database migrations for all the services using alembic, clickhouse or mongodb migrations:
- `rest_api` - alembic
- `auth` - alembic
- `herald` - alembic
- `jira_bus` - alembic
- `slacker` - alembic
- `katara` - alembic
- `risp_worker` - clickhouse
- `metroculus_worker` - clickhouse
- `gemini_worker` - clickhouse
- `insider_worker` - mongo
- `diworker` - mongo
There is a separate hook for each of the above services. That way we can:
- ensure that the migrations are run only once per deployment (no matter how many pods are spawned for each of them)
- have per-service dependencies for running the migrations. For example, the `diworker`'s migrations require not only `mongo`, `clickhouse` and `rabbitmq` but also `rest_api` to be running before they are applied, because some of these migrations make calls to `rest_api`. And of course the `rest_api`'s own migrations can't require the service to be running, as they are executed before the server starts. A sketch of such a per-service hook Job is shown below.
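For illustration, this is roughly the shape of one of these per-service hook Jobs (a simplified sketch using `auth` as an example; the real chart templates also include waiters for elk and etcd, and their exact names, labels and values may differ):

```yaml
# Simplified sketch of a per-service migrations hook Job (auth as an example).
apiVersion: batch/v1
kind: Job
metadata:
  name: auth-migrations
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
spec:
  template:
    spec:
      restartPolicy: Never
      initContainers:
        # Per-service dependency: block until MariaDB accepts connections
        - name: wait-mariadb
          image: mariadb:local
          command: ["sh", "-c"]
          args:
            - >
              until mysql --connect-timeout=2 -h mariadb.default.svc.cluster.local
              -p$MYSQL_ROOT_PASSWORD -e "SELECT 1"; do sleep 2; done
          env:
            - name: MYSQL_ROOT_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: mariadb-secret
                  key: password
      containers:
        # The hook container runs the shared migration tool for this service
        - name: auth-migrations
          image: auth:local
          command: ["/bin/sh", "-c"]
          args:
            - uv run --project "auth" db migrate "auth"
```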
Related issue number
MPT-12683: https://softwareone.atlassian.net/browse/MPT-12683
Special notes
In our Optscale deployment we're facing issues with migrations depending on EtcdLock, mostly because we have a custom deployment of etcd. So, to work around that (and not rely on etcd), we're running the migrations in a Helm hook Job instead.
As I needed to apply this to multiple services, I also extracted the migrate.py scripts into a common place -- tools/db (and made some minor changes to allow them to be run for any service and inside a Helm Job).
Checklist
- [x] The pull request title is a good summary of the changes
- [ ] ~Unit tests for the changes exist~ N/A
- [x] New and existing unit tests pass locally
Looks like I missed a few services:
- [x] risp_worker
- [x] insider_worker
- [x] metroculus_worker
- [x] gemini_worker
- [x] diworker
Let me know if any others are still missing. I'm marking this PR as a Draft until this is finished.
I checked all services that use migrations (both with and without locks). Your list of missed services is correct.
I've tried to start the cluster, but it's failing with:
(.venv) vlad@ops-experimental:~/optscale/optscale-deploy$ ./runkube.py --no-pull --with-elk -o overlay/user_template.yml -- optscale local
21:50:05.120: Connecting to ctd daemon 172.25.1.157:2376
21:50:05.120: Сomparing local images for 172.25.1.157
21:50:10.870: Generating base overlay...
21:50:10.878: Connecting to ctd daemon 172.25.1.157:2376
21:50:13.760: Creating component_versions.yaml file to insert it into configmap
21:50:13.762: Deleting /configured key
21:50:13.765: etcd pod not found
21:50:13.775: Waiting for job deletion...
21:50:13.775: Starting helm chart optscale with name optscale on k8s cluster 172.25.1.157
Error: UPGRADE FAILED: pre-upgrade hooks failed: 1 error occurred:
* timed out waiting for the condition
Traceback (most recent call last):
File "/home/vlad/optscale/optscale-deploy/./runkube.py", line 485, in <module>
acr.start(args.check, args.update_only)
File "/home/vlad/optscale/optscale-deploy/./runkube.py", line 394, in start
subprocess.run(update_cmd.split(), check=True)
File "/usr/lib/python3.12/subprocess.py", line 571, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['helm', 'upgrade', '--install', '-f', 'tmp/base_overlay', '-f', 'overlay/user_template.yml', 'optscale', 'optscale']' returned non-zero exit status 1.
This is some debug info:
(.venv) vlad@ops-experimental:~/optscale/optscale-deploy$ helm list -A --all
helm status optscale -n default
helm history optscale -n default
kubectl get all -n default
kubectl get events -n default --sort-by=.metadata.creationTimestamp | tail -n 100
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
ngingress default 1 2025-09-26 06:45:39.953910084 +0000 UTC deployed nginx-ingress-controller-11.3.17 1.11.1
optscale default 2 2025-10-14 21:50:13.935580134 +0000 UTC failed optscale-0.1.0
NAME: optscale
LAST DEPLOYED: Tue Oct 14 21:50:13 2025
NAMESPACE: default
STATUS: failed
REVISION: 2
TEST SUITE: None
REVISION UPDATED STATUS CHART APP VERSION DESCRIPTION
1 Tue Oct 14 12:51:44 2025 failed optscale-0.1.0 Release "optscale" failed: failed pre-install: 1 error occurred:
* t...
2 Tue Oct 14 21:50:13 2025 failed optscale-0.1.0 Upgrade "optscale" failed: pre-upgrade hooks failed: 1 error occurr...
NAME READY STATUS RESTARTS AGE
pod/auth-migrations-hs8bq 0/1 Init:0/3 0 5m37s
pod/ngingress-nginx-ingress-controller-62zsh 1/1 Running 0 18d
pod/ngingress-nginx-ingress-controller-default-backend-78ccb69cdxsz 1/1 Running 0 18d
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 18d
service/ngingress-nginx-ingress-controller LoadBalancer 10.96.242.0 <pending> 80:29656/TCP,443:25388/TCP 18d
service/ngingress-nginx-ingress-controller-default-backend ClusterIP 10.96.204.63 <none> 80/TCP 18d
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
daemonset.apps/ngingress-nginx-ingress-controller 1 1 1 1 1 <none> 18d
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/ngingress-nginx-ingress-controller-default-backend 1/1 1 1 18d
NAME DESIRED CURRENT READY AGE
replicaset.apps/ngingress-nginx-ingress-controller-default-backend-78ccb69796 1 1 1 18d
NAME STATUS COMPLETIONS DURATION AGE
job.batch/auth-migrations Running 0/1 5m37s 5m37s
LAST SEEN TYPE REASON OBJECT MESSAGE
5m37s Normal Killing pod/auth-migrations-6rq5w Stopping container wait-elk
5m37s Normal Scheduled pod/auth-migrations-hs8bq Successfully assigned default/auth-migrations-hs8bq to ops-experimental
5m37s Normal SuccessfulCreate job/auth-migrations Created pod: auth-migrations-hs8bq
5m36s Normal Pulled pod/auth-migrations-hs8bq Container image "busybox:1.30.0" already present on machine
5m36s Normal Created pod/auth-migrations-hs8bq Created container: wait-elk
5m36s Normal Started pod/auth-migrations-hs8bq Started container wait-elk
(.venv) vlad@ops-experimental:~/optscale/optscale-deploy$ kubectl get pods -n default --field-selector=status.phase!=Running -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
auth-migrations-hs8bq 0/1 Init:0/3 0 9m1s 10.254.0.5 ops-experimental <none> <none>
(.venv) vlad@ops-experimental:~/optscale/optscale-deploy$ kubectl describe pod auth-migrations-hs8bq
Name: auth-migrations-hs8bq
Namespace: default
Priority: 0
Service Account: default
Node: ops-experimental/172.25.1.157
Start Time: Tue, 14 Oct 2025 21:50:15 +0000
Labels: batch.kubernetes.io/controller-uid=898579ed-47ef-4af3-a39f-9c5b81082b6f
batch.kubernetes.io/job-name=auth-migrations
controller-uid=898579ed-47ef-4af3-a39f-9c5b81082b6f
job-name=auth-migrations
Annotations: <none>
Status: Pending
IP: 10.254.0.5
IPs:
IP: 10.254.0.5
Controlled By: Job/auth-migrations
Init Containers:
wait-elk:
Container ID: containerd://1e33925de202b130455f029fabfa30246e0783dff15ac671a2705fd388abbe9e
Image: busybox:1.30.0
Image ID: docker.io/library/busybox@sha256:7964ad52e396a6e045c39b5a44438424ac52e12e4d5a25d94895f2058cb863a0
Port: <none>
Host Port: <none>
Command:
sh
-c
until nc -z elk.default.svc.cluster.local 9200 -w 2; do sleep 2; done && until nc -z elk.default.svc.cluster.local 12201 -w 2; do sleep 2; done
State: Running
Started: Tue, 14 Oct 2025 21:50:16 +0000
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-4mfwf (ro)
wait-etcd-client:
Container ID:
Image: busybox:1.30.0
Image ID:
Port: <none>
Host Port: <none>
Command:
sh
-c
until nc -z etcd-client.default.svc.cluster.local 2379 -w 2; do sleep 2; done
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-4mfwf (ro)
wait-mariadb:
Container ID:
Image: mariadb:local
Image ID:
Port: <none>
Host Port: <none>
Command:
sh
-c
until mysql --connect-timeout=2 -h mariadb.default.svc.cluster.local -p$MYSQL_ROOT_PASSWORD -e "SELECT 1"; do sleep 2; done
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment:
MYSQL_ROOT_PASSWORD: <set to the key 'password' in secret 'mariadb-secret'> Optional: false
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-4mfwf (ro)
Containers:
auth-migrations:
Container ID:
Image: auth:local
Image ID:
Port: <none>
Host Port: <none>
Command:
/bin/sh
-c
Args:
uv run --project "auth" db migrate "auth"
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment:
HX_ETCD_HOST: etcd-client
HX_ETCD_PORT: 2379
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-4mfwf (ro)
Conditions:
Type Status
PodReadyToStartContainers True
Initialized False
Ready False
ContainersReady False
PodScheduled True
Volumes:
kube-api-access-4mfwf:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 10m default-scheduler Successfully assigned default/auth-migrations-hs8bq to ops-experimental
Normal Pulled 10m kubelet Container image "busybox:1.30.0" already present on machine
Normal Created 10m kubelet Created container: wait-elk
Normal Started 10m kubelet Started container wait-elk
It looks to me like the cluster is trying to start, but it is stuck in the wait-elk init container, while elk itself cannot start before the hook completes.
I'm not absolutely sure, but maybe Helm only waits for Jobs that don't have a deletion policy of hook-succeeded or hook-failed.
"helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
so Helm will:
- create the Job
- mark the hook "executed" immediately
- not wait for Job completion
- delete the Job after it succeeds (or before the next upgrade)
This hook will run in the background and Helm won't block waiting for it, but it looks like a disruption is possible here, because the services may start in parallel.
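For reference, this is roughly the annotation block in question, together with the variant I would consider: keeping only before-hook-creation, so that a finished or failed Job is not deleted and can still be inspected (a sketch, not a tested change to the chart):

```yaml
metadata:
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade
    # Current policy removes the Job once it succeeds (and before the next hook run):
    #   "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
    # Keeping only before-hook-creation leaves finished Jobs in place, so a failed
    # migration can still be inspected with kubectl logs / kubectl describe.
    "helm.sh/hook-delete-policy": before-hook-creation
```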
Also a couple of concerns from my side:
- Helm hooks can run DB migrations reliably, but we need to be 100% sure the migrations are idempotent, transactional, retry-safe, and serialized. Helm can't "roll back" the database: if a release fails after a schema change, Helm's rollback won't undo that change.
- In our case several services touch one DB and one service touches many DBs, and currently we don't have a single owner of each database's schema and migrations. It looks like we need locking so that only one migration runs at a time (tool-level locks, or DB advisory locks).
So, in my opinion, in the current case we need to render one migration Job per service, not just one per DB.
After investigating the problem, I would not suggest triggering migrations with Helm hooks. Hooks are synchronous relative to Helm, not to our cluster. IMHO, if we want to follow the best practice of separating migrations from the services, we should ship the migrations as a Kubernetes Job (not a hook) and run it as a batch/v1 Job with a sane backoffLimit (we need to make sure all our changes are idempotent and retriable).
This keeps migrations visible/observable and decoupled from Helm’s lifecycle. It’s also a widely recommended approach.
https://www.linkedin.com/pulse/navigating-database-migrations-kubernetes-helm-hooks-vs-bdour-akram-bpaoe
https://medium.com/@inchararlingappa/handling-migration-with-helm-28b9884c94a6
https://devops.stackexchange.com/questions/15261/helm-long-running-jobs-vs-long-running-hooks
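As an illustration of that alternative (a hedged sketch, not something tested against this chart), the migration Job would be shipped as a regular chart resource with no hook annotations, for example:

```yaml
# Sketch: migrations as a plain batch/v1 Job instead of a Helm hook
# (auth used as an example; names and images are illustrative).
apiVersion: batch/v1
kind: Job
metadata:
  name: auth-migrations
  # No helm.sh/hook annotations: the Job is installed and upgraded like any
  # other resource, stays visible in kubectl, and doesn't block the release.
spec:
  backoffLimit: 3                 # retry a few times; migrations must be idempotent
  ttlSecondsAfterFinished: 3600   # auto-clean the finished Job after an hour
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: auth:local
          command: ["/bin/sh", "-c"]
          args:
            - uv run --project "auth" db migrate "auth"
```

The services themselves would then need to wait for (or tolerate) unfinished migrations, since nothing in Helm blocks them from starting.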
When are Helm hooks okay? For small projects, quick one-off tasks, or install-time sanity checks. (Not our case, with complicated waiters and startup logic.)