
[Bug] service init containers not finishing due to waiting for default keyspace to become ready

Open mickmcgrath13 opened this issue 3 years ago • 19 comments

Describe the bug

Installing the helm chart as follows:

  • elasticsearch disabled
  • mysql disabled
  • prometheus disabled
  • grafana enabled
  • cassandra enabled
  • cassandra persistence disabled

Many of the services' init containers (frontend, history, matching, etc.) report the following when installing:

waiting for default keyspace to become ready

Apparently, the keyspace only becomes ready once this Job has run: https://github.com/temporalio/helm-charts/blob/master/templates/server-job.yaml

The job has the following annotations:

  annotations:
    {{- if .Values.cassandra.enabled }}
    "helm.sh/hook": post-install
    {{- else }}
    "helm.sh/hook": pre-install
    {{- end }}

It seems that the post-install hook doesn't execute until after the pods are ready, but the pods never become ready because they're waiting for this job to run.

The job already has init containers that are waiting for cassandra to come up, so I'm not sure the install hooks are necessary.
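One way to confirm which hook the chart actually renders for the schema Job is to filter the rendered manifests for hook annotations. The following is a minimal sketch, assuming Helm 3 and a POSIX shell; the function name `list_hook_annotations` is hypothetical and not part of the chart.

```shell
# Hypothetical helper (not part of the chart): read rendered manifests on
# stdin and print every line that mentions a helm.sh/hook annotation,
# prefixed with its line number.
list_hook_annotations() {
  grep -n 'helm\.sh/hook'
}
```

It could be used as `helm template temporal . -f values.yaml --namespace temporal | list_hook_annotations`. With `cassandra.enabled: true`, the template quoted above would be expected to show `post-install`.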

To Reproduce

Steps to reproduce the behavior:

  1. create a temporal namespace
  2. use the following values.yaml
server:
  enabled: true
  replicaCount: 1
  config:
    persistence:
      default:
        driver: "cassandra"
        cassandra:
          hosts: ["temporal-cassandra.temporal.svc.cluster.local"]
          # port: 9042
          keyspace: "temporal"
          user: "user"
          password: "password"
          existingSecret: ""
          replicationFactor: 1
          consistency:
            default:
              consistency: "local_quorum"
              serialConsistency: "local_serial"
      visibility:
        driver: "cassandra"

        cassandra:
          hosts: ["temporal-cassandra.temporal.svc.cluster.local"]
          keyspace: "temporal_visibility"
          user: "user"
          password: "password"
          existingSecret: ""
          replicationFactor: 1
          consistency:
            default:
              consistency: "local_quorum"
              serialConsistency: "local_serial"
  frontend:
    replicaCount: 1

  history:
    replicaCount: 1

  matching:
    replicaCount: 1

  worker:
    replicaCount: 1

admintools:
  enabled: true
web:
  enabled: true
  replicaCount: 1
schema:
  setup:
    enabled: true
    backoffLimit: 100
  update:
    enabled: true
    backoffLimit: 100
elasticsearch:
  enabled: false
prometheus:
  enabled: false
grafana:
  enabled: false
cassandra:
  enabled: true
  persistence:
    enabled: false
  config:
    cluster_size: 3
    ports:
      cql: 9042
    num_tokens: 4
    max_heap_size: 512M
    heap_new_size: 128M
    seed_size: 0
  env:
    CASSANDRA_PASSWORD: password

Expected behavior A clear and concise description of what you expected to happen.

Screenshots/Terminal output

The log of the check-cassandra-temporal-schema init container of the frontend service during the deployment yields:

waiting for default keyspace to become ready

Versions (please complete the following information where relevant):

  • OS:
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.3", GitCommit:"1e11e4a2108024935ecfcb2912226cedeafd99df", GitTreeState:"clean", BuildDate:"2020-10-14T12:50:19Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"17+", GitVersion:"v1.17.17-eks-c5067d", GitCommit:"c5067dd1eb324e934de1f5bd4c593b3cddc19d88", GitTreeState:"clean", BuildDate:"2021-03-05T23:39:01Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
  • Temporal Version
    • Chart version 0.9.3
  • are you using Docker or Kubernetes or building Temporal from source?
    • not building from source

mickmcgrath13, May 20 '21 13:05

I'm not sure the annotation issues are still a problem based on the current state of the repo. Could you confirm that you're still seeing an issue?

underrun, Sep 22 '21 13:09

I cannot confirm at this time as I've moved away from the deployment. I suggest this issue be closed, and someone can re-open or create a new issue should the problem present itself again.

mickmcgrath13, Nov 03 '21 13:11

@mickmcgrath13 @underrun We've just been testing out deployment of the helm chart in our EKS cluster and encountered this error. I tried to work around it by adjusting the hook to be pre-install, but that led to another error:


2022-05-18T17:15:02.728Z    INFO    Validating connection to cassandra cluster.    {"logging-call-at": "cqlclient.go:113"}
2022/05/18 17:15:02 error: failed to connect to 172.17.2.24:9042 due to error: Keyspace 'temporal' does not exist
2022-05-18T17:15:02.869Z    ERROR    Connection validation failed.    {"error": "no connections were made when creating the session", "logging-call-at": "cqlclient.go:116"}
2022-05-18T17:15:02.869Z    ERROR    Unable to establish CQL session.    {"error": "no connections were made when creating the session", "logging-call-at": "handler.go:77"}

jessebye, May 18 '22 17:05

I faced the same issue. I had to disable the hooks for the schema setup and update jobs to get it working.

My findings are that, since

  • the hooks are post-install,
  • the schema jobs depend on the server being enabled, and
  • the server requires the schemas in order to start up correctly,

the schema job can never succeed on the initial setup.

I wonder whether these jobs need to be post-install hooks, and whether this condition https://github.com/temporalio/helm-charts/blob/master/templates/server-job.yaml#L1 is necessary for them.

saberistic, Aug 28 '22 23:08

Do you have any update on this ticket? I can still reproduce the issue with the latest helm charts. I tried deleting the namespace, recreating it, and deploying the helm charts again; still no luck :(

kubectl logs -n temporal temporal-history-7549479cdc-zmdkz check-cassandra-temporal-schema
waiting for default keyspace to become ready
waiting for default keyspace to become ready
waiting for default keyspace to become ready
waiting for default keyspace to become ready
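Since the init container appears to loop until the keyspace is visible, it can help to check directly whether Cassandra has the keyspace at all. The following is a minimal sketch, not taken from the chart: the helper name `keyspace_exists` is hypothetical, the pod name `temporal-cassandra-0` is an assumption, and it assumes `cqlsh` inside the pod can connect without extra credentials.

```shell
# Hypothetical helper: succeed if the named keyspace is listed by cqlsh.
# Assumptions: the Cassandra pod is called temporal-cassandra-0 and cqlsh
# in that pod needs no additional credentials; adjust both for your cluster.
keyspace_exists() {
  keyspace=$1
  namespace=${2:-temporal}
  pod=${3:-temporal-cassandra-0}
  # DESCRIBE KEYSPACES prints names separated by whitespace; put one name
  # per line and look for an exact, literal match.
  kubectl exec --namespace "$namespace" "$pod" -- cqlsh -e 'DESCRIBE KEYSPACES' \
    | tr -s '[:space:]' '\n' | grep -qxF "$keyspace"
}
```

For example, `keyspace_exists temporal && echo present || echo missing`. If the keyspace is missing while Cassandra itself is healthy, the schema setup Job has most likely not run yet.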

nkkrishna, Jan 12 '23 12:01

I had to fork the helm chart and disable the install hooks in order to release/deploy the Temporal resources.

https://github.com/BAXUSNFT/temporal-helm-charts

However, you will still need to run the schema migrations and register the default namespace somehow; I did this by converting them to init containers of the worker deployments.

This could be a helpful snippet for a MySQL setup:


initContainers:
        - name: init-cont
          image: busybox:1.31
          command: ['sh', '-c', 'echo -e "Checking for the availability of MySQL Server deployment"; while ! nc -z mysql-headless 3306; do sleep 30; printf "-"; done; echo -e "  >> MySQL DB Server has started";']
        - name: setup-temporal-schema
          image: alpine
          command: ['sh', '-c', 'echo -e "Running Temporal schema migrations"; apk add git; git clone https://github.com/temporalio/temporal.git;  wget -c https://github.com/temporalio/temporal/releases/download/v1.17.6/temporal_1.17.6_linux_amd64.tar.gz -O - | tar -xz; SQL_DATABASE=temporal ./temporal-sql-tool setup-schema -v 0.0 || true; SQL_DATABASE=temporal ./temporal-sql-tool update -schema-dir temporal/schema/mysql/v57/temporal/versioned || true; SQL_DATABASE=temporal_visibility ./temporal-sql-tool setup-schema -v 0.0 || true; SQL_DATABASE=temporal_visibility ./temporal-sql-tool update -schema-dir temporal/schema/mysql/v57/visibility/versioned || true; printf "-"; echo -e "  >> Temporal schema setup and updated";']
          env:
            - name: SQL_PLUGIN
              value: mysql
            - name: SQL_USER
              value: temporal
            - name: SQL_HOST
              value: "mysql-headless"
            - name: SQL_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: env
                  key: SQL_PASSWORD
        - name: check-temporal
          image: busybox:1.31
          command: ['sh', '-c', 'echo -e "Checking for the availability of Temporal Server deployment"; while ! nc -z temporal-frontend 7233; do sleep 30; printf "-"; done; echo -e "  >> Temporal Server has started";']
        - name: check-temporal-namespace
          image: alpine
          command: ['sh', '-c', 'echo -e "Checking for the availability of Temporal Server namespace"; wget -c https://github.com/temporalio/tctl/releases/download/v1.16.2/tctl_1.16.2_linux_amd64.tar.gz -O - | tar -xz; TEMPORAL_CLI_ADDRESS=temporal-frontend:7233 ./tctl namespace register || true; printf "-"; echo -e "  >> Temporal default namespace registered";']

Worth mentioning that we only use this in local development clusters; it is not recommended for anything in production. Check out Temporal Cloud for scenarios where scalability and reliability matter!

saberistic, Jan 12 '23 14:01

https://github.com/BAXUSNFT/temporal-helm-charts

Same here, but is the problem that Cassandra doesn't start, or is the init container configuration wrong?

edmondop, Jan 23 '23 18:01

Oh, I skipped Cassandra for simplicity, so I'm not sure what error you could be getting. We had MySQL running, so that was more convenient.

saberistic, Jan 24 '23 02:01

Also the same problem here! It looks like it will work with other databases than cassandra, as the hook will be "pre-install", but if you want to test it with Cassandra in the simple setup provided by the Helm chart, then you face this issue, as the setup and update jobs are never executed.

froblesmartin, May 29 '23 21:05

Still hitting this issue.

yzgyyang, Aug 15 '23 15:08

Same issue: waiting for default keyspace to become ready

LukaGiorgadze, Sep 19 '23 17:09

Ran into this issue when deploying via Flux. The fix was to disable --wait behavior for install and upgrade actions. Maybe try your helm install without --wait?
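This matches how Helm orders things: with `--wait`, Helm waits for the release's resources to become ready before it runs post-install hooks, so the schema Job never starts while the pods are blocked on it. The following is a minimal sketch of an install without `--wait`, assuming Helm 3 and that the command is run from the chart directory; the wrapper name `install_without_wait` is hypothetical.

```shell
# Hypothetical wrapper: install the chart from the current directory WITHOUT
# --wait, so post-install hooks (such as the schema Job) run right after the
# resources are created instead of after they become ready.
install_without_wait() {
  release=${1:-temporal}
  namespace=${2:-temporal}
  # --timeout also bounds how long Helm waits for hook Jobs to complete.
  helm install "$release" . --namespace "$namespace" --timeout 900s
}
```

Calling `install_without_wait` with no arguments installs a release named `temporal` into the `temporal` namespace. Tools that wrap Helm may turn waiting on by default, in which case the equivalent setting there would need to be disabled.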

davidhiebert, Sep 20 '23 22:09

A plain helm install . works fine, but I get this issue when deploying with Terraform's helm_release.

LukaGiorgadze, Sep 20 '23 23:09

Ran into this issue when deploying via Flux. The fix was to disable --wait behavior for install and upgrade actions. Maybe try your helm install without --wait?

Nah, --wait probably won't help. You will get success but jobs won't start and db schemas won't be available.

The solution is pretty simple; here's my answer on that: https://github.com/temporalio/helm-charts/issues/404#issuecomment-1729671491

LukaGiorgadze, Sep 23 '23 15:09

@LukaGiorgadze I think you misread my comment. My recommendation is to do the helm install WITHOUT the --wait. That was per a comment on a Flux git issue.

While I'm sure your linked comment is relevant in cases where additional values are included, the bug reported in this issue occurs without additional configuration as well. In this event, schema.setup.enabled is still confirmed to be true. Disabling --wait fixed it in my case, and I hope this is helpful to others.

davidhiebert, Sep 25 '23 17:09

I have a version of the helm chart for Cadence that works, and it's based on the observations above. There's a dependency loop between the database setup and the jobs that cannot be resolved as of today.

edmondop, Sep 25 '23 17:09

I have a similar issue where the schema setup never completes: the cron job never passes and errors out complaining that the "keyspace does not exist". Even though the Cassandra DB comes up and works fine, the job that is supposed to do the schema-setup / schema-update never really works.

This is a real problem for me because I'm using the temporalio chart together with a Crossplane/Cassandra chart. Cassandra comes back up fine, but somehow I have to delete the server-job in order to get the chart to deploy. I haven't made any changes besides adding affinity/node selectors to restrict where things are deployed, and that works fine. My Crossplane chart is the one containing the GKE cluster itself, so that part is just a dependency and shouldn't cause any issues.

mo-dayyeh, Mar 12 '24 19:03

I'm facing the same issue as @jessebye. I use an Argo CD application that consists of the Cassandra chart (which comes with the temporal chart) and a Crossplane chart (GKE resources and other things I use for my environment). The chart either gets stuck at the cron job, waiting forever, or, if I delete the job, the chart deploys but schema-setup never happens and I get the error that @jessebye faced.

mo-dayyeh, Mar 12 '24 19:03

As a workaround, I did:

helm template temporal ./charts --timeout 900s --namespace temporal --debug --set grafana.enabled=false --set prometheus.enabled=false > job.yaml

Then I applied the job.yaml manually with kubectl apply -f job.yaml --namespace temporal
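This likely works because `helm template` renders hook resources along with everything else, so applying the output creates the schema Jobs as ordinary resources outside Helm's hook ordering. The two steps can be wrapped in one function. This is a minimal sketch: the helper name `render_and_apply` is hypothetical, while the release name, chart path, and `--set` flags mirror the commands above.

```shell
# Hypothetical helper wrapping the two manual steps: render the chart to a
# file with helm template, then apply that file with kubectl.
render_and_apply() {
  chart_dir=${1:-./charts}
  namespace=${2:-temporal}
  out_file=${3:-job.yaml}
  # Only apply if rendering succeeded.
  helm template temporal "$chart_dir" --namespace "$namespace" \
    --set grafana.enabled=false --set prometheus.enabled=false > "$out_file" \
    && kubectl apply -f "$out_file" --namespace "$namespace"
}
```

Running `render_and_apply` with no arguments reproduces the commands above. Note that resources created this way are not tracked as a Helm release, so upgrades and uninstalls have to be handled with kubectl as well.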

zackerydev, May 01 '24 23:05