
Error when upgrading from 0.13.1 to latest version

Open throrin19 opened this issue 1 year ago • 19 comments

What happened? After the upgrade, all my nodes return this error:

2023-01-20 10:15:50,010 ERROR: create_config_service failed
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/patroni/dcs/kubernetes.py", line 890, in _create_config_service
    if not self._api.create_namespaced_service(self._namespace, body):
  File "/usr/lib/python3/dist-packages/patroni/dcs/kubernetes.py", line 468, in wrapper
    return getattr(self._core_v1_api, func)(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/patroni/dcs/kubernetes.py", line 404, in wrapper
    return self._api_client.call_api(method, path, headers, body, **kwargs)
  File "/usr/lib/python3/dist-packages/patroni/dcs/kubernetes.py", line 373, in call_api
    return self._handle_server_response(response, _preload_content)
  File "/usr/lib/python3/dist-packages/patroni/dcs/kubernetes.py", line 203, in _handle_server_response
    raise k8s_client.rest.ApiException(http_resp=response)
patroni.dcs.kubernetes.K8sClient.rest.ApiException: (403)
Reason: Forbidden
HTTP response headers: HTTPHeaderDict({'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 'X-Kubernetes-Pf-Flowschema-Uid': 'e47d519f-f244-47a4-ad9f-201cbe928c4a', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'f182df09-64d8-40dd-8faa-445be223320d', 'Date': 'Fri, 20 Jan 2023 10:15:50 GMT', 'Content-Length': '300'})
HTTP response body: b'{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"services is forbidden: User \\"system:serviceaccount:opennms:timescaledb\\" cannot create resource \\"services\\" in API group \\"\\" in the namespace \\"stage\\"","reason":"Forbidden","details":{"kind":"services"},"code":403}\n'
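
The 403 means the pods' service account lacks RBAC permission to create Services. This can be confirmed with kubectl impersonation, using the service account and namespace from the message above:

# Check the exact verb Patroni needs; prints "no" when the permission is missing
# (using --as requires impersonation rights, e.g. a cluster-admin context)
kubectl auth can-i create services \
  --namespace stage \
  --as system:serviceaccount:opennms:timescaledb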

Did you expect to see something different?

Yes, no errors in my timescaledb nodes.

How to reproduce it (as minimally and precisely as possible):

  1. Use this timescaledb-ha image: pg12.13-ts2.9.1-latest
  2. Install the chart timescaledb-single in version 0.13.0
  3. Try to upgrade to the latest version or any other 0.1X.X release (see the command sketch after this list)
  4. And voilà! You get the same error on the nodes
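
For reference, a rough sketch of the upgrade path that triggers the error, assuming the standard Timescale chart repository and a hypothetical release name my-tsdb:

helm repo add timescale https://charts.timescale.com
helm repo update

# Install the chart at the old version first
helm install my-tsdb timescale/timescaledb-single --version 0.13.1 -f values.yaml

# Upgrading to a newer chart version is when the nodes start logging the 403
helm upgrade my-tsdb timescale/timescaledb-single --version 0.30.0 -f values.yaml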

Environment

  • Which helm chart and what version are you using?

timescaledb-single, starting at 0.13.1 and trying to upgrade to 0.30.0

  • What is in your values.yaml?

affinity: {}
backup:
  enabled: true
  env: null
  envFrom: null
  jobs:
    - name: full-weekly
      schedule: 12 02 * * 0
      type: full
    - name: incremental-daily
      schedule: 12 02 * * 1-6
      type: incr
  pgBackRest:
    compress-type: lz4
    process-max: 4
    repo1-cipher-type: none
    repo1-retention-diff: 2
    repo1-retention-full: 2
    repo1-s3-endpoint: s3.amazonaws.com
    repo1-s3-region: eu-west-1
    repo1-type: s3
    start-fast: 'y'
  pgBackRest:archive-get: {}
  pgBackRest:archive-push: {}
  resources: {}
bootstrapFromBackup:
  enabled: false
  repo1-path: null
  secretName: pgbackrest-bootstrap
callbacks:
  configMap: null
clusterName: stage
debug:
  execStartPre: null
env:
  - name: TIMESCALEDB_TELEMETRY
    value: 'off'
envFrom: null
fullnameOverride: '{{ .Release.Name }}'
image:
  pullPolicy: Always
  repository: timescale/timescaledb-ha
  tag: pg12.13-ts2.9.1-latest
networkPolicy:
  enabled: false
  ingress: null
  prometheusApp: prometheus
nodeSelector: {}
patroni:
  bootstrap:
    dcs:
      loop_wait: 10
      maximum_lag_on_failover: 33554432
      postgresql:
        parameters:
          archive_command: /etc/timescaledb/scripts/pgbackrest_archive.sh %p
          archive_mode: 'on'
          archive_timeout: 1800s
          autovacuum_analyze_scale_factor: 0.02
          autovacuum_max_workers: 10
          autovacuum_naptime: 5s
          autovacuum_vacuum_cost_limit: 500
          autovacuum_vacuum_scale_factor: 0.05
          hot_standby: 'on'
          log_autovacuum_min_duration: 1min
          log_checkpoints: 'on'
          log_connections: 'on'
          log_disconnections: 'on'
          log_line_prefix: '%t [%p]: [%c-%l] %u@%d,app=%a [%e] '
          log_lock_waits: 'on'
          log_min_duration_statement: 1s
          log_statement: ddl
          max_connections: 100
          max_prepared_transactions: 150
          shared_preload_libraries: timescaledb,pg_stat_statements
          ssl: 'on'
          ssl_cert_file: /etc/certificate/tls.crt
          ssl_key_file: /etc/certificate/tls.key
          tcp_keepalives_idle: 900
          tcp_keepalives_interval: 100
          temp_file_limit: 1GB
          timescaledb.passfile: ../.pgpass
          unix_socket_directories: /var/run/postgresql
          unix_socket_permissions: '0750'
          wal_level: hot_standby
          wal_log_hints: 'on'
          max_locks_per_transaction: 2200
          max_parallel_workers: 14
          max_worker_processes: 32
          timescaledb.max_background_workers: 16
        use_pg_rewind: true
        use_slots: true
      retry_timeout: 10
      ttl: 30
    method: restore_or_initdb
    post_init: /etc/timescaledb/scripts/post_init.sh
    restore_or_initdb:
      command: >
        /etc/timescaledb/scripts/restore_or_initdb.sh --encoding=UTF8
        --locale=C.UTF-8
      keep_existing_recovery_conf: true
  kubernetes:
    role_label: role
    scope_label: cluster-name
    use_endpoints: true
  log:
    level: WARNING
  postgresql:
    authentication:
      replication:
        username: standby
      superuser:
        username: postgres
    basebackup:
      - waldir: /var/lib/postgresql/wal/pg_wal
    callbacks:
      on_reload: /etc/timescaledb/scripts/patroni_callback.sh
      on_restart: /etc/timescaledb/scripts/patroni_callback.sh
      on_role_change: /etc/timescaledb/scripts/patroni_callback.sh
      on_start: /etc/timescaledb/scripts/patroni_callback.sh
      on_stop: /etc/timescaledb/scripts/patroni_callback.sh
    create_replica_methods:
      - pgbackrest
      - basebackup
    listen: 0.0.0.0:5432
    pg_hba:
      - local     all             postgres                              peer
      - local     all             all                                   md5
      - hostssl   all             all                127.0.0.1/32       md5
      - hostssl   all             all                ::1/128            md5
      - hostssl   replication     standby            all                md5
      - hostssl   all             all                all                md5
      - host      all             all                all                md5
    pgbackrest:
      command: /etc/timescaledb/scripts/pgbackrest_restore.sh
      keep_data: true
      no_master: true
      no_params: true
    recovery_conf:
      restore_command: /etc/timescaledb/scripts/pgbackrest_archive_get.sh %f "%p"
    use_unix_socket: true
  restapi:
    listen: 0.0.0.0:8008
persistentVolumes:
  data:
    accessModes:
      - ReadWriteOnce
    annotations: {}
    enabled: true
    mountPath: /var/lib/postgresql
    size: 25Gi
    subPath: ''
  wal:
    accessModes:
      - ReadWriteOnce
    annotations: {}
    enabled: true
    mountPath: /var/lib/postgresql/wal
    size: 5Gi
    storageClass: null
    subPath: ''
pgBouncer:
  config:
    default_pool_size: 12
    max_client_conn: 500
    pool_mode: transaction
    server_reset_query: DISCARD ALL
    client_tls_sslmode: prefer
    ignore_startup_parameters: extra_float_digits
  enabled: true
  pg_hba:
    - local     all postgres                   peer
    - host      all postgres,standby 0.0.0.0/0 reject
    - host      all postgres,standby ::0/0     reject
    - hostssl   all all              0.0.0.0/0 md5
    - hostssl   all all              ::0/0     md5
    - host      all all              0.0.0.0/0 md5
    - host      all all              ::0/0     md5
  port: 6432
  userListSecretName: null
podAnnotations: {}
podLabels: {}
podManagementPolicy: OrderedReady
podMonitor:
  enabled: false
  interval: 10s
  path: /metrics
postInit:
  - configMap:
      name: custom-init-scripts
      optional: true
  - secret:
      name: custom-secret-scripts
      optional: true
prometheus:
  args: []
  enabled: false
  env: null
  image:
    pullPolicy: Always
    repository: quay.io/prometheuscommunity/postgres-exporter
    tag: v0.11.1
  volumeMounts: null
  volumes: null
rbac:
  create: true
readinessProbe:
  enabled: true
  failureThreshold: 6
  initialDelaySeconds: 5
  periodSeconds: 30
  successThreshold: 1
  timeoutSeconds: 5
replicaCount: 3
resources: {}
secrets:
  certificate:
    tls.crt: ''
    tls.key: ''
  certificateSecretName: certificate
  credentials:
    PATRONI_REPLICATION_PASSWORD: ''
    PATRONI_SUPERUSER_PASSWORD: ''
    PATRONI_admin_PASSWORD: ''
  credentialsSecretName: credentials
  pgbackrest:
    PGBACKREST_REPO1_S3_BUCKET: ''
    PGBACKREST_REPO1_S3_ENDPOINT: s3.amazonaws.com
    PGBACKREST_REPO1_S3_KEY: ''
    PGBACKREST_REPO1_S3_KEY_SECRET: ''
    PGBACKREST_REPO1_S3_REGION: ''
  pgbackrestSecretName: pgbackrest
service:
  primary:
    annotations:
      service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: '4000'
      service.beta.kubernetes.io/aws-load-balancer-internal: 'true'
      service.beta.kubernetes.io/aws-load-balancer-type: nlb
    labels: {}
    nodePort: null
    port: 5432
    spec: {}
    type: LoadBalancer
  replica:
    annotations:
      service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: '4000'
      service.beta.kubernetes.io/aws-load-balancer-internal: 'true'
      service.beta.kubernetes.io/aws-load-balancer-type: nlb
    labels: {}
    nodePort: null
    port: 5432
    spec: {}
    type: LoadBalancer
serviceAccount:
  annotations: {}
  create: true
  name: null
sharedMemory:
  useMount: true
timescaledbTune:
  args: {}
  enabled: true
tolerations: []
topologySpreadConstraints: []
version: null
global:
  cattle:
    systemProjectId: p-rg64l
loadBalancer:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-internal: 'true'
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
  enabled: false
replicaLoadBalancer:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-internal: 'true'
    service.beta.kubernetes.io/aws-load-balancer-type: nlb

  • Kubernetes version information:

    kubectl version

    Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.8", GitCommit:"4a3b558c52eb6995b3c5c1db5e54111bd0645a64", GitTreeState:"clean", BuildDate:"2021-12-15T14:52:11Z", GoVersion:"go1.16.12", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"22+", GitVersion:"v1.22.15-eks-fb459a0", GitCommit:"be82fa628e60d024275efaa239bfe53a9119c2d9", GitTreeState:"clean", BuildDate:"2022-10-24T20:33:23Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}
    
  • Kubernetes cluster kind:

    Rancher

Anything else we need to know?: I saw the same error here: https://github.com/timescale/helm-charts/issues/405, but that thread is marked as resolved and no real solution is proposed in it.

I see the problem is with Patroni, but I don't know if the timescaledb-ha images based on PostgreSQL 12 are patched.
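
One way to check which Patroni version a given timescaledb-ha tag actually ships, assuming the patroni executable is on the image's PATH (the traceback above shows it installed as a Debian package):

# Override the entrypoint and print the Patroni version bundled in the tag
docker run --rm --entrypoint patroni \
  timescale/timescaledb-ha:pg12.13-ts2.9.1-latest --version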

throrin19 avatar Jan 24 '23 07:01 throrin19

This error is a Patroni error that is fixed here. That Patroni fix was merged into the timescaledb-ha image version timescale/timescaledb-ha:pg13.9-ts2.9.2-p0. You will need to update your timescaledb-ha image for that message to go away.
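
For anyone following along, switching the image in the values.yaml posted above would look roughly like this; the tag below is the patched one mentioned in this comment, and should be adjusted to a patched tag matching your PostgreSQL major version:

image:
  pullPolicy: Always
  repository: timescale/timescaledb-ha
  tag: pg13.9-ts2.9.2-p0  # tag said to carry the merged Patroni fix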

nhudson avatar Jan 24 '23 13:01 nhudson

@nhudson thanks, I will test that. Does this image have a PostgreSQL 12 variant? (setting the correct parameter in the values.yaml)

throrin19 avatar Jan 24 '23 15:01 throrin19

The same error here (with the newest version of the chart, 0.33.1). I did not want to use another image, so I had to create a Role and RoleBinding with these values:

cat <<EOF | kubectl apply --context <your_context> -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: <your_namespace>
  name: timescaledb-patch-to-remove
rules:
- apiGroups: [""] # "" indicates the core API group
  resources: ["services"]
  verbs: ["create"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: timescaledb-patch-binding
  namespace: <your_namespace>
subjects:
- kind: ServiceAccount
  name: <your_name>
  namespace: <your_namespace>
roleRef:
  kind: Role
  name: timescaledb-patch-to-remove
  apiGroup: rbac.authorization.k8s.io
EOF
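
If in doubt about the <your_name> subject above, the service account to bind is the one the TimescaleDB pods run as; a hypothetical lookup (the StatefulSet/release name is a placeholder) followed by a re-check of the permission:

# Find the service account referenced by the chart's StatefulSet
kubectl get statefulset <your_release> -n <your_namespace> \
  -o jsonpath='{.spec.template.spec.serviceAccountName}'

# After applying the Role and RoleBinding, this should now answer "yes"
kubectl auth can-i create services -n <your_namespace> \
  --as system:serviceaccount:<your_namespace>:<your_name>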

tomislater avatar Jan 27 '23 09:01 tomislater

I had the same issue with

repository: timescale/timescaledb-ha
tag: pg12.13-ts2.9.2-p1

It was fixed by @tomislater's patch.

jorgror avatar Jan 31 '23 11:01 jorgror

@nhudson It's not normal to have to use a specific image carrying a patch that isn't deployed in all new timescaledb-ha tags. Is there a date, or any indication of when and whether this will be rolled out to all future timescaledb-ha tags?

throrin19 avatar Feb 03 '23 07:02 throrin19

Same error with pg14.6-ts2.9.2-p0 and pg14.6-ts2.9.1-patroni-static-primary-p2. Is there any pg14.6 image that fixes this problem?

dgsardina avatar Feb 07 '23 16:02 dgsardina

Any news on the Patroni patch being published in all new timescaledb-ha images, so we don't have to apply the RBAC patch manually or use a very specific timescaledb-ha image? @nhudson

throrin19 avatar Feb 13 '23 09:02 throrin19

Same error with timescaledb-ha:pg12.14-ts2.9.3-latest published 4 days ago 😢

throrin19 avatar Feb 13 '23 09:02 throrin19

@tomislater's patch fixes the problem, but it's not a real solution: normally, with a Helm chart, we shouldn't have to apply patches like this.

Would it be possible to add this directly to the chart to fix the problem in the next version? @feikesteenbergen

throrin19 avatar Feb 13 '23 10:02 throrin19

@throrin19 looks like we are waiting on the Patroni images to become available in the Debian repo. https://github.com/timescale/timescaledb-docker-ha/pull/343

nhudson avatar Feb 13 '23 15:02 nhudson

Is there any update on the Patroni version shipped in the timescaledb-ha image? It has been a month.

This problem seems to be blocking for most current and future users of this chart, yet no solution seems to be coming quickly (except the Role tweak, which has to be applied manually and is not part of the chart itself).

throrin19 avatar Mar 13 '23 09:03 throrin19

I tried with the latest image that is currently available (pg14.6-ts2.9.3-patroni-dcs-failsafe-p0) and the problem is resolved!

GeorgFleig avatar Mar 21 '23 10:03 GeorgFleig

@GeorgFleig: That's not a final image, it's a specific tag with the fix. OK, it resolves the problem, but it's not the solution. The solution is that Patroni works fine on Kubernetes without a patch when using the default tags of this chart.

throrin19 avatar Mar 22 '23 10:03 throrin19

Still getting this error in the timescaledb pod logs.

Any news on when the container images that have the patroni fix will be referenced in the helm chart?

pfrydids avatar Sep 12 '23 12:09 pfrydids

@pfrydids You can forget about this; the project no longer seems to be maintained...

throrin19 avatar Sep 13 '23 06:09 throrin19

@throrin19 oh no! A real shame as it seems to be working quite well otherwise

pfrydids avatar Sep 13 '23 09:09 pfrydids

@pfrydids I use it in my organisation, but nobody has had a response to their issues for about 6 months.

throrin19 avatar Sep 15 '23 08:09 throrin19

Well, that's a shame. I was integrating this into my cluster and saw this issue occurring, but if the project is dead I'll have to look for an alternative. :(

DreamwareDevelopment avatar Sep 29 '23 22:09 DreamwareDevelopment

@DreamwareDevelopment you can fix that with the Role and RoleBinding patch published by @tomislater: https://github.com/timescale/helm-charts/issues/554#issuecomment-1406256190

throrin19 avatar Oct 02 '23 08:10 throrin19