
Persistent Volume Mount failure on installation

abhinavDhulipala opened this issue 3 years ago · 0 comments

What did you do? I ran the command from the quick start guide (helm install --wait tobs1 timescale/tobs) after following all the setup steps. During installation I get the following timeout error:

Error: INSTALLATION FAILED: timed out waiting for the condition

Did you expect to see something different? I expected a successful install.

Environment: All my machines are running Ubuntu 20.04.

lsb_release -a
LSB Version:    core-11.1.0ubuntu2-noarch:security-11.1.0ubuntu2-noarch
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.4 LTS
Release:        20.04
Codename:       focal
  • tobs version:

    12.0.1

  • Kubernetes version information:

    1.24.3

  • Kubernetes cluster kind: a kubeadm cluster coordinating a set of local machines, with plenty of storage on each node. I ran kubeadm init on my master node, then joined the other nodes with the suggested join command; every node runs the systemd configuration. I used the Calico networking plugin with default settings, and installed cert-manager (also with default settings) with kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.9.1/cert-manager.yaml. I don't think my problem is related to networking, but just in case, my nodes are listed below (a rough sketch of the setup commands follows the node list):

Ready    <none>          22h   v1.24.3
Ready    control-plane   22h   v1.24.3
Ready    <none>          22h   v1.24.3
Ready    <none>          21h   v1.24.3
Ready    <none>          21h   v1.24.3
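
Roughly, the cluster bring-up looked like this (a sketch from memory; the pod CIDR, Calico manifest, and join token/hash are placeholders rather than the exact values I used):

# On the control-plane node:
$ sudo kubeadm init --pod-network-cidr=<pod-cidr>

# Calico with its default manifest:
$ kubectl apply -f <calico-manifest>.yaml

# On each worker, the join command printed by kubeadm init:
$ sudo kubeadm join <control-plane-ip>:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>

# cert-manager with default settings:
$ kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.9.1/cert-manager.yaml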
  • tobs Logs:

Installation fails with this error:

$ helm install --wait tobs1 timescale/tobs
W0813 16:59:17.957535  625722 warnings.go:70] spec.template.spec.containers[0].env[2].name: duplicate name "TOBS_TELEMETRY_INSTALLED_BY"
W0813 16:59:17.957547  625722 warnings.go:70] spec.template.spec.containers[0].env[3].name: duplicate name "TOBS_TELEMETRY_VERSION"
Error: INSTALLATION FAILED: timed out waiting for the condition

We can see the failing pods below:

$ kubectl get pods -n observability
alertmanager-tobs-kube-prometheus-alertmanager-0             2/2     Running            0             16m
alertmanager-tobs-kube-prometheus-alertmanager-1             2/2     Running            0             16m
alertmanager-tobs-kube-prometheus-alertmanager-2             2/2     Running            0             16m
opentelemetry-operator-controller-manager-74cc58dd44-frqmt   2/2     Running            0             16m
prometheus-tobs-kube-prometheus-prometheus-0                 0/2     Pending            0             16m
prometheus-tobs-kube-prometheus-prometheus-1                 0/2     Pending            0             16m
tobs-kube-prometheus-operator-76797c6f57-rnssq               1/1     Running            0             16m
tobs-opentelemetry-collector-776c8494f4-fx6x7                1/1     Running            0             56m
tobs1-connection-secret-f96hl                                0/1     Completed          0             16m
tobs1-grafana-874d94ff9-k2n4w                                0/3     Pending            0             16m
tobs1-kube-state-metrics-868cf9b46b-2mkcg                    1/1     Running            0             16m
tobs1-opentelemetry-collector-76b46c66b4-hbkvg               1/1     Running            0             28m
tobs1-prometheus-node-exporter-2n2lm                         1/1     Running            0             16m
tobs1-prometheus-node-exporter-6cgvd                         1/1     Running            0             16m
tobs1-prometheus-node-exporter-8cqwx                         1/1     Running            0             16m
tobs1-prometheus-node-exporter-8wxjm                         1/1     Running            0             16m
tobs1-prometheus-node-exporter-r9564                         1/1     Running            0             16m
tobs1-promscale-799cb7549f-479tz                             0/1     CrashLoopBackOff   8 (45s ago)   16m
tobs1-timescaledb-0                                          0/2     Pending            0             16m

Looking within the pods themselves, they are failing due to storage-related faults. Promscale is in a crash loop and looks the worst off, but Prometheus and TimescaleDB have similar problems: they stay Pending forever with the error "pod has unbound immediate PersistentVolumeClaims". Do I need a custom PersistentVolume definition for an on-prem cluster? What am I missing here?
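For what it's worth, I assume the next step is to inspect the claims and storage classes directly, e.g.:

$ kubectl get pvc -n observability
$ kubectl get storageclass
$ kubectl describe pvc storage-volume-tobs1-timescaledb-0 -n observability

The full kubectl describe output for the failing pods is below.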

kubectl describe pod tobs1-promscale-799cb7549f-479tz 
Name:         tobs1-promscale-799cb7549f-479tz
Namespace:    observability
Priority:     0
Node:         [my node]
Start Time:   Sat, 13 Aug 2022 16:59:18 -0700
Labels:       app=tobs1-promscale
              app.kubernetes.io/component=connector
              app.kubernetes.io/name=tobs1-promscale
              app.kubernetes.io/version=0.13.0
              chart=promscale-0.13.0
              heritage=Helm
              pod-template-hash=799cb7549f
              release=tobs1
Annotations:  checksum/config: a1171a41877cc559fe699480d7c9bc731055fde6ccbe0b47e5c9a279cfe38962
              checksum/connection: d610b61926215912316a5f9c07435dd69b06894ed8e640bbd7c2bc21c51a16fa
              cni.projectcalico.org/containerID: f21a351996716188dcc01b730da3cb9a694bc14a988ea85116c1f145e0ee66d3
              cni.projectcalico.org/podIP: 172.16.121.21/32
              cni.projectcalico.org/podIPs: 172.16.121.21/32
              prometheus.io/path: /metrics
              prometheus.io/port: 9201
              prometheus.io/scrape: false
Status:       Running
IP:           172.16.121.21
IPs:
  IP:           172.16.121.21
Controlled By:  ReplicaSet/tobs1-promscale-799cb7549f
Containers:
  promscale:
    Container ID:  containerd://0e34febd580edd52ad35fe52b4620b6421431ef314eaf9a6b27c4833c3d3f55f
    Image:         timescale/promscale:0.13.0
    Image ID:      docker.io/timescale/promscale@sha256:e23fc4cae99fce8770daece006781232478bb6c35d5e671d3d851a237a37980c
    Ports:         9201/TCP, 9202/TCP
    Host Ports:    0/TCP, 0/TCP
    Args:
      -config=/etc/promscale/config.yaml
      --metrics.high-availability=true
    State:          Running
      Started:      Sat, 13 Aug 2022 17:20:24 -0700
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Sat, 13 Aug 2022 17:15:17 -0700
      Finished:     Sat, 13 Aug 2022 17:15:18 -0700
    Ready:          False
    Restart Count:  9
    Requests:
      cpu:      30m
      memory:   500Mi
    Readiness:  http-get http://:metrics-port/healthz delay=0s timeout=15s period=15s #success=1 #failure=3
    Environment Variables from:
      tobs1-promscale  Secret  Optional: false
    Environment:
      TOBS_TELEMETRY_INSTALLED_BY:         promscale
      TOBS_TELEMETRY_VERSION:              0.13.0
      TOBS_TELEMETRY_INSTALLED_BY:         helm
      TOBS_TELEMETRY_VERSION:              0.13.0
      TOBS_TELEMETRY_TRACING_ENABLED:      true
      TOBS_TELEMETRY_TIMESCALEDB_ENABLED:  true
    Mounts:
      /etc/promscale/ from configs (rw)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  configs:
    Type:        ConfigMap (a volume populated by a ConfigMap)
    Name:        tobs1-promscale
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                 From               Message
  ----     ------     ----                ----               -------
  Normal   Scheduled  21m                 default-scheduler  Successfully assigned observability/tobs1-promscale-799cb7549f-479tz to [my-node]
  Normal   Pulled     20m (x4 over 21m)   kubelet            Container image "timescale/promscale:0.13.0" already present on machine
  Normal   Created    20m (x4 over 21m)   kubelet            Created container promscale
  Normal   Started    20m (x4 over 21m)   kubelet            Started container promscale
  Warning  Unhealthy  20m (x6 over 21m)   kubelet            Readiness probe failed: Get "http://172.16.121.21:9201/healthz": dial tcp 172.16.121.21:9201: connect: connection refused
  Warning  BackOff    57s (x97 over 21m)  kubelet            Back-off restarting failed container
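
I haven't attached the promscale container logs here; I assume the exit message from the last crash can be pulled with something like the command below, and my guess is that promscale simply can't reach TimescaleDB, which is itself stuck Pending:

$ kubectl logs -n observability tobs1-promscale-799cb7549f-479tz --previous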

For completeness, here are the TimescaleDB failures. Prometheus and Grafana fail with identical errors.

kubectl describe pod tobs1-timescaledb-0
Name:           tobs1-timescaledb-0
Namespace:      observability
Priority:       0
Node:           <none>
Labels:         app=tobs1-timescaledb
                cluster-name=tobs1
                controller-revision-hash=tobs1-timescaledb-6865b75968
                release=tobs1
                statefulset.kubernetes.io/pod-name=tobs1-timescaledb-0
Annotations:    <none>
Status:         Pending
IP:             
IPs:            <none>
Controlled By:  StatefulSet/tobs1-timescaledb
Init Containers:
  tstune:
    Image:      timescale/timescaledb-ha:pg14.4-ts2.7.2-p0
    Port:       <none>
    Host Port:  <none>
    Command:
      sh
      -c
      set -e
      [ $CPUS -eq 0 ]   && CPUS="${RESOURCES_CPU_LIMIT}"
      [ $MEMORY -eq 0 ] && MEMORY="${RESOURCES_MEMORY_LIMIT}"
      
      if [ -f "${PGDATA}/postgresql.base.conf" ] && ! grep "${INCLUDE_DIRECTIVE}" postgresql.base.conf -qxF; then
        echo "${INCLUDE_DIRECTIVE}" >> "${PGDATA}/postgresql.base.conf"
      fi
      
      touch "${TSTUNE_FILE}"
      timescaledb-tune -quiet -pg-version 11 -conf-path "${TSTUNE_FILE}" -cpus "${CPUS}" -memory "${MEMORY}MB" \
         -yes
      
      # If there is a dedicated WAL Volume, we want to set max_wal_size to 60% of that volume
      # If there isn't a dedicated WAL Volume, we set it to 20% of the data volume
      if [ "${RESOURCES_WAL_VOLUME}" = "0" ]; then
        WALMAX="${RESOURCES_DATA_VOLUME}"
        WALPERCENT=20
      else
        WALMAX="${RESOURCES_WAL_VOLUME}"
        WALPERCENT=60
      fi
      
      WALMAX=$(numfmt --from=auto ${WALMAX})
      
      # Wal segments are 16MB in size, in this way we get a "nice" number of the nearest
      # 16MB
      WALMAX=$(( $WALMAX / 100 * $WALPERCENT / 16777216 * 16 ))
      WALMIN=$(( $WALMAX / 2 ))
      
      echo "max_wal_size=${WALMAX}MB" >> "${TSTUNE_FILE}"
      echo "min_wal_size=${WALMIN}MB" >> "${TSTUNE_FILE}"
      
    Requests:
      cpu:     100m
      memory:  2Gi
    Environment:
      TSTUNE_FILE:             /var/run/postgresql/timescaledb.conf
      RESOURCES_WAL_VOLUME:    20Gi
      RESOURCES_DATA_VOLUME:   150Gi
      INCLUDE_DIRECTIVE:       include_if_exists = '/var/run/postgresql/timescaledb.conf'
      CPUS:                    1 (requests.cpu)
      MEMORY:                  2048 (requests.memory)
      RESOURCES_CPU_LIMIT:     node allocatable (limits.cpu)
      RESOURCES_MEMORY_LIMIT:  node allocatable (limits.memory)
    Mounts:
      /var/run/postgresql from socket-directory (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-q8chn (ro)
Containers:
  timescaledb:
    Image:       timescale/timescaledb-ha:pg14.4-ts2.7.2-p0
    Ports:       8008/TCP, 5432/TCP
    Host Ports:  0/TCP, 0/TCP
    Command:
      /bin/bash
      -c
      
      install -o postgres -g postgres -d -m 0700 "/var/lib/postgresql/data" "/var/lib/postgresql/wal/pg_wal" || exit 1
      TABLESPACES=""
      for tablespace in ; do
        install -o postgres -g postgres -d -m 0700 "/var/lib/postgresql/tablespaces/${tablespace}/data"
      done
      
      # Environment variables can be read by regular users of PostgreSQL. Especially in a Kubernetes
      # context it is likely that some secrets are part of those variables.
      # To ensure we expose as little as possible to the underlying PostgreSQL instance, we have a list
      # of allowed environment variable patterns to retain.
      #
      # We need the KUBERNETES_ environment variables for the native Kubernetes support of Patroni to work.
      #
      # NB: Patroni will remove all PATRONI_.* environment variables before starting PostgreSQL
      
      # We store the current environment, as initscripts, callbacks, archive_commands etc. may require
      # to have the environment available to them
      set -o posix
      export -p > "${HOME}/.pod_environment"
      export -p | grep PGBACKREST > "${HOME}/.pgbackrest_environment"
      
      for UNKNOWNVAR in $(env | awk -F '=' '!/^(PATRONI_.*|HOME|PGDATA|PGHOST|LC_.*|LANG|PATH|KUBERNETES_SERVICE_.*|AWS_ROLE_ARN|AWS_WEB_IDENTITY_TOKEN_FILE)=/ {print $1}')
      do
          unset "${UNKNOWNVAR}"
      done
      
      touch /var/run/postgresql/timescaledb.conf
      touch /var/run/postgresql/wal_status
      
      echo "*:*:*:postgres:${PATRONI_SUPERUSER_PASSWORD}" >> ${HOME}/.pgpass
      chmod 0600 ${HOME}/.pgpass
      
      export PATRONI_POSTGRESQL_PGPASS="${HOME}/.pgpass.patroni"
      
      exec patroni /etc/timescaledb/patroni.yaml
      
    Requests:
      cpu:      100m
      memory:   2Gi
    Readiness:  exec [pg_isready -h /var/run/postgresql] delay=5s timeout=5s period=30s #success=1 #failure=6
    Environment Variables from:
      tobs1-credentials  Secret  Optional: false
      tobs1-pgbackrest   Secret  Optional: true
    Environment:
      PATRONI_admin_OPTIONS:               createrole,createdb
      PATRONI_REPLICATION_USERNAME:        standby
      PATRONI_KUBERNETES_POD_IP:            (v1:status.podIP)
      PATRONI_POSTGRESQL_CONNECT_ADDRESS:  $(PATRONI_KUBERNETES_POD_IP):5432
      PATRONI_RESTAPI_CONNECT_ADDRESS:     $(PATRONI_KUBERNETES_POD_IP):8008
      PATRONI_KUBERNETES_PORTS:            [{"name": "postgresql", "port": 5432}]
      PATRONI_NAME:                        tobs1-timescaledb-0 (v1:metadata.name)
      PATRONI_POSTGRESQL_DATA_DIR:         /var/lib/postgresql/data
      PATRONI_KUBERNETES_NAMESPACE:        observability
      PATRONI_KUBERNETES_LABELS:           {app: tobs1-timescaledb, cluster-name: tobs1, release: tobs1}
      PATRONI_SCOPE:                       tobs1
      PGBACKREST_CONFIG:                   /etc/pgbackrest/pgbackrest.conf
      PGDATA:                              $(PATRONI_POSTGRESQL_DATA_DIR)
      PGHOST:                              /var/run/postgresql
      BOOTSTRAP_FROM_BACKUP:               0
    Mounts:
      /etc/certificate from certificate (ro)
      /etc/pgbackrest from pgbackrest (ro)
      /etc/pgbackrest/bootstrap from pgbackrest-bootstrap (ro)
      /etc/timescaledb/patroni.yaml from patroni-config (ro,path="patroni.yaml")
      /etc/timescaledb/post_init.d from post-init (ro)
      /etc/timescaledb/scripts from timescaledb-scripts (ro)
      /var/lib/postgresql from storage-volume (rw)
      /var/lib/postgresql/wal from wal-volume (rw)
      /var/run/postgresql from socket-directory (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-q8chn (ro)
  postgres-exporter:
    Image:      quay.io/prometheuscommunity/postgres-exporter:v0.11.0
    Port:       9187/TCP
    Host Port:  0/TCP
    Environment:
      DATA_SOURCE_NAME:             host=/var/run/postgresql user=postgres application_name=postgres_exporter
      PG_EXPORTER_CONSTANT_LABELS:  release=tobs1,namespace=observability
    Mounts:
      /var/run/postgresql from socket-directory (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-q8chn (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  storage-volume:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  storage-volume-tobs1-timescaledb-0
    ReadOnly:   false
  wal-volume:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  wal-volume-tobs1-timescaledb-0
    ReadOnly:   false
  socket-directory:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  patroni-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      tobs1-timescaledb-patroni
    Optional:  false
  timescaledb-scripts:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      tobs1-timescaledb-scripts
    Optional:  false
  post-init:
    Type:                Projected (a volume that contains injected data from multiple sources)
    ConfigMapName:       custom-init-scripts
    ConfigMapOptional:   0xc0007c2269
    SecretName:          custom-secret-scripts
    SecretOptionalName:  0xc0007c226a
  pgbouncer:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      tobs1-timescaledb-pgbouncer
    Optional:  true
  pgbackrest:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      tobs1-timescaledb-pgbackrest
    Optional:  true
  certificate:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  tobs1-certificate
    Optional:    false
  pgbackrest-bootstrap:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  pgbackrest-bootstrap
    Optional:    true
  kube-api-access-q8chn:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  4m42s (x5 over 24m)  default-scheduler  0/5 nodes are available: 5 pod has unbound immediate PersistentVolumeClaims. preemption: 0/5 nodes are available: 5 Preemption is not helpful for scheduling.
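
If a manually defined volume is what's missing, I assume it would look roughly like the sketch below (the StorageClass name, PV name, node name, and host path are placeholders I made up; the 150Gi size is taken from the data volume in the pod spec above, and presumably the chart's storage settings would also have to point at this class). I haven't applied this yet, since I'd rather first understand whether tobs expects a default StorageClass or a dynamic provisioner on bare metal:

$ kubectl apply -f - <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage                # placeholder name
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: tobs-timescaledb-data-0      # placeholder name
spec:
  capacity:
    storage: 150Gi                   # matches RESOURCES_DATA_VOLUME in the pod spec
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /mnt/disks/tobs-data       # placeholder; directory must already exist on the node
  nodeAffinity:                      # required for local volumes
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - <worker-node-name> # placeholder
EOF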

Anything else we need to know?: I'm new to Kubernetes, so this is probably not a bug but rather a misunderstanding of PersistentVolume storage and how I set up my nodes (VMs). I would love to contribute to the docs for similarly lost people, and I hope I'm not the only one confused by this. Any help would be greatly appreciated.

abhinavDhulipala · Aug 14 '22 00:08