tobs
Persistent Volume Mount failure on installation
What did you do?
I ran the commands from the quick start guide (`helm install --wait tobs1 timescale/tobs`) after following all the setup steps. During installation I get the following timeout error:
Error: INSTALLATION FAILED: timed out waiting for the condition
Did you expect to see something different? I expected a successful install.
Environment: All my machines are running Ubuntu 20.04.
lsb_release -a
LSB Version: core-11.1.0ubuntu2-noarch:security-11.1.0ubuntu2-noarch
Distributor ID: Ubuntu
Description: Ubuntu 20.04.4 LTS
Release: 20.04
Codename: focal
- tobs version: 12.0.1
- Kubernetes version: 1.24.3
- Kubernetes cluster kind: a kubeadm cluster coordinating a host of local machines, with plenty of storage on each node. I used kubeadm init on my master node, then joined the workers with the suggested command. Every node is running the systemd configuration. I used the Calico networking plugin with default settings. I installed cert-manager, also with default settings, using this command:
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.9.1/cert-manager.yaml
I don't think my problem is related to networking, but just in case, my nodes are below:
Ready <none> 22h v1.24.3
Ready control-plane 22h v1.24.3
Ready <none> 22h v1.24.3
Ready <none> 21h v1.24.3
Ready <none> 21h v1.24.3
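The cluster bootstrap described above, roughly as commands (a reconstruction on my part; the CIDR, Calico manifest version, and join token are placeholders, not what I actually typed):

```shell
# Reconstructed sketch of the setup steps; flags and versions are assumptions.
sudo kubeadm init --pod-network-cidr=192.168.0.0/16

# Install Calico with default settings (manifest version is an assumption):
kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.24.1/manifests/calico.yaml

# On each worker node, run the join command that kubeadm init printed:
sudo kubeadm join <control-plane-ip>:6443 --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash>
```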
- tobs Logs:
Installation fails with this error:
$ helm install --wait tobs1 timescale/tobs
W0813 16:59:17.957535 625722 warnings.go:70] spec.template.spec.containers[0].env[2].name: duplicate name "TOBS_TELEMETRY_INSTALLED_BY"
W0813 16:59:17.957547 625722 warnings.go:70] spec.template.spec.containers[0].env[3].name: duplicate name "TOBS_TELEMETRY_VERSION"
Error: INSTALLATION FAILED: timed out waiting for the condition
The failing pods can be seen below:
$ kubectl get pods -n observability
alertmanager-tobs-kube-prometheus-alertmanager-0 2/2 Running 0 16m
alertmanager-tobs-kube-prometheus-alertmanager-1 2/2 Running 0 16m
alertmanager-tobs-kube-prometheus-alertmanager-2 2/2 Running 0 16m
opentelemetry-operator-controller-manager-74cc58dd44-frqmt 2/2 Running 0 16m
prometheus-tobs-kube-prometheus-prometheus-0 0/2 Pending 0 16m
prometheus-tobs-kube-prometheus-prometheus-1 0/2 Pending 0 16m
tobs-kube-prometheus-operator-76797c6f57-rnssq 1/1 Running 0 16m
tobs-opentelemetry-collector-776c8494f4-fx6x7 1/1 Running 0 56m
tobs1-connection-secret-f96hl 0/1 Completed 0 16m
tobs1-grafana-874d94ff9-k2n4w 0/3 Pending 0 16m
tobs1-kube-state-metrics-868cf9b46b-2mkcg 1/1 Running 0 16m
tobs1-opentelemetry-collector-76b46c66b4-hbkvg 1/1 Running 0 28m
tobs1-prometheus-node-exporter-2n2lm 1/1 Running 0 16m
tobs1-prometheus-node-exporter-6cgvd 1/1 Running 0 16m
tobs1-prometheus-node-exporter-8cqwx 1/1 Running 0 16m
tobs1-prometheus-node-exporter-8wxjm 1/1 Running 0 16m
tobs1-prometheus-node-exporter-r9564 1/1 Running 0 16m
tobs1-promscale-799cb7549f-479tz 0/1 CrashLoopBackOff 8 (45s ago) 16m
tobs1-timescaledb-0 0/2 Pending 0 16m
Looking within the pods themselves, they are failing due to storage-related faults. Promscale is in a crash loop and looks the worst off, but both Prometheus and TimescaleDB have similar problems: they stay Pending forever with the error `pod has unbound immediate PersistentVolumeClaims`. Do I need a custom PersistentVolume definition for an on-prem cluster? What am I missing here?
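For reference, the stuck claims can be inspected directly; something like the following should show them Pending (a sketch of standard kubectl usage, output not pasted here):

```shell
# List the claims the Pending pods reference; STATUS will be Pending
# and the STORAGECLASS column empty if no class was requested.
kubectl get pvc -n observability

# Check whether the cluster has any StorageClass marked "(default)":
kubectl get storageclass

# Inspect one claim's events for the scheduler/provisioner message:
kubectl describe pvc storage-volume-tobs1-timescaledb-0 -n observability
```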
kubectl describe pod tobs1-promscale-799cb7549f-479tz
Name: tobs1-promscale-799cb7549f-479tz
Namespace: observability
Priority: 0
Node: [my node]
Start Time: Sat, 13 Aug 2022 16:59:18 -0700
Labels: app=tobs1-promscale
app.kubernetes.io/component=connector
app.kubernetes.io/name=tobs1-promscale
app.kubernetes.io/version=0.13.0
chart=promscale-0.13.0
heritage=Helm
pod-template-hash=799cb7549f
release=tobs1
Annotations: checksum/config: a1171a41877cc559fe699480d7c9bc731055fde6ccbe0b47e5c9a279cfe38962
checksum/connection: d610b61926215912316a5f9c07435dd69b06894ed8e640bbd7c2bc21c51a16fa
cni.projectcalico.org/containerID: f21a351996716188dcc01b730da3cb9a694bc14a988ea85116c1f145e0ee66d3
cni.projectcalico.org/podIP: 172.16.121.21/32
cni.projectcalico.org/podIPs: 172.16.121.21/32
prometheus.io/path: /metrics
prometheus.io/port: 9201
prometheus.io/scrape: false
Status: Running
IP: 172.16.121.21
IPs:
IP: 172.16.121.21
Controlled By: ReplicaSet/tobs1-promscale-799cb7549f
Containers:
promscale:
Container ID: containerd://0e34febd580edd52ad35fe52b4620b6421431ef314eaf9a6b27c4833c3d3f55f
Image: timescale/promscale:0.13.0
Image ID: docker.io/timescale/promscale@sha256:e23fc4cae99fce8770daece006781232478bb6c35d5e671d3d851a237a37980c
Ports: 9201/TCP, 9202/TCP
Host Ports: 0/TCP, 0/TCP
Args:
-config=/etc/promscale/config.yaml
--metrics.high-availability=true
State: Running
Started: Sat, 13 Aug 2022 17:20:24 -0700
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Sat, 13 Aug 2022 17:15:17 -0700
Finished: Sat, 13 Aug 2022 17:15:18 -0700
Ready: False
Restart Count: 9
Requests:
cpu: 30m
memory: 500Mi
Readiness: http-get http://:metrics-port/healthz delay=0s timeout=15s period=15s #success=1 #failure=3
Environment Variables from:
tobs1-promscale Secret Optional: false
Environment:
TOBS_TELEMETRY_INSTALLED_BY: promscale
TOBS_TELEMETRY_VERSION: 0.13.0
TOBS_TELEMETRY_INSTALLED_BY: helm
TOBS_TELEMETRY_VERSION: 0.13.0
TOBS_TELEMETRY_TRACING_ENABLED: true
TOBS_TELEMETRY_TIMESCALEDB_ENABLED: true
Mounts:
/etc/promscale/ from configs (rw)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
configs:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: tobs1-promscale
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 21m default-scheduler Successfully assigned observability/tobs1-promscale-799cb7549f-479tz to [my-node]
Normal Pulled 20m (x4 over 21m) kubelet Container image "timescale/promscale:0.13.0" already present on machine
Normal Created 20m (x4 over 21m) kubelet Created container promscale
Normal Started 20m (x4 over 21m) kubelet Started container promscale
Warning Unhealthy 20m (x6 over 21m) kubelet Readiness probe failed: Get "http://172.16.121.21:9201/healthz": dial tcp 172.16.121.21:9201: connect: connection refused
Warning BackOff 57s (x97 over 21m) kubelet Back-off restarting failed container
For completeness, here are the TimescaleDB failures. Prometheus and Grafana fail with identical errors:
kubectl describe pod tobs1-timescaledb-0
Name: tobs1-timescaledb-0
Namespace: observability
Priority: 0
Node: <none>
Labels: app=tobs1-timescaledb
cluster-name=tobs1
controller-revision-hash=tobs1-timescaledb-6865b75968
release=tobs1
statefulset.kubernetes.io/pod-name=tobs1-timescaledb-0
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Controlled By: StatefulSet/tobs1-timescaledb
Init Containers:
tstune:
Image: timescale/timescaledb-ha:pg14.4-ts2.7.2-p0
Port: <none>
Host Port: <none>
Command:
sh
-c
set -e
[ $CPUS -eq 0 ] && CPUS="${RESOURCES_CPU_LIMIT}"
[ $MEMORY -eq 0 ] && MEMORY="${RESOURCES_MEMORY_LIMIT}"
if [ -f "${PGDATA}/postgresql.base.conf" ] && ! grep "${INCLUDE_DIRECTIVE}" postgresql.base.conf -qxF; then
echo "${INCLUDE_DIRECTIVE}" >> "${PGDATA}/postgresql.base.conf"
fi
touch "${TSTUNE_FILE}"
timescaledb-tune -quiet -pg-version 11 -conf-path "${TSTUNE_FILE}" -cpus "${CPUS}" -memory "${MEMORY}MB" \
-yes
# If there is a dedicated WAL Volume, we want to set max_wal_size to 60% of that volume
# If there isn't a dedicated WAL Volume, we set it to 20% of the data volume
if [ "${RESOURCES_WAL_VOLUME}" = "0" ]; then
WALMAX="${RESOURCES_DATA_VOLUME}"
WALPERCENT=20
else
WALMAX="${RESOURCES_WAL_VOLUME}"
WALPERCENT=60
fi
WALMAX=$(numfmt --from=auto ${WALMAX})
# Wal segments are 16MB in size, in this way we get a "nice" number of the nearest
# 16MB
WALMAX=$(( $WALMAX / 100 * $WALPERCENT / 16777216 * 16 ))
WALMIN=$(( $WALMAX / 2 ))
echo "max_wal_size=${WALMAX}MB" >> "${TSTUNE_FILE}"
echo "min_wal_size=${WALMIN}MB" >> "${TSTUNE_FILE}"
Requests:
cpu: 100m
memory: 2Gi
Environment:
TSTUNE_FILE: /var/run/postgresql/timescaledb.conf
RESOURCES_WAL_VOLUME: 20Gi
RESOURCES_DATA_VOLUME: 150Gi
INCLUDE_DIRECTIVE: include_if_exists = '/var/run/postgresql/timescaledb.conf'
CPUS: 1 (requests.cpu)
MEMORY: 2048 (requests.memory)
RESOURCES_CPU_LIMIT: node allocatable (limits.cpu)
RESOURCES_MEMORY_LIMIT: node allocatable (limits.memory)
Mounts:
/var/run/postgresql from socket-directory (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-q8chn (ro)
Containers:
timescaledb:
Image: timescale/timescaledb-ha:pg14.4-ts2.7.2-p0
Ports: 8008/TCP, 5432/TCP
Host Ports: 0/TCP, 0/TCP
Command:
/bin/bash
-c
install -o postgres -g postgres -d -m 0700 "/var/lib/postgresql/data" "/var/lib/postgresql/wal/pg_wal" || exit 1
TABLESPACES=""
for tablespace in ; do
install -o postgres -g postgres -d -m 0700 "/var/lib/postgresql/tablespaces/${tablespace}/data"
done
# Environment variables can be read by regular users of PostgreSQL. Especially in a Kubernetes
# context it is likely that some secrets are part of those variables.
# To ensure we expose as little as possible to the underlying PostgreSQL instance, we have a list
# of allowed environment variable patterns to retain.
#
# We need the KUBERNETES_ environment variables for the native Kubernetes support of Patroni to work.
#
# NB: Patroni will remove all PATRONI_.* environment variables before starting PostgreSQL
# We store the current environment, as initscripts, callbacks, archive_commands etc. may require
# to have the environment available to them
set -o posix
export -p > "${HOME}/.pod_environment"
export -p | grep PGBACKREST > "${HOME}/.pgbackrest_environment"
for UNKNOWNVAR in $(env | awk -F '=' '!/^(PATRONI_.*|HOME|PGDATA|PGHOST|LC_.*|LANG|PATH|KUBERNETES_SERVICE_.*|AWS_ROLE_ARN|AWS_WEB_IDENTITY_TOKEN_FILE)=/ {print $1}')
do
unset "${UNKNOWNVAR}"
done
touch /var/run/postgresql/timescaledb.conf
touch /var/run/postgresql/wal_status
echo "*:*:*:postgres:${PATRONI_SUPERUSER_PASSWORD}" >> ${HOME}/.pgpass
chmod 0600 ${HOME}/.pgpass
export PATRONI_POSTGRESQL_PGPASS="${HOME}/.pgpass.patroni"
exec patroni /etc/timescaledb/patroni.yaml
Requests:
cpu: 100m
memory: 2Gi
Readiness: exec [pg_isready -h /var/run/postgresql] delay=5s timeout=5s period=30s #success=1 #failure=6
Environment Variables from:
tobs1-credentials Secret Optional: false
tobs1-pgbackrest Secret Optional: true
Environment:
PATRONI_admin_OPTIONS: createrole,createdb
PATRONI_REPLICATION_USERNAME: standby
PATRONI_KUBERNETES_POD_IP: (v1:status.podIP)
PATRONI_POSTGRESQL_CONNECT_ADDRESS: $(PATRONI_KUBERNETES_POD_IP):5432
PATRONI_RESTAPI_CONNECT_ADDRESS: $(PATRONI_KUBERNETES_POD_IP):8008
PATRONI_KUBERNETES_PORTS: [{"name": "postgresql", "port": 5432}]
PATRONI_NAME: tobs1-timescaledb-0 (v1:metadata.name)
PATRONI_POSTGRESQL_DATA_DIR: /var/lib/postgresql/data
PATRONI_KUBERNETES_NAMESPACE: observability
PATRONI_KUBERNETES_LABELS: {app: tobs1-timescaledb, cluster-name: tobs1, release: tobs1}
PATRONI_SCOPE: tobs1
PGBACKREST_CONFIG: /etc/pgbackrest/pgbackrest.conf
PGDATA: $(PATRONI_POSTGRESQL_DATA_DIR)
PGHOST: /var/run/postgresql
BOOTSTRAP_FROM_BACKUP: 0
Mounts:
/etc/certificate from certificate (ro)
/etc/pgbackrest from pgbackrest (ro)
/etc/pgbackrest/bootstrap from pgbackrest-bootstrap (ro)
/etc/timescaledb/patroni.yaml from patroni-config (ro,path="patroni.yaml")
/etc/timescaledb/post_init.d from post-init (ro)
/etc/timescaledb/scripts from timescaledb-scripts (ro)
/var/lib/postgresql from storage-volume (rw)
/var/lib/postgresql/wal from wal-volume (rw)
/var/run/postgresql from socket-directory (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-q8chn (ro)
postgres-exporter:
Image: quay.io/prometheuscommunity/postgres-exporter:v0.11.0
Port: 9187/TCP
Host Port: 0/TCP
Environment:
DATA_SOURCE_NAME: host=/var/run/postgresql user=postgres application_name=postgres_exporter
PG_EXPORTER_CONSTANT_LABELS: release=tobs1,namespace=observability
Mounts:
/var/run/postgresql from socket-directory (ro)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-q8chn (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
storage-volume:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: storage-volume-tobs1-timescaledb-0
ReadOnly: false
wal-volume:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: wal-volume-tobs1-timescaledb-0
ReadOnly: false
socket-directory:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
patroni-config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: tobs1-timescaledb-patroni
Optional: false
timescaledb-scripts:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: tobs1-timescaledb-scripts
Optional: false
post-init:
Type: Projected (a volume that contains injected data from multiple sources)
ConfigMapName: custom-init-scripts
ConfigMapOptional: 0xc0007c2269
SecretName: custom-secret-scripts
SecretOptionalName: 0xc0007c226a
pgbouncer:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: tobs1-timescaledb-pgbouncer
Optional: true
pgbackrest:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: tobs1-timescaledb-pgbackrest
Optional: true
certificate:
Type: Secret (a volume populated by a Secret)
SecretName: tobs1-certificate
Optional: false
pgbackrest-bootstrap:
Type: Secret (a volume populated by a Secret)
SecretName: pgbackrest-bootstrap
Optional: true
kube-api-access-q8chn:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 4m42s (x5 over 24m) default-scheduler 0/5 nodes are available: 5 pod has unbound immediate PersistentVolumeClaims. preemption: 0/5 nodes are available: 5 Preemption is not helpful for scheduling.
Anything else we need to know?: I'm new to Kubernetes, so this is probably not a bug but rather my misunderstanding of PersistentVolume storage and how I set up my nodes (VMs). I would love to contribute docs for similarly lost people, and I hope I'm not the only person confused by this. Any help would be greatly appreciated.
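In case it helps others who land here: my current understanding (an assumption on my part, not confirmed by the maintainers) is that a bare kubeadm cluster ships with no default StorageClass or dynamic provisioner, so immediate-binding PVCs never bind and the pods that reference them can never schedule. Either installing a local provisioner or hand-creating PVs that match the claims should unblock scheduling. A minimal manual PV for the TimescaleDB data claim might look like this (the hostPath, capacity, and reclaim policy are assumptions for an on-prem node; the claim name comes from the describe output above):

```yaml
# Hypothetical manual PersistentVolume pre-bound to the TimescaleDB data claim.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: tobs1-timescaledb-data-pv
spec:
  capacity:
    storage: 150Gi          # matches RESOURCES_DATA_VOLUME above
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: /mnt/data/tobs1-timescaledb   # assumed local directory
  claimRef:
    namespace: observability
    name: storage-volume-tobs1-timescaledb-0
```

An equivalent PV would presumably be needed for the WAL claim (`wal-volume-tobs1-timescaledb-0`, 20Gi) and for the Prometheus and Grafana claims.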