Add flag to avoid lock on data dir
What would you like to be added: We would like to add a flag to avoid the lock on the data dir, because the F_SETLK syscall doesn't work.
Why is this needed: We have tried to deploy a pod with a stolon keeper defined as follows:
# apiVersion: apps/v1alpha1
# kind: PetSet
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: stolon-keeper
  namespace: default
spec:
  serviceName: "stolon-keeper"
  replicas: 3
  selector:
    matchLabels:
      component: stolon-keeper
      stolon-cluster: stolon-cluster-default
  template:
    metadata:
      labels:
        component: stolon-keeper
        stolon-cluster: stolon-cluster-default
      annotations:
        pod.alpha.kubernetes.io/initialized: "true"
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
    spec:
      terminationGracePeriodSeconds: 10
      containers:
        - name: stolon-keeper
          image: sorintlab/stolon:v0.16.0-pg12
          command:
            - "/bin/bash"
            - "-ec"
            - |
              # Generate our keeper uid using the pod index
              IFS='-' read -ra ADDR <<< "$(hostname)"
              export STKEEPER_UID="keeper${ADDR[-1]}"
              export POD_IP=$(hostname -i)
              export STKEEPER_PG_LISTEN_ADDRESS=$POD_IP
              export STOLON_DATA=/stolon-data
              chown stolon:stolon $STOLON_DATA
              exec gosu stolon stolon-keeper --data-dir $STOLON_DATA
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: STKEEPER_CLUSTER_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.labels['stolon-cluster']
            - name: STKEEPER_STORE_BACKEND
              value: "kubernetes"
            - name: STKEEPER_KUBE_RESOURCE_KIND
              value: "configmap"
            - name: STKEEPER_PG_REPL_USERNAME
              value: "repluser"
            # Or use a password file like for the superuser password below
            - name: STKEEPER_PG_REPL_PASSWORD
              value: "replpassword"
            - name: STKEEPER_PG_SU_USERNAME
              value: "stolon"
            - name: STKEEPER_PG_SU_PASSWORDFILE
              value: "/etc/secrets/stolon/password"
            - name: STKEEPER_METRICS_LISTEN_ADDRESS
              value: "0.0.0.0:8080"
            # Uncomment this to enable debug logs
            #- name: STKEEPER_DEBUG
            #  value: "true"
          ports:
            - containerPort: 5432
            - containerPort: 8080
          volumeMounts:
            - mountPath: /stolon-data
              name: stolon-persistent-storage
            - mountPath: /etc/secrets/stolon
              name: stolon
      volumes:
        - name: stolon
          secret:
            secretName: stolon
  # Define your own volumeClaimTemplate. This example uses dynamic PV provisioning with a storage class named "standard" (so it will work by default with minikube).
  # In production you should use your own defined storage class and configure your persistent volumes (statically or dynamically using a provisioner, see the related k8s docs).
  volumeClaimTemplates:
    - metadata:
        name: stolon-persistent-storage
      spec:
        storageClassName: managed-nfs-storage
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 50Gi
We got these errors in the logs:
$ kubectl logs stolon-keeper-2
2020-12-24T14:36:39.201Z WARN cmd/keeper.go:182 password file permissions are too open. This file should only be readable to the user executing stolon! Continuing... {"file": "/etc/secrets/stolon/password", "mode": "01000000777"}
2020-12-24T14:36:39.210Z FATAL cmd/keeper.go:2036 cannot take exclusive lock on data dir "/stolon-data/lock": input/output error
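For context, here is a minimal, self-contained Go sketch (not stolon's actual locking code) of the kind of F_SETLK call involved. On an NFS mount whose lock service isn't working, this call typically fails with EIO, which is what surfaces above as "cannot take exclusive lock on data dir ... input/output error":

// lock-probe.go: probe whether an exclusive POSIX lock can be taken on a file.
package main

import (
	"fmt"
	"os"
	"syscall"
)

func main() {
	path := "/stolon-data/lock" // path taken from the error message above

	f, err := os.OpenFile(path, os.O_RDWR|os.O_CREATE, 0600)
	if err != nil {
		fmt.Fprintln(os.Stderr, "open:", err)
		os.Exit(1)
	}
	defer f.Close()

	// Ask for a non-blocking exclusive (write) lock covering the whole file.
	lk := syscall.Flock_t{
		Type:   syscall.F_WRLCK,
		Whence: 0, // offset relative to the start of the file
		Start:  0,
		Len:    0, // 0 means "lock to end of file"
	}
	if err := syscall.FcntlFlock(f.Fd(), syscall.F_SETLK, &lk); err != nil {
		// On storage that can't service POSIX locks this is where EIO shows up.
		fmt.Fprintln(os.Stderr, "F_SETLK failed:", err)
		os.Exit(1)
	}
	fmt.Println("exclusive lock acquired on", path)
}

Running this from inside the keeper pod against the mounted data dir can help confirm whether the problem is the NFS mount rather than stolon itself.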
We are working on an implementation: https://github.com/sorintlab/stolon/pull/813
@alessandro-sorint you should investigate why your NFS server/client doesn't support locking. NFSv4 should have it enabled by default.
We could add a flag to disable locking, but it should be marked as dangerous in its description, since it's a workaround for underlying storage issues and it'll cause data corruption if two keepers run concurrently on the same data dir.
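To illustrate, here is a rough, hypothetical sketch of what such an opt-out could look like; the flag name (--disable-data-dir-locking) and the wiring are assumptions, not necessarily what the linked PRs implement:

package main

import (
	"flag"
	"log"
)

// takeDataDirLock stands in for the keeper's real locking step. When the
// opt-out is set it only logs a loud warning instead of taking the lock.
func takeDataDirLock(dataDir string, lockingDisabled bool) error {
	if lockingDisabled {
		log.Printf("WARNING: data dir locking disabled for %q; running two keepers on the same data dir WILL cause data corruption", dataDir)
		return nil
	}
	// The real keeper would take an exclusive F_SETLK lock on <dataDir>/lock here.
	return nil
}

func main() {
	dataDir := flag.String("data-dir", "/stolon-data", "keeper data directory")
	disableLocking := flag.Bool("disable-data-dir-locking", false,
		"DANGEROUS: skip the exclusive lock on the data dir (only as a workaround for storage without POSIX lock support)")
	flag.Parse()

	if err := takeDataDirLock(*dataDir, *disableLocking); err != nil {
		log.Fatal(err)
	}
	log.Printf("keeper startup would continue with data dir %q", *dataDir)
}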
Thanks @sgotti, in our configuration we use a different volume for every keeper, so I think it's not a problem to remove the file locking.
There's always the possibility that two keepers will run on the same data dir for multiple reasons (wrong configuration, user error, etc.). The real solution is to fix the filesystem locking issues, but if that's not possible I'm ok with adding an option, with a big warning as explained above.
We did it! https://github.com/sorintlab/stolon/pull/817
Is the warning message clear enough? Should we also add a warning log? Thanks
I had the same error when one of the nodes was disconnected and reconnected.