k3s: Error syncing job: rpc error: code = Unknown desc = database is locked
Environment: Production
K3s Version: v1.21.4+k3s1 (image tag k3s:v1.21.4-k3s1)
Node(s) CPU architecture, OS, and Version: Linux ip-10-151-40-235.ec2.internal 5.4.188-104.359.amzn2.x86_64 #1 SMP Thu Apr 14 20:53:13 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Cluster Configuration: EKS cluster with 4 spot nodes.
Describe the bug: Our workflows are failing with a database lock error, which we have traced back to k3s. We would appreciate help finding a solution.
Steps To Reproduce: Not sure how to reproduce this. We are deploying our clusters via Loft (https://loft.sh/).
A virtual Kubernetes cluster in Loft is tied to a single namespace. The virtual cluster and hypervisor run within a single pod that consists of two parts:
- A k3s instance which contains the Kubernetes control plane (API server, controller manager and etcd) for the virtual cluster
- An instance of a virtual cluster hypervisor which is mainly responsible for syncing cluster resources between the k3s-powered virtual cluster and the underlying host cluster. (Ref: https://loft.sh/docs/1.15/self-service/virtual-clusters)
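For reference, a vcluster pod of this kind looks roughly like the sketch below. This is only a simplified illustration based on the Loft docs referenced above; the container names, images, command/args, and mount paths are assumptions and will differ depending on the chart version:

apiVersion: v1
kind: Pod
metadata:
  name: onboarding-test15-0               # illustrative name; the real pod is managed by a StatefulSet
  namespace: vcluster-onboarding-test15
spec:
  containers:
  - name: vcluster                         # k3s server providing the virtual control plane
    image: rancher/k3s:v1.21.4-k3s1
    command: ["/bin/k3s"]                  # assumed entrypoint
    args: ["server", "--data-dir=/data"]   # real flags elided; the sqlite datastore lives under /data
    volumeMounts:
    - name: data
      mountPath: /data
  - name: syncer                           # hypervisor that syncs resources to the host cluster
    image: loftsh/vcluster                 # assumed image name
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: data-onboarding-test15-0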
Please find the k3s logs below:
E0812 10:58:56.903092 7 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}: http: Handler timeout
time="2022-08-12T10:59:00.173157384Z" level=error msg="error in txn: context canceled"
E0812 11:00:01.772938 7 cronjob_controllerv2.go:154] error syncing CronJobController atlas/atlas-statsd-job, requeuing: Operation cannot be fulfilled on cronjobs.batch "atlas-statsd-job": the object has been modified; please apply your changes to the latest version and try again
E0812 11:00:06.640284 7 cronjob_controllerv2.go:154] error syncing CronJobController monitoring/prometheus-health-monitor, requeuing: Operation cannot be fulfilled on cronjobs.batch "prometheus-health-monitor": the object has been modified; please apply your changes to the latest version and try again
time="2022-08-12T11:00:46.209517234Z" level=error msg="error in txn: database is locked"
E0812 11:00:46.209711 7 status.go:71] apiserver received an error that is not an metav1.Status: &status.statusError{Code:2, Message:"database is locked", Details:[]*anypb.Any(nil), XXX_NoUnkeyedLiteral:struct {}{}, XXX_unrecognized:[]uint8(nil), XXX_sizecache:0}: rpc error: code = Unknown desc = database is locked
E0812 11:03:57.679420 7 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}: http: Handler timeout
time="2022-08-12T11:03:57.815170693Z" level=error msg="error in txn: context canceled"
time="2022-08-12T11:04:01.362475215Z" level=error msg="Compact failed: failed to compact to revision 48700823: database is locked"
E0812 11:10:00.129181 7 cronjob_controllerv2.go:154] error syncing CronJobController atlas/atlas-statsd-job, requeuing: Operation cannot be fulfilled on cronjobs.batch "atlas-statsd-job": the object has been modified; please apply your changes to the latest version and try again
E0812 11:15:00.143464 7 cronjob_controllerv2.go:154] error syncing CronJobController monitoring/prometheus-health-monitor, requeuing: Operation cannot be fulfilled on cronjobs.batch "prometheus-health-monitor": the object has been modified; please apply your changes to the latest version and try again
E0812 11:18:06.399401 7 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}: http: Handler timeout
E0812 11:20:00.138191 7 cronjob_controllerv2.go:154] error syncing CronJobController atlas/atlas-statsd-job, requeuing: Operation cannot be fulfilled on cronjobs.batch "atlas-statsd-job": the object has been modified; please apply your changes to the latest version and try again
time="2022-08-12T11:29:01.370345967Z" level=error msg="Compact failed: failed to compact to revision 48724650: database is locked"
E0812 11:29:10.846344 7 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}: http: Handler timeout
E0812 11:30:02.035306 7 cronjob_controllerv2.go:154] error syncing CronJobController atlas/atlas-statsd-job, requeuing: Operation cannot be fulfilled on cronjobs.batch "atlas-statsd-job": the object has been modified; please apply your changes to the latest version and try again
E0812 11:30:04.198513 7 cronjob_controllerv2.go:154] error syncing CronJobController monitoring/prometheus-health-monitor, requeuing: Operation cannot be fulfilled on cronjobs.batch "prometheus-health-monitor": the object has been modified; please apply your changes to the latest version and try again
time="2022-08-12T11:30:08.967942410Z" level=error msg="error in txn: database is locked"
E0812 11:30:08.968146 7 status.go:71] apiserver received an error that is not an metav1.Status: &status.statusError{Code:2, Message:"database is locked", Details:[]*anypb.Any(nil), XXX_NoUnkeyedLiteral:struct {}{}, XXX_unrecognized:[]uint8(nil), XXX_sizecache:0}: rpc error: code = Unknown desc = database is locked
E0812 11:30:08.968653 7 job_controller.go:866] rpc error: code = Unknown desc = database is locked
I0812 11:30:08.968741 7 event.go:291] "Event occurred" object="monitoring/prometheus-health-monitor-27671730" kind="Job" apiVersion="batch/v1" type="Warning" reason="FailedCreate" message="Error creating: rpc error: code = Unknown desc = database is locked"
E0812 11:30:08.969184 7 job_controller.go:404] Error syncing job: rpc error: code = Unknown desc = database is locked
The "database is locked" error indicates that sqlite's locks are running into conflicts, which is not something that should happen unless something strange is going on.
What is the backing filesystem for the K3s container? The only time I've ever seen this is when using k3s with kine/sqlite on a backing filesystem with poor locking semantics - NFS in particular, or when trying to run multiple copies of K3s against the same local filesystem.
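One quick way to check the backing filesystem from inside the running pod is something like the following (the container name "vcluster" and the /data path are assumptions and may differ in your deployment):

kubectl exec -n <vcluster-namespace> <vcluster-pod> -c vcluster -- cat /proc/mounts | grep data
kubectl get pvc -n <vcluster-namespace> -o wide

The first command shows which device and filesystem back the k3s data directory; the second shows the storage class and volume behind the PVC.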
We are using an EBS volume as the backing filesystem, mounted into the pod via a PVC.
What is the filesystem on the EBS volume, and what is the volume type (gp2, st1, and so on)? Are you passing K3s any additional datastore options that might change the sqlite locking semantics?
Also, can you reproduce this on more recent versions of k3s? The 1.21 branch is end of life.
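To double-check whether any extra datastore flags are being passed, you could inspect the k3s container's arguments in the pod spec, for example (the container name "vcluster" is an assumption):

kubectl get pod <vcluster-pod> -n <vcluster-namespace> \
  -o jsonpath='{.spec.containers[?(@.name=="vcluster")].args}'
kubectl get pod <vcluster-pod> -n <vcluster-namespace> -o yaml | grep -i datastore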
Filesystem on EBS volume: ext4. Volume type: gp2. We are not passing any additional datastore options to k3s. Do you mean we should try upgrading to v1.24.3+k3s1 and test whether this still happens?
I would give that a try, although I regularly run K3s on EC2 instances with gp2 volumes and don't have any issues, so I suspect perhaps something else is going on. Can you provide the full yaml output of the k3s pod spec you're using?
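If you do end up testing a newer k3s, one way to do it (assuming the vcluster is deployed from the Loft/vcluster Helm chart; the exact values key may differ between chart versions) would be to pin a newer image tag in the chart values and upgrade the release:

# values.yaml (the "vcluster.image" key is an assumption based on the vcluster chart)
vcluster:
  image: rancher/k3s:v1.24.3-k3s1

helm upgrade <release-name> vcluster --repo https://charts.loft.sh \
  -n <vcluster-namespace> -f values.yaml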
vcluster (Loft): https://loft.sh/docs/getting-started/explore/virtual-clusters We are using the vcluster template, and it uses the k3s image. Do you want the yaml output of that? @brandond
Does that end up running k3s directly in a pod? If so, then yes.
Please find the attached file: vcluster-pod.txt
Can you show the volume/pvc for this?
volumes:
- name: data
  persistentVolumeClaim:
    claimName: data-onboarding-test15-0
❯ k get pvc data-onboarding-test15-0 -o yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    pv.kubernetes.io/bind-completed: "yes"
    pv.kubernetes.io/bound-by-controller: "yes"
    volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/aws-ebs
    volume.kubernetes.io/selected-node: <node-ip.xxxx>
  finalizers:
  - kubernetes.io/pvc-protection
  labels:
    app: vcluster
    release: onboarding-test15
  name: data-onboarding-test15-0
  namespace: vcluster-onboarding-test15
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
  storageClassName: gp2
  volumeMode: Filesystem
  volumeName: pvc-fdfa68be-0984-4f84-a484-abe45f1bb658
status:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 50Gi
  phase: Bound
Hmm, that all looks pretty standard. How much load are you putting on that vcluster? The way gp2 throughput provisioning works, at 50GiB you're only guaranteed 150 IOPS; is it possible you're exceeding that and the disk itself is blocking? The CloudWatch metrics should be able to tell you that pretty quickly.
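As a rough sketch of how to check this from the CLI (the volume ID and time window below are placeholders), you can sum the EBS write and read ops over a period and divide by the period length in seconds to approximate IOPS:

aws cloudwatch get-metric-statistics --namespace AWS/EBS \
  --metric-name VolumeWriteOps \
  --dimensions Name=VolumeId,Value=vol-xxxxxxxx \
  --start-time 2022-08-12T10:00:00Z --end-time 2022-08-12T12:00:00Z \
  --period 300 --statistics Sum
# approximate write IOPS for each datapoint = Sum / 300
# repeat with --metric-name VolumeReadOps and add the two for total IOPS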
@brandond I have checked the write throughput (ops/s) for the affected EBS volumes and noticed it spikes above 150 IOPS and then drops back; when it is not exceeding that, it sits right at the edge of the guaranteed IOPS value. This could be the major cause of the problem. What should we consider next, given that gp2 does not allow provisioning custom IOPS?

Your infrastructure provider has placed an upper bound on disk IOPS and you're exceeding it. I am not sure that there is anything to be done on the K3s side; you need to ensure that your storage is sufficient for the workload you're putting on it.
I would probably either change to one of the Provisioned IOPS volume types, or (if you are stuck on General Purpose volumes) over-provision the volume size such that the allocated IOPS meets your required throughput. See: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/general-purpose.html#gp2-performance
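For concreteness, gp2 baseline performance is 3 IOPS per GiB (minimum 100 IOPS, up to 16,000), so the over-provisioning math is simple; the 600 IOPS target below is just an illustrative number:

required size   = target IOPS / 3 IOPS per GiB
e.g. 600 IOPS  -> 600 / 3 = 200 GiB minimum volume size
current volume  = 50 GiB  -> 50 * 3 = 150 IOPS baseline (the limit mentioned above)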
I'm currently testing this by modifying the volume from gp2 to gp3 so that higher IOPS (e.g. 3000) can be allocated, and observing the behavior.
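For reference, an existing volume can be switched to gp3 with explicit IOPS in place using something like the following (the volume ID is a placeholder; gp3 also lets you provision throughput separately via --throughput):

aws ec2 modify-volume --volume-id vol-xxxxxxxx --volume-type gp3 --iops 3000

Note that the PVC's storageClassName will still read gp2 after an in-place modification; new volumes would need a separate gp3 StorageClass.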
@brandond I have calculated IOPS as the sum of write throughput from the metrics over each period, as shown in the screenshot above. Is that the correct way of calculating it?
Total disk IOPS would be the sum of all read and write ops. For disk types that support burstable IOPS, I think there's another metric for the volume credits that you might also look at. We're working at the edge of my knowledge of EC2 though; you might need to consult AWS support for better advice.
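For gp2, the credit metric is BurstBalance (percent of burst credits remaining); a quick way to watch it, using the same placeholder volume ID and time window as above:

aws cloudwatch get-metric-statistics --namespace AWS/EBS \
  --metric-name BurstBalance \
  --dimensions Name=VolumeId,Value=vol-xxxxxxxx \
  --start-time 2022-08-12T10:00:00Z --end-time 2022-08-12T12:00:00Z \
  --period 300 --statistics Average
# a BurstBalance trending toward 0% means the volume has exhausted its IOPS credits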