`src/vfs.c:1271: vfsMainFileClose: Assertion '(f->database->shm.lock[0] == 0)'`
In the LXD test suite, during the clustering_handover test, dqlite hit an assertion failure:
2025-10-09T23:08:03.9744500Z + lxd shutdown
2025-10-09T23:08:03.9761816Z time="2025-10-09T23:08:03Z" level=info msg="Found cluster member whose role needs to be changed" candidateAddress="100.64.1.103:8443" local="100.64.1.101:8443" newRole=spare
2025-10-09T23:08:03.9765596Z time="2025-10-09T23:08:03Z" level=info msg="Demoting offline member during rebalance" candidateAddress="100.64.1.103:8443"
2025-10-09T23:08:03.9890740Z time="2025-10-09T23:08:03Z" level=info msg="Completed heartbeat round" duration=142.222769ms local="100.64.1.101:8443"
2025-10-09T23:08:03.9943000Z time="2025-10-09T23:08:03Z" level=info msg="Starting heartbeat round" local="100.64.1.101:8443" mode=initial
2025-10-09T23:08:03.9952887Z time="2025-10-09T23:08:03Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://100.64.1.103:8443/internal/database\": dial tcp 100.64.1.103:8443: connect: connection refused" remote="100.64.1.103:8443"
2025-10-09T23:08:04.0008186Z time="2025-10-09T23:08:04Z" level=info msg="Partial heartbeat received" local="100.64.1.104:8443"
2025-10-09T23:08:04.0011674Z time="2025-10-09T23:08:04Z" level=info msg="Partial heartbeat received" local="100.64.1.102:8443"
2025-10-09T23:08:04.0025965Z time="2025-10-09T23:08:04Z" level=info msg="Asked to shutdown by API" force=false
2025-10-09T23:08:04.0029878Z time="2025-10-09T23:08:04Z" level=info msg="Starting shutdown sequence" signal="power failure"
2025-10-09T23:08:04.0031115Z time="2025-10-09T23:08:04Z" level=info msg="Unwatching path" driver=inotify path=/var/lib/ubuntu-advantage/interfaces/lxd-config.json
2025-10-09T23:08:04.0032123Z time="2025-10-09T23:08:04Z" level=info msg="Transferring leadership" address="100.64.1.101:8443"
2025-10-09T23:08:04.0050157Z time="2025-10-09T23:08:04Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://100.64.1.103:8443/internal/database\": dial tcp 100.64.1.103:8443: connect: connection refused" remote="100.64.1.103:8443"
2025-10-09T23:08:04.0089964Z lxd: src/vfs.c:1271: vfsMainFileClose: Assertion `(f->database->shm.lock[0] == 0)' failed.
2025-10-09T23:08:04.0119308Z time="2025-10-09T23:08:04Z" level=warning msg="Failed adding member event listener client" err="dial tcp 100.64.1.103:8443: connect: connection refused" local="100.64.1.104:8443" remote="100.64.1.103:8443"
2025-10-09T23:08:04.0123219Z time="2025-10-09T23:08:04Z" level=warning msg="Failed adding member event listener client" err="dial tcp 100.64.1.103:8443: connect: connection refused" local="100.64.1.102:8443" remote="100.64.1.103:8443"
2025-10-09T23:08:04.0125165Z time="2025-10-09T23:08:04Z" level=info msg="Dqlite connected outbound" local="100.64.1.102:40880" name=raft remote="100.64.1.104:8443"
2025-10-09T23:08:04.0140110Z time="2025-10-09T23:08:04Z" level=info msg="Dqlite proxy started" local="100.64.1.104:8443" name=dqlite remote="100.64.1.1:40880"
2025-10-09T23:08:04.0143688Z time="2025-10-09T23:08:04Z" level=info msg="Dqlite proxy started" local="100.64.1.102:40880" name=raft remote="100.64.1.104:8443"
2025-10-09T23:08:04.0172181Z time="2025-10-09T23:08:04Z" level=info msg="Dqlite connected outbound" local="100.64.1.104:33382" name=raft remote="100.64.1.102:8443"
2025-10-09T23:08:04.0187323Z time="2025-10-09T23:08:04Z" level=info msg="Dqlite proxy started" local="100.64.1.102:8443" name=dqlite remote="100.64.1.1:33382"
2025-10-09T23:08:04.0191223Z time="2025-10-09T23:08:04Z" level=info msg="Dqlite proxy started" local="100.64.1.104:33382" name=raft remote="100.64.1.102:8443"
2025-10-09T23:08:05.2390792Z Error: Put "http://unix.socket/internal/shutdown?force=false": EOF
2025-10-09T23:08:05.2392846Z time="2025-10-09T23:08:05Z" level=info msg="Dqlite proxy stopped" local="100.64.1.104:8443" name=dqlite remote="100.64.1.1:52886"
2025-10-09T23:08:05.2394393Z time="2025-10-09T23:08:05Z" level=info msg="Dqlite proxy stopped" local="100.64.1.102:8443" name=dqlite remote="100.64.1.1:40832"
2025-10-09T23:08:05.2395726Z time="2025-10-09T23:08:05Z" level=warning msg="Dqlite proxy failed" err="first: remote -> local: read tcp 100.64.1.102:42778->100.64.1.101:8443: read: connection reset by peer" local="100.64.1.102:42778" name=raft remote="100.64.1.101:8443"
2025-10-09T23:08:05.2396937Z time="2025-10-09T23:08:05Z" level=info msg="Dqlite proxy stopped" local="100.64.1.102:42778" name=raft remote="100.64.1.101:8443"
2025-10-09T23:08:05.2397788Z time="2025-10-09T23:08:05Z" level=info msg="Dqlite proxy stopped" local="100.64.1.104:33280" name=raft remote="100.64.1.101:8443"
2025-10-09T23:08:05.2404614Z suites/clustering.sh: line 2820: 32574 Aborted (core dumped) nsenter -n -m -t "${pid}" lxd --logfile "${LXD_DIR}/lxd.log" "${SERVER_DEBUG-}" "$@" 2>&1
2025-10-09T23:08:05.2405206Z + cleanup
There are many failures in the other tests run with that PR, but they are unrelated to this rare dqlite issue.
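For context, the assertion in vfsMainFileClose checks that no shared-memory lock is still held when the main database file is closed. Below is a minimal sketch of the invariant as I understand it, assuming dqlite mirrors SQLite's WAL shm lock layout (slot 0 is the WAL write lock); apart from the shm.lock array named in the assertion, every name and constant here is illustrative:

    #include <assert.h>

    #define SQLITE_SHM_NLOCK 8 /* number of shm lock slots (matches SQLite) */
    #define WAL_WRITE_LOCK   0 /* slot 0 is the WAL write lock */

    struct vfsShm {
        unsigned lock[SQLITE_SHM_NLOCK]; /* holders per lock slot (assumed layout) */
    };

    static void vfsCheckCloseInvariant(const struct vfsShm *shm)
    {
        /* No one may still hold the WAL write lock when the main file
         * is closed; a non-zero count means the lock was leaked. */
        assert(shm->lock[WAL_WRITE_LOCK] == 0);
    }

If that reading is right, the abort means some connection still held the write lock slot when the file was torn down.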
The additional instrumentation worked, and I now have a longer trace:
0x7f539a4d6584 dqlite_fail
src/lib/assert.c:19
0x7f539a4e306f vfsMainFileClose
src/vfs.c:1271
0x7f539a5f2c12 ???
???:0
0x7f539a602f42 ???
???:0
0x7f539a588c07 ???
???:0
0x7f539a5f8197 ???
???:0
0x7f539a637bf9 ???
???:0
0x7f539a4deb26 stmt__close
src/stmt.c:22
0x7f539a4deb26 stmt__registry_close
src/stmt.c:32
0x7f539a4d1752 gateway__leader_close_cb
src/gateway.c:56
0x7f539a4d5205 leader_finalize
src/leader.c:98
0x7f539a4d5c07 exec_tick
src/leader.c:594
0x7f539a505ec6 uvAsyncAfterWorkCb
src/raft/uv_work.c:33
0x7f539a449c41 ???
???:0
0x7f539a4411ac ???
???:0
0x7f539a45d0fd ???
???:0
0x7f539a446c47 ???
???:0
0x7f539a4dbb47 taskRun
src/server.c:711
0x7f539a4dbb47 taskStart
src/server.c:774
0x7f539a294ac2 ???
???:0
0x7f539a3268bf ???
???:0
0xffffffffffffffff ???
???:0
lxd: src/vfs.c:1271: vfsMainFileClose: Assertion `(f->database->shm.lock[0] == 0)' failed.
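One plausible reading of the unresolved ??? frames between stmt__close and vfsMainFileClose is that they are libsqlite3 internals: if the connection was closed with sqlite3_close_v2() while statements were still outstanding, SQLite defers the real close, and finalizing the last statement is what actually tears down the pager and invokes the VFS xClose. A minimal sketch of that documented SQLite behavior (whether it applies to this trace is an assumption on my part):

    #include <sqlite3.h>
    #include <stddef.h>

    int main(void)
    {
        sqlite3 *db;
        sqlite3_stmt *stmt;

        sqlite3_open(":memory:", &db);
        sqlite3_prepare_v2(db, "SELECT 1", -1, &stmt, NULL);

        /* With a statement still outstanding, close_v2 returns SQLITE_OK
         * and only marks the connection as a zombie. */
        sqlite3_close_v2(db);

        /* Finalizing the last statement completes the deferred close,
         * which closes the pager and calls the VFS xClose -- the point
         * where vfsMainFileClose and its assertion would run. */
        sqlite3_finalize(stmt);
        return 0;
    }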
Hey! This trace helped, but unfortunately it hasn't been enough to pin down the bug so far.
We are adding some more instrumentation to get an idea of what the cluster was doing right before crashing (#858).
I hope that will help us find out where the problem is; sorry for the delay.
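Roughly, the idea is to keep a small in-memory ring of recent events and dump it to stderr when the assertion fires. A hypothetical sketch of the approach (this is not the actual #858 change; every name below is made up):

    #include <stdio.h>

    #define TRACE_SLOTS   256
    #define TRACE_MSG_LEN 128

    static char traceRing[TRACE_SLOTS][TRACE_MSG_LEN];
    static unsigned long traceNext;

    /* Record one event; old entries are overwritten once the ring is full. */
    static void traceRecord(const char *msg)
    {
        snprintf(traceRing[traceNext % TRACE_SLOTS], TRACE_MSG_LEN, "%s", msg);
        traceNext++;
    }

    /* Dump the ring to stderr; meant to be called from the assertion
     * failure handler, right before aborting. */
    static void traceDump(void)
    {
        unsigned long start =
            traceNext > TRACE_SLOTS ? traceNext - TRACE_SLOTS : 0;
        for (unsigned long i = start; i < traceNext; i++) {
            fprintf(stderr, "trace[%lu]: %s\n", i, traceRing[i % TRACE_SLOTS]);
        }
    }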
I've hit it again in my fork in the clustering_rebalance test:
time="2025-11-12T19:33:26Z" level=info msg="Dqlite proxy started" local="100.64.1.103:45018" name=raft remote="100.64.1.102:8443"
time="2025-11-12T19:33:26Z" level=info msg="Dqlite connected outbound" local="100.64.1.102:36090" name=raft remote="100.64.1.103:8443"
time="2025-11-12T19:33:26Z" level=info msg="Dqlite proxy started" local="100.64.1.103:8443" name=dqlite remote="100.64.1.1:36090"
time="2025-11-12T19:33:26Z" level=info msg="Dqlite proxy started" local="100.64.1.102:36090" name=raft remote="100.64.1.103:8443"
0x7ff3c74cb584 dqlite_fail
src/lib/assert.c:19
0x7ff3c74d806f vfsMainFileClose
src/vfs.c:1271
0x7ff3c75e7c12 ???
???:0
0x7ff3c75f7f42 ???
???:0
0x7ff3c757dc07 ???
???:0
0x7ff3c75ed197 ???
???:0
0x7ff3c762cbf9 ???
???:0
0x7ff3c74d3b26 stmt__close
src/stmt.c:22
0x7ff3c74d3b26 stmt__registry_close
src/stmt.c:32
0x7ff3c74c6752 gateway__leader_close_cb
src/gateway.c:56
0x7ff3c74ca205 leader_finalize
src/leader.c:98
0x7ff3c74cac07 exec_tick
src/leader.c:594
0x7ff3c74faec6 uvAsyncAfterWorkCb
src/raft/uv_work.c:33
0x7ff3c743ec41 ???
???:0
0x7ff3c74361ac ???
???:0
0x7ff3c74520fd ???
???:0
0x7ff3c743bc47 ???
???:0
0x7ff3c74d0b47 taskRun
src/server.c:711
0x7ff3c74d0b47 taskStart
src/server.c:774
0x7ff3c7294ac2 ???
???:0
0x7ff3c73268bf ???
???:0
0xffffffffffffffff ???
???:0
lxd: src/vfs.c:1271: vfsMainFileClose: Assertion `(f->database->shm.lock[0] == 0)' failed.
time="2025-11-12T19:33:28Z" level=warning msg="Dqlite proxy failed" err="first: remote -> local: read tcp 100.64.1.103:37882->100.64.1.101:8443: read: connection reset by peer" local="100.64.1.103:37882" name=raft remote="100.64.1.101:8443"
I know #858 isn't merged yet, but maybe there is still some value in having another trace.
#858 is now in main. Next time this happens, we should have a clear picture of what was happening during the crash.