[BUG] - Node fails to start because `DbLocked` is thrown
External
Summary
In a recent Antithesis run we got an interesting failure: one of the nodes (out of a cluster of 5) crashed with the following error messages:
DbLocked "/state/lock"
cardano-node: The db is used by another process. File "/state/lock" is locked
According to the Antithesis logs, this seems to have happened when the node was restarted after having been killed.
Steps to reproduce
We only had one occurrence of this error, so it seems very hard to reproduce 😅
I suspect this happens because the lock file was not properly released when the node stopped, as there's no guarantee that release is instantaneous when using file-based locks.
Expected behavior
Not sure
System info (please complete the following information):
- OS Name: Linux x86 (runs on Antithesis hypervisor)
- Consensus version: tested through cardano-node
cardano-node 10.5.1 - linux-x86_64 - ghc-9.6
git rev ca1ec278070baf4481564a6ba7b4a5b9e3d9f366
We would need some more information on how this bug was triggered. From the error message, it seems that it's not only that the file exists, but that it is still file-locked.
The file-locking package releases the lock once the process that holds it has terminated, and in Consensus we even wait a couple of seconds if the file is locked.
This seems to imply that the previous process was still running and alive when the later one was started, so we would need more information about this.
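To make the distinction between "the file exists" and "the file is locked" concrete, here is a minimal sketch in Python. It is not the Consensus code: it assumes `flock(2)`-style advisory locks on Linux (whether the Haskell file-locking package uses the same primitive is not confirmed here), and the names `try_lock` and `lock_with_timeout`, as well as the timeout and polling values, are hypothetical.

```python
import fcntl
import os
import tempfile
import time


def try_lock(path):
    """Try to take an exclusive advisory lock on `path` without blocking.

    Returns the open file descriptor on success, or None if the lock is
    currently held through another open file description.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        # Someone else holds the lock; do not leak our descriptor.
        os.close(fd)
        return None
    return fd


def lock_with_timeout(path, timeout_s=2.0, poll_s=0.1):
    """Retry `try_lock` until it succeeds or `timeout_s` has elapsed."""
    deadline = time.monotonic() + timeout_s
    while True:
        fd = try_lock(path)
        if fd is not None:
            return fd
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{path} is still locked after {timeout_s}s")
        time.sleep(poll_s)


if __name__ == "__main__":
    lock_path = os.path.join(tempfile.mkdtemp(), "lock")  # hypothetical path

    holder = try_lock(lock_path)          # first "node" takes the lock
    print("second attempt:", try_lock(lock_path))  # lock is held elsewhere

    os.close(holder)                      # closing the descriptor releases it
    print("file still exists:", os.path.exists(lock_path))

    fd = lock_with_timeout(lock_path)     # succeeds although the file exists
    os.close(fd)
```

In this sketch the lock file is never deleted: the lock lives on the open file descriptor, so a leftover file alone does not block a later start, while a descriptor that is still open somewhere does.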
Unfortunately, the Antithesis reports are still not public, but I could walk you through it if you want, and perhaps we could investigate a bit more what happened. I don't think the previous process was running, or perhaps it was lingering as a zombie 🤷 As you know, AT can reveal subtle and annoying bugs.
Here is another failure; this time you should be able to access the report:
https://cardano.antithesis.com/report/uZOLNDnYB8uZNHti7Mtr4jYJ/i-xPW8J3f26-5jJazC5ItwaNTqFMhZNpm45RSbf7ahU.html?auth=v2.public.eyJuYmYiOiIyMDI1LTA5LTIxVDAxOjAwOjQ3LjU1NDg2NzkzNVoiLCJzY29wZSI6eyJSZXBvcnRTY29wZVYxIjp7ImFzc2V0IjoiaS14UFc4SjNmMjYtNWpKYXpDNUl0d2FOVHFGTWhaTnBtNDVSU2JmN2FoVS5odG1sIiwicmVwb3J0X2lkIjoidVpPTE5EbllCOHVaTkh0aTdNdHI0allKIn19fTQITnWPZvdWmx8UC14NPi4XvgnkJFppQaME3cfCzwIIi5hz7ot2mdoZ76oZhb2tUwBh0warnTuKm2_9nMw5vgE#/run/41f386c17addb138c57f7aa512aacf78-38-6/finding/0d34d536cc622ac1f8b81b6febdfd68df9de2eea
As we'll probably meet next week, perhaps this is a good opportunity to investigate the failure together in the AT debugger?