Occasional dbfarm corruption upon database restart
Describe the bug
For the third time now, each time on a different database, a properly shut down database has failed to restart, with the following errors in merovingian.log:
2021-06-28 11:33:10 MSG merovingian[7]: database 'equip-vc_default01' (-1) has exited with exit status 0
2021-06-28 11:33:20 MSG merovingian[7]: starting database 'equip-vc_default01', up min/avg/max: 1s/1d/6d, crash average: 0.00 0.40 0.13 (8-4=4)
2021-06-28 11:33:21 MSG equip-vc_default01[51470]: arguments: /opt/monetdb/bin/mserver5 --dbpath=/var/lib/monetdb/dbfarm/equip-vc_default01 --set merovingian_uri=mapi:monetdb://f4c5ad81e6df:50000/equip-vc_default01 --set mapi_listenaddr=none --set mapi_usock=/var/lib/monetdb/dbfarm/equip-vc_default01/.mapi.sock --set monet_vault_key=/var/lib/monetdb/dbfarm/equip-vc_default01/.vaultkey --set gdk_nr_threads=8 --set max_clients=64 --set sql_optimizer=sequential_pipe --set embedded_py=3 --set mal_for_all=yes
2021-06-28 11:33:21 ERR equip-vc_default01[51470]: #main thread: BBPcheckbats: !ERROR: BBPcheckbats: cannot stat file /var/lib/monetdb/dbfarm/equip-vc_default01/bat/05/513.tail (expected size 18536): No such file or directory
2021-06-28 11:33:23 MSG merovingian[7]: database 'equip-vc_default01' (-1) has exited with exit status 1
Storage is a local SSD, so I tend to rule out hardware-related issues.
To Reproduce
Unfortunately I am not able to reproduce it reliably. I can only say it never happened before Oct2020, and it has now happened three times, so I suspect a bug in the storage layer triggered by some corner case. I know it's hard to find the cause without a test case; I just hope this rings a bell.
Software versions
- 11.39.18
- CentOS 7
- compiled from source
This is a known problem that occasionally happens. Unfortunately, we have never received sufficient information to find its cause. If you can give us any more information, we're more than happy to investigate.
A workaround (don't forget to create a backup of the current dbfarm first!) is to create the missing file and fill it with dummy data until it reaches the expected size. The BBP.dir file should tell you the data type. This way you can at least get the database restarted to save the remaining data.
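For reference, a minimal sketch of that workaround, assuming the path and expected size reported in the error message above (adjust both for your own case, and back up the dbfarm first):

```python
# Sketch only: recreate the missing tail file and pad it with zero bytes so
# that BBPcheckbats' size check passes. The path and size below come from the
# log above and are examples; take yours from your own error message.
import os

missing_file = "/var/lib/monetdb/dbfarm/equip-vc_default01/bat/05/513.tail"
expected_size = 18536  # bytes, as reported by BBPcheckbats

with open(missing_file, "wb") as f:
    f.truncate(expected_size)  # extends the new, empty file with zero bytes

assert os.path.getsize(missing_file) == expected_size
```

Keep in mind the recreated column then contains dummy values rather than the original data, so dump and check the affected tables afterwards.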
Thanks, Jennie. Good to know you are aware of it.
Just a thought: wouldn't it be useful to make the workaround automatic, and then inform the user that tables x, y, w are corrupt?
Maybe it's not much, but something else I noticed:
- the problem occurs quite frequently with Oct2020
- it seems to happen regularly after the database has filled up the disk. Stop, start, error. Today this happened twice, on two different databases.
Checked in a possible fix (rolled forward changes from the Jun branch).
We made some fixes recently on the Jul2021 branch. Please check whether this still happens once Jul2021-SP1 comes out.
Unfortunately this still happens (Jan2022, git head).
A seemingly fine database was running on a system that somehow was leaking 11G of disk space.
df was reporting a partition usage 11G higher than what du reported.
As soon as I stopped the db, the missing 11G were freed and suddenly df and du agreed.
Then, when I tried to restart the db, it refused with:
#main thread: BBPcheckbats: !ERROR: cannot stat file /var/lib/monetdb/dbfarm/default01/bat/02/11/21122.theap: No such file or directory
So this is most likely the file that held those 11G. It had already been deleted from disk, but it was still open in MonetDB.
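To illustrate how this can be confirmed on a still-running server, here is a rough sketch (assuming Linux; the PID below is just an example, substitute the one from the running mserver5) that lists files which have been unlinked from disk but are still held open by the process:

```python
# Sketch: list file descriptors of a running mserver5 that point to files
# already deleted from disk. Such files keep occupying space until the
# process closes them, which would explain the df/du discrepancy.
import os

pid = 51470  # example PID; substitute the actual mserver5 PID
fd_dir = f"/proc/{pid}/fd"

for fd in sorted(os.listdir(fd_dir), key=int):
    link = os.path.join(fd_dir, fd)
    try:
        target = os.readlink(link)
        if target.endswith("(deleted)"):
            size = os.stat(link).st_size  # size of the still-open file
            print(f"fd {fd}: {target} ({size} bytes)")
    except OSError:
        continue  # descriptor closed in the meantime
```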
I'm not sure what I can do to help debug this, but it is quite serious.
When you notice this again on a still running database, could you attach a debugger and call BBPdump() from the debugger? This function writes information about all known BATs to stderr, so hopefully the server's stderr goes somewhere. To be safe with respect to other threads running during this, you could do this sequence:
set scheduler-locking on
call BBPdump()
set scheduler-locking off
It would then be interesting to correlate that output with the files present in the database. So if you could also list all files inside the database directory at the same time (i.e. while the server is stopped in the debugger) and upload both results, that would (hopefully) be helpful.
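One possible way to capture that file listing while the server is paused in the debugger (the dbpath below is an example, taken from the mserver5 arguments earlier in this report):

```python
# Sketch: record every file under the database directory together with its
# size, so it can be correlated with the BBPdump() output afterwards.
import os

dbpath = "/var/lib/monetdb/dbfarm/equip-vc_default01"  # example dbpath

with open("/tmp/db_files.txt", "w") as out:
    for root, _dirs, files in os.walk(dbpath):
        for name in files:
            path = os.path.join(root, name)
            out.write(f"{os.path.getsize(path)}\t{path}\n")
```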