rocksdb icon indicating copy to clipboard operation
rocksdb copied to clipboard

RocksDB deleted some sst files around the time of the shutdown, but later re-opening db complained about missing those files

Open xunlai opened this issue 3 years ago • 3 comments

Expected behavior

Normal shutdown of the DB, and re-opening of the DB should succeed.  

Actual behavior

Normal shutdown of the DB, but re-opening of the DB failed and complained about missing files due to deleted sst files.  

Steps to reproduce the behavior

I can't reproduce it easily.  But it happened sometimes in our production system.  Here is what I captured.   The last few lines of the LOG file showing deletion of sst files along with shutdown. 2021/11/13-20:33:37.843454 7febe0ded700 [/db_impl/db_impl.cc:803] Storing 159 stats with timestamp 1636835617 to in-memory stats history 2021/11/13-20:33:37.843554 7febe0ded700 [/db_impl/db_impl.cc:816] [Pre-GC] In-memory stats history size: 1055060 bytes, slice count: 142 2021/11/13-20:33:37.843639 7febe0ded700 [/db_impl/db_impl.cc:825] [Post-GC] In-memory stats history size: 1055060 bytes, slice count: 141 2021/11/13-20:38:54.978475 7fec913d9700 [/db_impl/db_impl.cc:447] Shutdown: canceling all background work 2021/11/13-20:38:55.004287 7fec913d9700 [le/delete_scheduler.cc:73] Deleted file /path/to/rocksdb/048559.sst immediately, rate_bytes_per_sec 0, total_trash_size 0 max_trash_db_ratio 0.250000 2021/11/13-20:38:55.004317 7fec913d9700 EVENT_LOG_v1 {"time_micros": 1636835935004312, "job": 7943, "event": "table_file_deletion", "file_number": 48559} 2021/11/13-20:38:55.008765 7fec913d9700 [le/delete_scheduler.cc:73] Deleted file /path/to/rocksdb/048526.sst immediately, rate_bytes_per_sec 0, total_trash_size 0 max_trash_db_ratio 0.250000 2021/11/13-20:38:55.008778 7fec913d9700 EVENT_LOG_v1 {"time_micros": 1636835935008776, "job": 7943, "event": "table_file_deletion", "file_number": 48526} 2021/11/13-20:38:55.013012 7fec913d9700 [le/delete_scheduler.cc:73] Deleted file /path/to/rocksdb/048510.sst immediately, rate_bytes_per_sec 0, total_trash_size 0 max_trash_db_ratio 0.250000 2021/11/13-20:38:55.013022 7fec913d9700 EVENT_LOG_v1 {"time_micros": 1636835935013020, "job": 7943, "event": "table_file_deletion", "file_number": 48510} 2021/11/13-20:38:55.017648 7fec913d9700 [le/delete_scheduler.cc:73] Deleted file /path/to/rocksdb/048497.sst immediately, rate_bytes_per_sec 0, total_trash_size 0 max_trash_db_ratio 0.250000 2021/11/13-20:38:55.017658 7fec913d9700 EVENT_LOG_v1 {"time_micros": 1636835935017656, "job": 7943, "event": "table_file_deletion", "file_number": 48497} 2021/11/13-20:38:55.130709 7fec913d9700 [/db_impl/db_impl.cc:625] Shutdown complete   When the DB was re-opened, it failed and complained about missing files because 048510.sst has been deleted.  When the sst file is missing, RocksDB falls back to look for the ldb file, which doesn’t exist either. 21:00:02.950 [pool-2-thread-2] ERROR c.e.c.d.d.c.impl.RocksDBConnection -  initDBs org.rocksdb.RocksDBException: Corruption: IO error: No such file or directoryWhile open a file for random read: /path/to/rocksdb/048510.ldb: No such file or directory at org.rocksdb.RocksDB.open(Native Method) at org.rocksdb.RocksDB.open(RocksDB.java:306)   We have a way to work around this issue because we took snapshot of the RocksDB instance.  When this issue happens, we copy the deleted sst files back to the RocksDB folder and re-opening can succeed.

After the re-opening, the files we copied back to work around the issue would be deleted in a normal way in compaction later.

We are using version 6.15.5.  My impression is the shutdown of the DB makes the deletion incomplete.  It deleted the files but somewhere the metadata was not updated.  The re-opening thought those files should still be available. Please advice and check the codes to see if any race condition during the shutdown can cause this issue. Thanks!!  

xunlai avatar Jan 22 '22 01:01 xunlai

This is not expected to happen. We only delete files who we've already marked being deleted in manifest. When you dump your manifest file, how does it mention file 48510? I assume some tail record of file deletion is missing?

siying avatar Jan 25 '22 20:01 siying

The ldb command shows the manifest file doesn't contain the four deleted files (48510 is one of them) mentioned in the LOG file.

xunlai avatar Jan 28 '22 16:01 xunlai

This might be related to #10258

dparrella avatar Sep 07 '22 18:09 dparrella

It also happens for us - Can't access /000041.sst: IO error: No such file or directorywhile stat a file for size:

btiwana12 avatar Jun 24 '23 00:06 btiwana12

We are having exactly this issue now and then.

2023/09/20-10:33:12.764015 28c4 RocksDB version: 6.3.6 ... 2023/09/20-10:33:12.780438 28c4 [ERROR] [version_set.cc:1937] Unable to load table properties for file 2301 --- IO error: No such file or directoryNewRandomAccessFile failed to Create/Open: C:\Data\000/002301.sst: The system cannot find the file specified.

This is serious, as it stops production environments from starting. Takes some time to analyze and find the workaround. Finally restoring the file from backup. Leading to several hours of downtime in production. This has occured several times during the last years, in different client production environments. It must be a high priority issue to solve, as it degrades the confidence in RocksDB. Making people think RocksDB is loosing files and data at random. I know this is not true, but others get this impression giving RocksDB bad reputation.

It seems to happen at Database Close. Maybe only in Windows Environments? Is there any way of doing the Close in a safer manner. Like detecting if a Compaction (or other RocksDB task) is in progress and wait for it to finish before calling DB Close ? Do you have any recommendation regarding a safer way to Close the DB to avoid this issue?

sweutv avatar Sep 21 '23 09:09 sweutv