rocksdb Failed to resume DB after "no space left on device" error.

It looks like RocksDB fails to resume after hitting a "no space left on device" error. This error is classified as a HardError, which according to Background-Error-Handling should be recoverable back to read-write mode once the underlying issue has been addressed.

Unfortunately, 10489 (introduced by @siying) adds a flag in WritableFileWriter that remembers that an error has occurred. Resuming the DB after a no-space error doesn't clear the flag, so subsequent writes keep failing even though the recovery completed successfully.
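
To make the failure mode concrete, here is a minimal sketch of a writer with a sticky error flag. It is a toy model, not RocksDB code: ToyFileWriter, Grow and the capacity counter are hypothetical names that only mirror the behavior described above.

```cpp
#include <cstddef>
#include <string>

// Toy stand-in for a file writer with a sticky error flag. All names are
// hypothetical; this is not the real WritableFileWriter.
class ToyFileWriter {
 public:
  explicit ToyFileWriter(std::size_t capacity) : capacity_(capacity) {}

  // Returns true on success. Once any append fails, every later append is
  // rejected up front until reset_seen_error() is called.
  bool Append(const std::string& data) {
    if (seen_error_) return false;          // sticky failure path
    if (used_ + data.size() > capacity_) {  // simulated "no space left"
      seen_error_ = true;
      return false;
    }
    used_ += data.size();
    return true;
  }

  // Simulates freeing disk space, e.g. remounting with a larger size.
  void Grow(std::size_t new_capacity) { capacity_ = new_capacity; }

  void reset_seen_error() { seen_error_ = false; }
  bool seen_error() const { return seen_error_; }

 private:
  std::size_t capacity_;
  std::size_t used_ = 0;
  bool seen_error_ = false;
};
```

In this model, calling Grow() after a failed Append() is not enough: appends keep returning false until reset_seen_error() is called, which is the same shape as the behavior reported here.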

Expected behavior

RocksDB should be able to recover after a no-space error, either by calling Resume manually or through auto recovery.

Actual behavior

Both auto recovery and calling Resume manually succeed, but any further writes fail.

Steps to reproduce the behavior

Consider the following code:

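#include <unistd.h>

#include <atomic>
#include <cstdlib>
#include <iostream>
#include <memory>
#include <string>

#include "rocksdb/db.h"
#include "rocksdb/listener.h"
#include "rocksdb/write_batch.h"

class StorageExtender : public rocksdb::EventListener
{
public:
    // Set from a RocksDB background thread and polled by main(), so it is atomic.
    std::atomic<bool> isRecovered{false};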
    StorageExtender() = default;
    // Called when RocksDB starts recovering from a background error.
    void OnErrorRecoveryBegin(
        rocksdb::BackgroundErrorReason /*reason*/,
        rocksdb::Status bg_error,
        bool* /*auto_recovery*/
    ) override {
        std::cout << "Got error:" << bg_error.ToString() << std::endl;
        if (bg_error.IsNoSpace()) {
            system("mount -o remount,size=256M /mnt/mytmpfs");
        }
    }
    // Called once recovery has finished; signals main() to retry the write.
    void OnErrorRecoveryCompleted(rocksdb::Status old_bg_error) override {
        std::cout << "Recovered from error:" << old_bg_error.ToString() << std::endl;
        isRecovered = true;
    }
};
int main()
{
    auto storageExtender = std::make_shared<StorageExtender>();
    rocksdb::DB* db{};
    rocksdb::Options options;
    options.listeners.push_back(storageExtender);
    options.create_if_missing = true;
    system("umount /mnt/mytmpfs");
    system("mount -t tmpfs -o size=1024K tmpfs /mnt/mytmpfs");
    rocksdb::Status s = rocksdb::DB::Open(options, "/mnt/mytmpfs", &db);
    if (!s.ok()) {
        std::cout << s.ToString() << std::endl;
        return 1;
    }
    rocksdb::WriteBatch wb;
    for (int i = 0; i < 1024 * 1024 * 5; ++i) {
        auto kv = std::to_string(i);
        s = wb.Put(kv, kv);
    }
    rocksdb::WriteOptions wo;
    s = db->Write(wo, &wb);
    std::cout << s.ToString() << std::endl;
    // system("mount -o remount,size=32768K /mnt/mytmpfs");
    std::cout << "Waiting for recovery to complete" << std::endl;
    while (!storageExtender->isRecovered) {
        std::cout << "."  << std::endl;
        sleep(1);
    }
    std::cout << "Done" << std::endl;
    s = db->Write(wo, &wb);
    std::cout << s.ToString() << std::endl;
    return 0;
}

In the code example above, I expect the DB to be writable again once disk space has been freed, but writes keep failing because of the seen_error_ flag in WritableFileWriter.

Notes:

It looks similar to 9762, but has a wider effect, since it happens not only with PessimisticTransactionDB.
A possible solution is to call logs_.back().writer->file()->reset_seen_error(); in ResumeImpl, but I'm not familiar enough with RocksDB's error handling to know what the consequences of such a change would be.

Jul 26 '23 12:07 assafka

we hit this too, any update?

Jun 12 '24 03:06 lycplus

I don't think we can simply reset error on the last WAL. In this repro, nothing is written to memtable and WAL has a partial write batch record. After error is cleared, new writes will append to this WAL after the corrupted record, and may not be recovered after reopening.

Jun 22 '24 18:06 cbi42

> I don't think we can simply reset error on the last WAL. In this repro, nothing is written to memtable and WAL has a partial write batch record. After error is cleared, new writes will append to this WAL after the corrupted record, and may not be recovered after reopening.

where in the code do we insert the partial record? maybe we should undo the inserting when encountering no space error?

Jun 25 '24 02:06 lycplus

> where in the code do we insert the partial record? maybe we should undo the inserting when encountering no space error?

Sorry for the late reply. When writing a write batch to WAL, the write batch can be broken down into several WAL records and several writes. This happens in https://github.com/facebook/rocksdb/blob/b6c3495a7183f01901d3be01dc68f7e40a1a2e9b/db/log_writer.cc#L91. WAL record formats are explained in log_writer.h.
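
As a rough illustration of that splitting, the sketch below computes the fragments for a payload of a given size. It assumes the legacy (non-recyclable) record layout with 32 KiB blocks and a 7-byte header; SplitIntoFragments, Fragment and RecordType are hypothetical names for this example, not the actual log writer API.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

enum class RecordType { kFull, kFirst, kMiddle, kLast };

struct Fragment {
  RecordType type;
  std::size_t payload_bytes;
};

// Assumed constants: 32 KiB blocks and a 7-byte legacy record header.
constexpr std::size_t kBlockSize = 32768;
constexpr std::size_t kHeaderSize = 7;

// Splits a payload of `size` bytes, starting at `block_offset` within the
// current block (must be <= kBlockSize), into the physical fragments that
// would be emitted, one write per fragment.
std::vector<Fragment> SplitIntoFragments(std::size_t size,
                                         std::size_t block_offset = 0) {
  std::vector<Fragment> out;
  bool begin = true;
  std::size_t left = size;
  do {
    std::size_t leftover = kBlockSize - block_offset;
    if (leftover < kHeaderSize) {
      // Too small for a header: the tail is padded and a new block starts.
      block_offset = 0;
      leftover = kBlockSize;
    }
    const std::size_t avail = leftover - kHeaderSize;
    const std::size_t n = std::min(left, avail);
    const bool end = (left == n);
    const RecordType t = (begin && end) ? RecordType::kFull
                         : begin        ? RecordType::kFirst
                         : end          ? RecordType::kLast
                                        : RecordType::kMiddle;
    out.push_back({t, n});
    block_offset += kHeaderSize + n;
    left -= n;
    begin = false;
  } while (left > 0);
  return out;
}
```

Under these assumptions, a 100000-byte batch written at the start of a block becomes four fragments (first, middle, middle, last). If the device fills up after one of the earlier fragments, the WAL ends with a first or middle fragment and no matching last one, which is the partial record discussed above.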

I think for auto-recovery, we should make sure not to write to this WAL anymore and make sure this WAL is not needed for this DB anymore. For the latter, we do flush during auto-recovery, but in this case memtable could be empty and flush is not done.

Jul 07 '24 22:07 cbi42