
documentation: what errors are fatal?

Open jowagner opened this issue 3 years ago • 1 comment

There is a mount option fatal_errors=<action> but

  • no explanation of which errors are considered fatal
  • no corresponding option for non-fatal errors
  • no "report-and-continue" action

I also didn't find anything relevant about fatal errors in the btrfs wiki, and a Google search to find out more was equally unsuccessful.

The discussion in issue #451

Normally "all mirrors failed" means we immediately pull the host out of the service pool for maintenance

suggests that an inability to read a metadata block with a correct checksum is a fatal error.

I always assumed, based on my past experience with I/O errors, that Linux isolates problems and tries to continue. Now I checked the man pages for stat, open, and opendir, and I see they don't even list EIO as a possible error code; only read and write have it. Does this mean that any unrecoverable metadata error is a fatal error? Are only errors in file data non-fatal?

jowagner • Mar 22 '22

Fatal errors are conditions marked in the kernel source with the btrfs_panic function family. The current set of conditions is:

  • extent tree modified while locked
  • backref cache inconsistency detected
  • bad key order when adding a new extent to an inode

The set of specific conditions changes from one kernel release to another. As the implementation changes, new conditions arise and old conditions are eliminated. The details are only relevant to btrfs kernel developers. In each case, a serious internal inconsistency was detected in kernel memory (i.e. not read from disk, but possibly computed from on-disk data), requiring btrfs to immediately stop execution to avoid catastrophic downstream effects.

The fatal_errors option allows the user to select whether btrfs_panic will stop a single kernel thread (by invoking BUG() to forcibly halt the thread that detected the fatal error condition) or stop the entire kernel (by invoking panic() to halt the system).
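
A simplified sketch of that dispatch, modeled loosely on __btrfs_panic() in fs/btrfs/super.c (error formatting and locking omitted; names and details vary between kernel releases, so treat this as illustration rather than the actual source):

```c
/*
 * Illustrative sketch only -- modeled on __btrfs_panic() in
 * fs/btrfs/super.c, with message formatting and locking omitted.
 * The real code differs between kernel releases.
 */
static void btrfs_panic_sketch(struct btrfs_fs_info *fs_info, const char *msg)
{
	/* fatal_errors=panic: halt the entire kernel */
	if (btrfs_test_opt(fs_info, PANIC_ON_FATAL_ERROR))
		panic("BTRFS panic: %s", msg);

	/* fatal_errors=bug (the default): log the error, then kill
	 * only the kernel thread that detected it */
	btrfs_crit(fs_info, "panic: %s", msg);
	BUG();
}
```

The option is chosen at mount time, e.g. mount -o fatal_errors=panic, with bug as the default.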

In either case, the error condition can only be cleared by rebooting. In the panic case, the kernel is no longer running, so reboot is the only possible path forward. In the BUG() case, the individual kernel thread has been terminated, but it has likely left behind locked objects that will block access by any other thread that attempts to modify the affected filesystem, including umounting it.

Continuing after btrfs_panic (or BUG() calls in general) will often result in immediate and severe filesystem damage and possibly exacerbate existing kernel memory corruption, so there's no option to continue after a fatal error. Panic is reserved for cases where the error is so severe that error recovery may be unreliable and possibly even dangerous to other parts of the kernel (e.g. other filesystems mounted on the same host).

Note that btrfs also has numerous calls to BUG() and BUG_ON() which are not affected by the mount option. These detect conditions that are only expected to arise as a result of kernel bugs.

Note also that the Linux kernel may be configured to globally translate any call to BUG() into panic() (e.g. using the oops=panic command line option, the kernel.panic_on_oops sysctl, or CONFIG_PANIC_ON_OOPS set during kernel build-time configuration). In that case there is no effective difference between the fatal_errors options, as they all ultimately end in kernel termination.

Does this mean that any unrecoverable metadata error is a fatal error?

Non-fatal errors are handled internally (e.g. correcting bad sectors from mirrors) or reported to applications (as IO errors for most common syscalls, or SIGBUS for shared mapped pages). These are normal behavior of the filesystem (e.g. if a user reads a data block and the underlying device returns EIO, btrfs simply passes the EIO status up the call stack) and have no special treatment.
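
As a concrete illustration, here is a minimal userspace sketch of how such a non-fatal data error reaches an application as a plain EIO from read(2) (the file path is hypothetical):

```c
/* Minimal sketch: an uncorrectable btrfs data error surfaces to the
 * application as an ordinary EIO from read(2).
 * The path below is hypothetical. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	char buf[4096];
	int fd = open("/mnt/btrfs/file-with-bad-extent", O_RDONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (read(fd, buf, sizeof(buf)) < 0 && errno == EIO)
		fprintf(stderr, "uncorrectable data error: %s\n",
			strerror(errno));
	close(fd);
	return 0;
}
```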

In most cases it is possible for the filesystem to continue operating normally after a metadata read error, but any operation that relies on the metadata will not be able to proceed (e.g. if an interior node of a tree is lost, all parts of the filesystem beyond the lost node become inaccessible). Depending on which specific page is lost, the entire filesystem's contents may become inaccessible.

Some errors will force the filesystem into a read-only state, but are not severe enough to require forcibly stopping a kernel thread. Metadata ENOSPC and unrecoverable metadata write errors are common cases of this. In these cases neither BUG() nor panic() is called, so the filesystem may be recovered by umounting and mounting again, no reboot required (though if the filesystem on disk is corrupt or the disk is failing, it is likely the same error will immediately reoccur).
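
An application or monitoring script can notice the forced read-only transition through ordinary interfaces; here is a sketch using statvfs(3) (the mount point is hypothetical, and this reports only the current mount flags, not why the filesystem became read-only; the reason is in the kernel log):

```c
/* Sketch: detect a filesystem that has been forced read-only by
 * checking the ST_RDONLY flag reported by statvfs(3).
 * The mount point is hypothetical. */
#include <stdio.h>
#include <sys/statvfs.h>

int main(void)
{
	struct statvfs st;

	if (statvfs("/mnt/btrfs", &st) != 0) {
		perror("statvfs");
		return 1;
	}
	printf("filesystem is %s\n",
	       (st.f_flag & ST_RDONLY) ? "read-only" : "read-write");
	return 0;
}
```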

Are only errors in file data non-fatal?

That's more or less correct. Users should design assuming any uncorrectable errors in metadata will force the filesystem read-only, though in practice some are recoverable.

Disk failures can be corrected using resilient metadata (i.e. csums to detect the failures and mirror copies to replace the lost data), but errors from higher layers of the system may be fatal. These include errors from sources such as:

  • host hardware with direct access to alter the contents of btrfs kernel memory: CPU, RAM, DMA-capable peripherals
  • host hardware which may adversely affect the correct behavior of memory: memory buses, power supply, cooling subsystem
  • kernel bugs, either in btrfs or other parts of the kernel (e.g. use-after-free in a device driver)

Some metadata errors are recoverable, but require a sequence of umount, possible on-disk intervention with a tool like btrfs rescue zero-log, and mount to resynchronise kernel memory state with the on-disk filesystem. Generally btrfs recovers from errors by ignoring anything written to disk since the last completed transaction, but there's no reasonable way to report the resulting changes in filesystem content to user applications. Applications must discard any open file descriptors on the filesystem and rebuild their state based on the data that is present on disk, and the best available interface for that is to force all the applications to close their files, umount the filesystem, mount it again, and reopen the files (if they still exist).
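
A sketch of that sequence as a supervising process might drive it, assuming every application has already closed its file descriptors on the filesystem (the device, mount point, and the need for zero-log are hypothetical; btrfs rescue zero-log is a separate userspace tool, shelled out to here purely for illustration):

```c
/* Sketch of the recover-by-remount sequence. Assumes all processes
 * have already closed their file descriptors on the filesystem.
 * Device and mount point are hypothetical; run as root. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/mount.h>

int main(void)
{
	if (umount("/mnt/btrfs") != 0) {  /* fails with EBUSY if fds remain open */
		perror("umount");
		return 1;
	}
	/* on-disk intervention, only when the log tree is the problem */
	if (system("btrfs rescue zero-log /dev/sdX") != 0)
		return 1;
	if (mount("/dev/sdX", "/mnt/btrfs", "btrfs", 0, NULL) != 0) {
		perror("mount");
		return 1;
	}
	puts("remounted; applications may now reopen their files");
	return 0;
}
```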

Data errors are always trivially recoverable by deleting all references to the affected data extents from the filesystem (excluding inline extents, which are small files stored directly in metadata items).

Edit: more clearly distinguish between "fatal" as in "fatal_error" mount option, and "fatal" as in "really bad thing that happens to the filesystem, but that doesn't cripple the host that mounted it."

Zygo • Apr 01 '22