e2fsprogs
Is there an error code description for e2fsck?
I am making a tool for verifying the file system based on e2fsck, but this tool does not need to be as strict as the official one. When the file system cannot be mounted and must be repaired before it can be used, the tool will return an error code. When the file system has only some statistical errors, the tool can still pass the verification.
I tried to modify the source code, but I found that there are more than 300 error codes. I don't know which error codes affect the use of the file system and which do not. Can anyone help me?
What is your goal? Is it to speed up the time to do a file system check? Is it to be able to run a check while the file system is mounted, with the goal of determining whether system maintenance needs to be scheduled NOW NOW NOW, or whether it's OK to let things slide? Do you care about read-only access, or safe read-write access? Is it acceptable for the tool to say, "Hurr, Durr, who cares, good enough for government work", and then user data gets lost or the system goes down as a result?
The problem is that sometimes a particular "minor" issue might not be a big issue in itself, but it might be a hint that there is something more catastrophic hiding. For example, a tiny puff of smoke emitted from the side of the shuttle where it was not expected, when the space shuttle takes off, might not be a big deal initially, but could lead to catastrophic loss of life a few days later.
So if you want a tool to make more relaxed checks, the question is: are you trying to reduce the runtime of the check? (Perhaps to speed up boot time?) Are you trying to skip doing needed system maintenance? And what kind of risks are you willing to embrace in order to achieve these goals?
Sorry I didn't make it clearer.
My purpose is to relax the verification standards of the tool. When the file system can be mounted in read-only mode and the existing contents of the file system can still be viewed, the verification passes; otherwise, the verification fails.
For example, there is a PR_5_INODE_USED error, but this error does not prevent the file system from being mounted in read-only mode, so the tool can ignore it.
To achieve this goal, I can accept the risk of some inconsistency errors in the file system, as long as these errors do not affect viewing the existing contents of the file system.
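To make this concrete, here is a rough sketch of the kind of check I have in mind, using e2fsck's documented exit status rather than the individual PR_* problem codes (/dev/sdb1 is just a stand-in for the real device):
# e2fsck -fn /dev/sdb1
# echo $?
With -n the check is read-only and answers "no" to every question, and the exit status is a bitmask: 0 means no problems were found, while bit 4 ("errors left uncorrected") is the closest thing to a "must be repaired before use" signal. The problem is that this is all-or-nothing, which is why I am asking about the individual error codes.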
So let's follow Sakichi Toyoda's "5 Whys" analysis technique. Why are you trying to relax the verification standards of the tool?
What is the purpose of this? If fsck can fix the file system, why not have it fix it? Why spend 99.9% of the time that it would take just to fix the darned thing, only to answer the question, "is it OK to mount the file system read-only"?
Because I am doing something like this: I need to monitor someone else's file system and alert them when the file system is abnormal. But I don't have permission to modify other people's file systems, so I can only open them read-only. Because other people may be using the file system, I often see PR_5_INODE_USED errors when I check with e2fsck, but in fact this error does not affect their use of the file system.
My purpose is to alert the user when there is a very serious, fatal error in the file system. Errors like the statistics mismatches can be assumed to come from the user performing file system operations at the time of the check.
Is the real issue that you want to check the file system while it is mounted, and there is a set of errors that are very likely to happen because the file system is in use at the time you are checking it? This is why I am asking why you are trying to do this. Because if this is the answer, the correct answer is to use e2scrub, which takes a read-only snapshot of the file system and then runs e2fsck on that snapshot. That way, you are not modifying the file system, and since the snapshot is taken by briefly freezing the file system so that a coherent snapshot can be taken, you don't need to worry about spurious errors caused by the fact that you are trying to check the file system while it is mounted and in use.
That's because there are a number of file system corruptions that get reported simply because you got unlucky when you ran e2fsck on a mounted file system, and they might very well be corruptions that would make it "dangerous" to mount the file system, even read-only, if they were "real" corruptions, as opposed to spurious issues caused by the fact that you were trying to check a file system while it was mounted.
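For reference, the invocation is minimal. It assumes the file system lives on an LVM logical volume with some unallocated space left in its volume group, which e2scrub needs in order to create the snapshot, and /mnt/data is just a stand-in mount point:
# e2scrub /mnt/data
There is also e2scrub_all, which does the same thing for every eligible mounted file system, and it is typically packaged with a systemd timer so the checks can run periodically.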
Note that you can give an answer like "because", but I don't have to waste my time helping you if you aren't willing to tell me why. I really hate wasting my time when the problem is that the user is asking the wrong question. This is why Zen Buddhist masters were known to rap students on the head when the real issue was that they were so confused/stupid/unenlightened that they were asking the wrong question. Answering their question either "yes" or "no" would both be wrong, because the question was wrong.
In my project, a silent error may occur on the hard disk storing the user's file system, which may cause the file system metadata to become abnormal. When the user reads the abnormal data, the wrong content is read and then written back again, which may lead to more wrong operations. I just want to catch this kind of error first. By the way, do you have any good solutions for this scenario?
If an error is detected because the file system is mounted and being used, there is a high probability that such an error will not occur multiple times in a row. I will alert the user only when the same error occurs multiple times in a row. This way I can largely avoid being unlucky.
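As a rough sketch of what I mean, using the same read-only e2fsck check as above (the device, interval, and threshold are made up for illustration):
fails=0
while sleep 3600; do
    if e2fsck -fn /dev/sdb1 >/dev/null 2>&1; then
        fails=0                 # clean check, reset the counter
    else
        fails=$((fails + 1))    # problems reported (possibly spurious while mounted)
    fi
    if [ "$fails" -ge 3 ]; then
        logger -t fs-watch "e2fsck reported problems on /dev/sdb1 three times in a row"
        fails=0
    fi
done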
What kind of hard drive is this? Consumer HDDs have an Unrecoverable Bit Error Rate (UBER) of 10**-14. For Enterprise HDDs, the UBER is 10**-15. And these numbers are conservative, and understate the actual "real world" error rate[1].
[1] https://heremystuff.wordpress.com/2020/08/25/the-case-of-the-12tb-ure/
In addition, modern ext4 file systems (anything created using e2fsprogs 1.44 or newer --- and e2fsprogs 1.44 was released four years ago, in 2017) have metadata_csum enabled by default. That means that all metadata blocks are protected by checksums, so the file system will automatically detect corruption whenever a metadata block is read. At that point, the kernel can do one of a number of different things. The default, for backwards compatibility reasons, is to just log a message and continue. Another option is to remount the file system read-only, which will prevent things from getting worse; but since many applications may not properly check error returns, this could cause them to malfunction in surprising ways. The third option is to force a reboot so that fsck can check and repair the file system. In a high-availability setup where a backup system can take over when the primary system is unavailable, forcing a panic or reboot when a file system inconsistency (including a metadata checksum failure) is detected may be the right thing to do.
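For example, the behavior can be selected with tune2fs or overridden at mount time (using /dev/sdb1 as a stand-in device):
# tune2fs -e remount-ro /dev/sdb1
# mount -o errors=panic /dev/sdb1 /mnt
tune2fs -e accepts continue, remount-ro, or panic and records the choice in the superblock; the errors= mount option overrides it for a single mount.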
So in answer to your specific question --- what about metadata blocks being written incorrectly without the error being detected --- this is actually not a real problem: first, because in actual practice such errors are vanishingly rare, and most of the time the hard drive will detect the problem and return an I/O error, as opposed to silently corrupting data. But if it does happen, ext4 has metadata block checksums that will detect the problem as soon as the metadata block is returned from the HDD to the file system.
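If you want to confirm that a given file system actually has this protection, look for metadata_csum in its feature list (again, /dev/sdb1 is a stand-in):
# dumpe2fs -h /dev/sdb1 | grep -i features
If metadata_csum does not appear in the list, the file system was created without checksummed metadata.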
As for your second paragraph, "If an error is detected because the file system is mounted and being used, there is a high probability that such an error will not occur multiple times in a row. I will alert the user only when the same error occurs multiple times in a row. This way I can largely avoid being unlucky." --- this makes no sense to me. What the heck do you mean by "such an error will not occur multiple times in a row"?!? If a metadata block is corrupted when it is written, then when it is read back you will get the same bad value each time, so the error absolutely will happen every time. For example, if the root inode is corrupted, then every single time the file system is mounted, the exact same error will be reported. And we already report such errors to the user, or at least to the kernel logs.
So for example, this sequence:
# mke2fs -Fq -t ext4 /tmp/foo.img 8m
# debugfs -w -R "clri <2>" /tmp/foo.img
# mount -o loop /tmp/foo.img /mnt
... where the debugfs command simulates a specific kind of file system corruption which could have been induced by a hard drive write failure (although this is highly unlikely), reliably and repeatedly results in the same set of kernel messages whenever you try to mount the file system:
EXT4-fs error (device loop0): ext4_fill_super:4942: inode #2: comm mount: iget: root inode unallocated
EXT4-fs (loop0): get root inode failed
EXT4-fs (loop0): mount failed
Or for more yuks, we could emulate an invalid inode checksum on the root inode like this:
# mke2fs -Fq -t ext4 /tmp/foo.img 8m
# debugfs -w -R "set_inode_field <2> checksum 0" /tmp/foo.img
# mount -o loop /tmp/foo.img /mnt
... and then you will get the following kernel logs:
EXT4-fs error (device loop0): ext4_fill_super:4942: inode #2: comm mount: iget: checksum invalid
EXT4-fs (loop0): get root inode failed
EXT4-fs (loop0): mount failed
So let me ask another "5 whys" question. WHY are you trying to worry about failed writes, especially for metadata blocks? (Note that 99.99% of the blocks on a typical disk are data blocks, not metadata blocks.) Is this for some kind of stupid theoretical classroom project? Are you using some kind of hyper-unreliable storage device? If so, why? What you seem to be designing for makes no sense whatsoever.