
Deleting a .fuse_hidden file opened by a SIGKILLed process crashes mount.exfat-fuse with "unable to cleanup a node with 1 references"

Open njjewers opened this issue 6 years ago • 5 comments

I've had some time to drill down on the issue I had earlier, and the filesystem corruption was a red herring. I can reliably crash mount.exfat-fuse 1.3.0 (libfuse 2.9.9, Linux kernel 4.4.38 with FUSE API 7.23; I have not modified libfuse or fuse-exfat) on my aarch64 machine by attempting to delete a .fuse_hidden file that corresponds to a file that was opened (for either reading or writing) by a process that was then killed with SIGKILL. Notably, the crash does not seem to occur if the process is killed by any other signal.

Steps to reproduce:

  • Create a directory in the exfat filesystem
  • In a test process, open a file (for reading or writing) in the newly created directory
  • SIGKILL the test process with the file still open
  • rm -r that directory
    • Expected behaviour is that the directory and its contents are removed
    • Observed behaviour is that a .fuse_hidden file corresponding to the file the test process had open is created in the directory, and the directory is not removed as it is not empty
  • rm -r that directory again
    • Expected behaviour is that the directory and its contents are removed
    • Observed behaviour is that the rm process reports "Software caused connection abort" and mount.exfat-fuse reports "unable to cleanup a node with 1 references" and aborts. Future attempts to access the filesystem fail with "Transport endpoint is not connected"
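The steps above can be sketched as a small script. This is only an illustration, not the script attached to this issue: the function name `reproduce`, the `sigkill-test-` directory prefix, and the pipe-based handshake are assumptions. It performs the first `rm -r` and reports whatever is left behind.

```python
import os
import shutil
import signal
import tempfile


def reproduce(parent_dir):
    """Run the reported steps once under parent_dir.

    Returns (workdir, leftovers), where leftovers lists the entries that
    survived the first recursive removal (e.g. .fuse_hidden* files).
    """
    # Step 1: create a directory in the filesystem under test.
    workdir = tempfile.mkdtemp(prefix="sigkill-test-", dir=parent_dir)
    path = os.path.join(workdir, "testfile")
    with open(path, "w") as f:
        f.write("data\n")

    # Step 2: a child process opens the file and keeps it open.
    ready_r, ready_w = os.pipe()
    pid = os.fork()
    if pid == 0:
        try:
            os.close(ready_r)
            os.open(path, os.O_RDONLY)
            os.write(ready_w, b"x")  # tell the parent the file is open
            while True:
                signal.pause()       # block until killed
        finally:
            os._exit(1)

    os.close(ready_w)
    os.read(ready_r, 1)              # wait for the child's handshake
    os.close(ready_r)

    # Step 3: SIGKILL the child while it still holds the file open.
    os.kill(pid, signal.SIGKILL)
    os.waitpid(pid, 0)

    # Step 4: first "rm -r"; errors such as ENOTEMPTY are ignored here.
    shutil.rmtree(workdir, ignore_errors=True)
    leftovers = sorted(os.listdir(workdir)) if os.path.isdir(workdir) else []
    return workdir, leftovers
```

On a system that behaves as expected, `leftovers` is empty and `workdir` no longer exists. On the affected system one would expect a `.fuse_hidden*` entry in `leftovers`, and a second removal of `workdir` is what triggers the crash described above.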

In this log file I've attached, I create a new exfat filesystem and execute the above steps, leading to a crash. I provide the stack trace from the core file generated.

njjewers, Feb 22 '19 17:02

Weird. I cannot reproduce this issue on my box (Fedora 29 x86_64, Linux 4.20.8, fuse 2.9.7).

When a process is killed, the kernel must close all of its open file descriptors. It looks like this does not happen on your system: fuse renames testfile to .fuse_hidden* instead of removing it because it thinks the file is still open (fuse-exfat thinks the same).

I modified your example to remove testfile while the process that opened it is still alive, but this also works correctly on my system (i.e. flush, release and unlink callbacks of fuse-exfat are called after the process dies).

Please apply this patch to fuse-exfat:

diff --git a/fuse/main.c b/fuse/main.c
index c645390..69b21b5 100644
--- a/fuse/main.c
+++ b/fuse/main.c
@@ -35,7 +35,6 @@
 #include <unistd.h>
 
 #ifndef DEBUG
-       #define exfat_debug(format, ...)
 #endif
 
 #if !defined(FUSE_VERSION) || (FUSE_VERSION < 26)
@@ -498,7 +497,6 @@ static char* add_fuse_options(char* options, const char* spec, bool ro)
                return NULL;
 #endif
 #if defined(__linux__)
-       options = add_blksize_option(options, CLUSTER_SIZE(*ef.sb));
        if (options == NULL)
                return NULL;
 #endif
@@ -528,7 +526,6 @@ int main(int argc, char* argv[])
                        "big_writes,"
 #endif
 #if defined(__linux__)
-                       "blkdev,"
 #endif
                        "default_permissions");
        exfat_options = strdup("ro_fallback");

It makes debugging easier by enabling logging, avoiding losetup, and allowing fuse-exfat to run in the foreground:

sudo ./fuse/mount.exfat-fuse test.exfat /mountpoint -d

Please repeat your test case and post the log.

relan, Feb 22 '19 19:02

After applying that patch, re-running the test case doesn't crash. Here are the logs of mount.exfat-fuse, and the commands that I ran to produce them.

Attachments: commandlog.txt, exfatlog.txt

njjewers, Feb 22 '19 19:02

Smells like a race condition. The logs look like they should. Could you try a newer kernel? It would also be nice if you could automate the reproduction with logging on and run it for some time.
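One possible way to automate repeated runs is sketched below. It is an assumption-laden illustration, not part of fuse-exfat: `mount_is_alive` and `run_until_failure` are hypothetical helper names, and `attempt` stands for whatever callable performs one reproduction. The errno values correspond to the two messages quoted in the report ("Transport endpoint is not connected" is ENOTCONN, "Software caused connection abort" is ECONNABORTED).

```python
import errno
import os

# errno values matching the error messages quoted in the report.
DEAD_MOUNT_ERRNOS = (errno.ENOTCONN, errno.ECONNABORTED)


def mount_is_alive(mountpoint):
    """Return False if the FUSE daemon behind mountpoint appears to be gone."""
    try:
        os.statvfs(mountpoint)
    except OSError as exc:
        if exc.errno in DEAD_MOUNT_ERRNOS:
            return False
        raise
    return True


def run_until_failure(attempt, mountpoint, max_runs=1000):
    """Call attempt() up to max_runs times.

    Returns the 1-based run number at which the mount died, or None if it
    survived every run.
    """
    for run in range(1, max_runs + 1):
        try:
            attempt()
        except OSError as exc:
            if exc.errno not in DEAD_MOUNT_ERRNOS:
                raise
            return run
        if not mount_is_alive(mountpoint):
            return run
    return None
```

Running this against a mount started with `-d` keeps the fuse-exfat log for every iteration, so the log around the failing run number can be inspected afterwards.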

relan, Feb 24 '19 06:02

What would you be looking for in the automated test? Running the script in the logs I've shown causes the non-patched mount.exfat to crash every time, so I don't know how much useful information repeating the test will give. Do you want to see if the patched, non-blkdev mount.exfat will crash given enough repetitions?

I'll see if I can get some spare time to try a newer kernel, though.

njjewers, Feb 25 '19 15:02

> Do you want to see if the patched, non-blkdev mount.exfat will crash given enough repetitions?

Yes, I'd like to confirm that the file descriptor has not been closed after the process was killed.
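A userspace cross-check for lingering descriptors could look like the sketch below. It is Linux-only and hypothetical (`pids_holding` is an illustrative name): it scans `/proc/<pid>/fd` for links that resolve to a given path. It only sees processes readable by the current user, and it complements rather than replaces the flush/release lines in the fuse-exfat debug log.

```python
import os


def pids_holding(path):
    """Return sorted PIDs whose /proc/<pid>/fd entries point at path (Linux only)."""
    target = os.path.realpath(path)
    holders = set()
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # not a process directory
        fd_dir = os.path.join("/proc", entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or we lack permission
        for fd in fds:
            try:
                if os.readlink(os.path.join(fd_dir, fd)) == target:
                    holders.add(int(entry))
            except OSError:
                continue  # descriptor closed while scanning
    return sorted(holders)
```

Calling `pids_holding` on the test file after the SIGKILL and before the `rm -r` would show whether any process still appears to have it open; an empty list together with a .fuse_hidden file would suggest the stale reference is not on the process side.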

> I'll see if I can get some spare time to try a newer kernel, though.

This would help to isolate the bug.

relan, Feb 25 '19 17:02