zig icon indicating copy to clipboard operation
zig copied to clipboard

Use posix semantics when deleting files on Windows

Open SpexGuy opened this issue 3 years ago • 11 comments
trafficstars

When a file is deleted on Windows, it may not be immediately removed from the directory entry. This can cause problems with future scans of that directory, which will see the partially deleted file. However, it should be noted that under some workloads and system configurations, Windows files may appear to be deleted immediately. This seems to be why the CI is not failing under this bug.

This change forces file deletion to use posix semantics, removing it from the directory immediately on delete.

This fixes a bug that occurs on some windows systems/hard drives where std.fs.Dir.deleteTree fails due to assumed posix semantics. This may also fix rare bugs in other systems (like the cache system), which may assume posix semantics for file deletion.

SpexGuy avatar Jun 04 '22 22:06 SpexGuy

Looks like the mac os test failed for an unrelated reason? I think this is good to go, unless there's some connection I'm missing.

SpexGuy avatar Jun 05 '22 02:06 SpexGuy

I think this is the correct approach rather than work around Windows changing file semantics in Windows 10 (found it here https://news.ycombinator.com/item?id=23745019) https://stackoverflow.com/questions/60424732/did-the-behaviour-of-deleted-files-open-with-fileshare-delete-change-on-windows

The workaround with renaming sounds more cumbersome. The other used approach in the linked repo did reportedly still have the same behavior https://github.com/XAMPPRocky/remove_dir_all/issues/7 as of d2248555dd1c8aa08327a75fc024b0520f4afe22, for example in https://github.com/cargo-generate/cargo-generate/issues/619.

One annoyance of all those issues is that reporters did not provide a script + program to reproduce.

matu3ba avatar Jun 05 '22 12:06 matu3ba

A few notes:

  • Unfortunately the macOS failure does look relevant. Looks to be a separate self-hosted compiler bug that is revealed when trying to compile the new code in this patch. I'm happy to look into this as a prerequisite fix before merging this PR.
  • Can we please document clearly the intended semantics of DeleteFile in doc comments for this function, so that we know what it is supposed to do - particularly with regards to this subtle Windows behavior. This is mostly on me for not doing this documentation originally, but better late than never. Also let's make sure the semantics are clear in the cross-platform function std.fs.Dir.deleteFile.
  • I'm concerned that this code which deletes a file, a common file system operation, will now take two syscalls instead of one. I don't fully understand why this is necessary and I'm not yet convinced that we have to give up on this one. Are there circumstances where the former behavior is acceptable? Is this perhaps a case where the API can bifurcate, allowing the caller to request the semantics they need, so that an extra syscall is avoided when possible?

andrewrk avatar Jun 05 '22 19:06 andrewrk

Related to / fixes (?) https://github.com/ziglang/zig/issues/6452

Does not affect https://github.com/ziglang/zig/issues/5537

Also worth noting that this requires Windows 10 >= 1607 (_WIN32_WINNT_WIN10_RS1)

squeek502 avatar Jun 05 '22 21:06 squeek502

Related to / fixes (?) https://github.com/ziglang/zig/issues/6452

Good find, that's the stack trace that is currently blocking me, which is why I made this PR. This triggers consistently on my computer when I try to run compiler tests (deleteTree is used by std.testing.tmpDir), so I cannot run tests locally until this is fixed.

I'm concerned that this code which deletes a file, a common file system operation, will now take two syscalls instead of one.

It's actually three now instead of two, deleting a file already requires NtCreateFile + CloseHandle. NtCreateFile is slow on Windows so the overhead of an extra syscall is not really in the same ball park. That said, we could have a different function which uses DOS semantics if needed.

I don't fully understand why this is necessary and I'm not yet convinced that we have to give up on this one.

std.fs.Dir.deleteTree makes the implicit assumption that deleted files do not appear when iterating over directories, so it does not work correctly on Windows systems where the deletion is not immediate (causing #6452). Whether deletion is immediate can depend on things like the current behavior of the virus scanner, the capacity of the hard drive, and the general speed of the processor, so it may not be visible on all systems (like the CI machine, apparently).

Are there circumstances where the former behavior is acceptable? Is this perhaps a case where the API can bifurcate, allowing the caller to request the semantics they need, so that an extra syscall is avoided when possible?

There probably are circumstances where DOS semantics are ok, however given deleteTree there are probably other places in the std lib that assume posix semantics. We would need to audit them all and decide if we can switch to the DOS semantics. This is especially difficult because dev machines tend to be fast and may not properly repro DOS semantics (the time between deletion and removal from the directory is nonzero but very small), so using DOS semantics may lead to unreliable and difficult-to-reproduce bugs that only manifest on low-end hard drives or in production.

IMO the default should be posix semantics, with a potential opt-in to DOS semantics to avoid the syscall if performance is worth introducing a possible race condition.

Also worth noting that this requires Windows 10 >= 1607 (_WIN32_WINNT_WIN10_RS1)

Ah, that's a problem. We probably need a different implementation then. It might be worth looking at another implementation like boost::fs, which implements posix semantics on windows.

SpexGuy avatar Jun 05 '22 22:06 SpexGuy

Ah, that's a problem. We probably need a different implementation then. It might be worth looking at another implementation like boost::fs, which implements posix semantics on windows.

https://github.com/boostorg/filesystem/blob/fcc11010a5899d6a16d3f0a9d60704f88d62eada/src/operations.cpp#L1421

As I understand it, Zig currently has no bindings to SetFileInformationByHandle, which is heavily used in boost and suggested ie here https://stackoverflow.com/questions/36450222/moving-a-file-using-setfileinformationbyhandle and generally to do atomic file stuff on Windows (by checking the resulting errors etc).

~~For some reason boost defaults to use file_disposition_info_class, whereas POSIX semantics offers the distinguishment between file_disposition_info_ex_class and file_basic_info_class as internal variables.~~

~~Other than that the implementations look relative identical.~~

Thus my questions:

    1. What is the expected behavior, if we get to a read-only file? This should be also specified in deleteTree.
    1. ~~boost does retry setting file attributes~~ ~~(prone to failure, if file is replaced on filesystem while operation is executing)~~ see relevant part of git show 7403ffca00b788c978d2de11c5148f5fa58555de: below. ~~Should we do anything like this to hack around the problem or what is the expected behavior?~~ ~~I dont understand why they do it for Vista and later. Ideas?~~
    1. How can we properly test this? I feel like a few proper tests on according hardware (ie a potato) could provide stronger confidence and allow us to compare perf (loss) of the solutions. For starter points, look here: https://github.com/boostorg/filesystem/issues/216
    The implementation uses runtime detection of the feature in the OS.
    We are also using two more implementations for file removal: one that
    employs the more recent FILE_DISPOSITION_FLAG_IGNORE_READONLY_ATTRIBUTE
    flag (available since Windows 10 1809), and FILE_DISPOSITION_INFO
    structure (supported since Windows Vista). The former allows to optimize
    removal of read-only files, and the latter allows to make file deletion
    atomic (i.e. not prone to failure if the file is replaced on the filesystem
    while the operation is executing). The implementation is chosen in
    runtime, depending on which one succeeds removing a file.

    Also, added support for deleting read-only directories, in addition
    to non-directory files, and simplified code a little.

    Closes https://github.com/boostorg/filesystem/issues/216.

Update: Crossed out clarified parts. Update2: Crossed out more clarified parts (see comment below).

matu3ba avatar Jun 07 '22 21:06 matu3ba

The relevant parts of boost (without functionality fallback) work like the following:

1. set_file_information_by_handle with file_disposition_info_ex_class for deletion; break on success;
2. ifnot ERROR_ACCESS_DENIED goto error;
3. get_file_information_by_handle_ex with file_basic_info_class (basic file info like access time and access permissions); goto error on failure;
4. set_file_information_by_handle without READ-ONLY; goto error on failure;
5. set_file_information_by_handle with file_disposition_info_ex_class for deletion; break on success;
6. set_file_information_by_handle with READ-ONLY; goto error;

Thus the (only?) remaining question is, how we should deal with read-only files.

Context https://docs.microsoft.com/en-us/windows/win32/api/minwinbase/ne-minwinbase-file_info_by_handle_class https://docs.microsoft.com/en-us/windows-hardware/drivers/ddi/ntifs/nf-ntifs-ntquerydirectoryfile

Update: Clarified wording.

matu3ba avatar Jun 08 '22 14:06 matu3ba

Also worth noting that this requires Windows 10 >= 1607 (_WIN32_WINNT_WIN10_RS1)

According to Microsoft, Windows 8.1 reaches end of Extended Support on January 10, 2023. source

andrewrk avatar Oct 17 '22 22:10 andrewrk

Also worth noting that this requires Windows 10 >= 1607 (_WIN32_WINNT_WIN10_RS1)

According to Microsoft, Windows 8.1 reaches end of Extended Support on January 10, 2023. source

Not sure if/how this factors in, but needing Windows 10 >= 1607 means that builds of Windows 10 earlier than that will also not have the feature; i.e it's not new with Windows 10, it was added in a particular build of Windows 10.

squeek502 avatar Oct 17 '22 22:10 squeek502

It might be worth looking at another implementation like boost::fs, which implements posix semantics on windows.

Also useful: https://github.com/python/cpython/issues/84324

andrewrk avatar Oct 17 '22 23:10 andrewrk

This PR can still be merged if a target OS version check is added. builtin.target.os.version_range.windows.min.isAtLeast(.win10_rs1)

andrewrk avatar Oct 17 '22 23:10 andrewrk

Unfortunately this was never quite merge-ready. Any contributor is welcome to resurrect this PR against latest master. As far as I know it only needs a OS min version check added.

andrewrk avatar Feb 16 '23 23:02 andrewrk