zig
zig copied to clipboard
Use posix semantics when deleting files on Windows
When a file is deleted on Windows, it may not be immediately removed from the directory entry. This can cause problems with future scans of that directory, which will see the partially deleted file. However, it should be noted that under some workloads and system configurations, Windows files may appear to be deleted immediately. This seems to be why the CI is not failing under this bug.
This change forces file deletion to use posix semantics, removing it from the directory immediately on delete.
This fixes a bug that occurs on some windows systems/hard drives where std.fs.Dir.deleteTree fails due to assumed posix semantics. This may also fix rare bugs in other systems (like the cache system), which may assume posix semantics for file deletion.
Looks like the mac os test failed for an unrelated reason? I think this is good to go, unless there's some connection I'm missing.
I think this is the correct approach rather than work around Windows changing file semantics in Windows 10 (found it here https://news.ycombinator.com/item?id=23745019) https://stackoverflow.com/questions/60424732/did-the-behaviour-of-deleted-files-open-with-fileshare-delete-change-on-windows
The workaround with renaming sounds more cumbersome. The other used approach in the linked repo did reportedly still have the same behavior https://github.com/XAMPPRocky/remove_dir_all/issues/7 as of d2248555dd1c8aa08327a75fc024b0520f4afe22, for example in https://github.com/cargo-generate/cargo-generate/issues/619.
One annoyance of all those issues is that reporters did not provide a script + program to reproduce.
A few notes:
- Unfortunately the macOS failure does look relevant. Looks to be a separate self-hosted compiler bug that is revealed when trying to compile the new code in this patch. I'm happy to look into this as a prerequisite fix before merging this PR.
- Can we please document clearly the intended semantics of DeleteFile in doc comments for this function, so that we know what it is supposed to do - particularly with regards to this subtle Windows behavior. This is mostly on me for not doing this documentation originally, but better late than never. Also let's make sure the semantics are clear in the cross-platform function
std.fs.Dir.deleteFile. - I'm concerned that this code which deletes a file, a common file system operation, will now take two syscalls instead of one. I don't fully understand why this is necessary and I'm not yet convinced that we have to give up on this one. Are there circumstances where the former behavior is acceptable? Is this perhaps a case where the API can bifurcate, allowing the caller to request the semantics they need, so that an extra syscall is avoided when possible?
Related to / fixes (?) https://github.com/ziglang/zig/issues/6452
Does not affect https://github.com/ziglang/zig/issues/5537
Also worth noting that this requires Windows 10 >= 1607 (_WIN32_WINNT_WIN10_RS1)
Related to / fixes (?) https://github.com/ziglang/zig/issues/6452
Good find, that's the stack trace that is currently blocking me, which is why I made this PR. This triggers consistently on my computer when I try to run compiler tests (deleteTree is used by std.testing.tmpDir), so I cannot run tests locally until this is fixed.
I'm concerned that this code which deletes a file, a common file system operation, will now take two syscalls instead of one.
It's actually three now instead of two, deleting a file already requires NtCreateFile + CloseHandle. NtCreateFile is slow on Windows so the overhead of an extra syscall is not really in the same ball park. That said, we could have a different function which uses DOS semantics if needed.
I don't fully understand why this is necessary and I'm not yet convinced that we have to give up on this one.
std.fs.Dir.deleteTree makes the implicit assumption that deleted files do not appear when iterating over directories, so it does not work correctly on Windows systems where the deletion is not immediate (causing #6452). Whether deletion is immediate can depend on things like the current behavior of the virus scanner, the capacity of the hard drive, and the general speed of the processor, so it may not be visible on all systems (like the CI machine, apparently).
Are there circumstances where the former behavior is acceptable? Is this perhaps a case where the API can bifurcate, allowing the caller to request the semantics they need, so that an extra syscall is avoided when possible?
There probably are circumstances where DOS semantics are ok, however given deleteTree there are probably other places in the std lib that assume posix semantics. We would need to audit them all and decide if we can switch to the DOS semantics. This is especially difficult because dev machines tend to be fast and may not properly repro DOS semantics (the time between deletion and removal from the directory is nonzero but very small), so using DOS semantics may lead to unreliable and difficult-to-reproduce bugs that only manifest on low-end hard drives or in production.
IMO the default should be posix semantics, with a potential opt-in to DOS semantics to avoid the syscall if performance is worth introducing a possible race condition.
Also worth noting that this requires Windows 10 >= 1607 (
_WIN32_WINNT_WIN10_RS1)
Ah, that's a problem. We probably need a different implementation then. It might be worth looking at another implementation like boost::fs, which implements posix semantics on windows.
Ah, that's a problem. We probably need a different implementation then. It might be worth looking at another implementation like boost::fs, which implements posix semantics on windows.
https://github.com/boostorg/filesystem/blob/fcc11010a5899d6a16d3f0a9d60704f88d62eada/src/operations.cpp#L1421
As I understand it, Zig currently has no bindings to SetFileInformationByHandle, which is heavily used in boost and suggested ie here https://stackoverflow.com/questions/36450222/moving-a-file-using-setfileinformationbyhandle
and generally to do atomic file stuff on Windows (by checking the resulting errors etc).
~~For some reason boost defaults to use file_disposition_info_class, whereas POSIX semantics offers the distinguishment between file_disposition_info_ex_class and file_basic_info_class as internal variables.~~
~~Other than that the implementations look relative identical.~~
Thus my questions:
-
- What is the expected behavior, if we get to a read-only file? This should be also specified in
deleteTree.
- What is the expected behavior, if we get to a read-only file? This should be also specified in
-
- ~~boost does retry setting file attributes~~ ~~(prone to failure, if file is replaced on filesystem while operation is executing)~~
see relevant part of
git show 7403ffca00b788c978d2de11c5148f5fa58555de: below. ~~Should we do anything like this to hack around the problem or what is the expected behavior?~~ ~~I dont understand why they do it for Vista and later. Ideas?~~
- ~~boost does retry setting file attributes~~ ~~(prone to failure, if file is replaced on filesystem while operation is executing)~~
see relevant part of
-
- How can we properly test this? I feel like a few proper tests on according hardware (ie a potato) could provide stronger confidence and allow us to compare perf (loss) of the solutions. For starter points, look here: https://github.com/boostorg/filesystem/issues/216
The implementation uses runtime detection of the feature in the OS.
We are also using two more implementations for file removal: one that
employs the more recent FILE_DISPOSITION_FLAG_IGNORE_READONLY_ATTRIBUTE
flag (available since Windows 10 1809), and FILE_DISPOSITION_INFO
structure (supported since Windows Vista). The former allows to optimize
removal of read-only files, and the latter allows to make file deletion
atomic (i.e. not prone to failure if the file is replaced on the filesystem
while the operation is executing). The implementation is chosen in
runtime, depending on which one succeeds removing a file.
Also, added support for deleting read-only directories, in addition
to non-directory files, and simplified code a little.
Closes https://github.com/boostorg/filesystem/issues/216.
Update: Crossed out clarified parts. Update2: Crossed out more clarified parts (see comment below).
The relevant parts of boost (without functionality fallback) work like the following:
1. set_file_information_by_handle with file_disposition_info_ex_class for deletion; break on success;
2. ifnot ERROR_ACCESS_DENIED goto error;
3. get_file_information_by_handle_ex with file_basic_info_class (basic file info like access time and access permissions); goto error on failure;
4. set_file_information_by_handle without READ-ONLY; goto error on failure;
5. set_file_information_by_handle with file_disposition_info_ex_class for deletion; break on success;
6. set_file_information_by_handle with READ-ONLY; goto error;
Thus the (only?) remaining question is, how we should deal with read-only files.
Context https://docs.microsoft.com/en-us/windows/win32/api/minwinbase/ne-minwinbase-file_info_by_handle_class https://docs.microsoft.com/en-us/windows-hardware/drivers/ddi/ntifs/nf-ntifs-ntquerydirectoryfile
Update: Clarified wording.
Also worth noting that this requires Windows 10 >= 1607 (
_WIN32_WINNT_WIN10_RS1)
According to Microsoft, Windows 8.1 reaches end of Extended Support on January 10, 2023. source
Also worth noting that this requires Windows 10 >= 1607 (
_WIN32_WINNT_WIN10_RS1)According to Microsoft, Windows 8.1 reaches end of Extended Support on January 10, 2023. source
Not sure if/how this factors in, but needing Windows 10 >= 1607 means that builds of Windows 10 earlier than that will also not have the feature; i.e it's not new with Windows 10, it was added in a particular build of Windows 10.
It might be worth looking at another implementation like boost::fs, which implements posix semantics on windows.
Also useful: https://github.com/python/cpython/issues/84324
This PR can still be merged if a target OS version check is added. builtin.target.os.version_range.windows.min.isAtLeast(.win10_rs1)
Unfortunately this was never quite merge-ready. Any contributor is welcome to resurrect this PR against latest master. As far as I know it only needs a OS min version check added.