directory_handle::read is stuck
I keep running into problems with this function. It doesn't happen for all folders but for some, it keeps retrying reading until the timeout is reached. The folder that I'm testing on has 733 files. Using static llfio from vcpkg, version 2023-11-06. I based my code on this example: https://github.com/ned14/llfio/blob/develop/example/read_directory.cpp.
Everything goes well here - once the buffer of the right size is allocated, the data gets read into it. ntstat is 0.
ULONG max_bytes, bytes;
buffer = req.kernelbuffer.empty() ? reinterpret_cast<what_to_enumerate_type *>(req.buffers._kernel_buffer.get()) : reinterpret_cast<what_to_enumerate_type *>(req.kernelbuffer.data());
max_bytes = req.kernelbuffer.empty() ? static_cast<ULONG>(req.buffers._kernel_buffer_size) : static_cast<ULONG>(req.kernelbuffer.size());
bytes = std::min(max_bytes, (ULONG) kernelbuffertoallocate);
IO_STATUS_BLOCK isb = make_iostatus();
NTSTATUS ntstat = NtQueryDirectoryFile(_v.h, nullptr, nullptr, nullptr, &isb, buffer, bytes, what_to_enumerate, FALSE, req.glob.empty() ? nullptr : &_glob, TRUE);
However, this part comes next:
{
alignas(8) char _buffer[4096];
auto *buffer_ = (what_to_enumerate_type *) _buffer;
isb = make_iostatus();
ntstat = NtQueryDirectoryFile(_v.h, nullptr, nullptr, nullptr, &isb, buffer_, sizeof(_buffer), what_to_enumerate, TRUE, req.glob.empty() ? nullptr : &_glob, FALSE);
if(ntstat != 0x80000006 /*STATUS_NO_MORE_FILES*/)
{
// The directory grew between first enumeration and second
LLFIO_DEADLINE_TO_TIMEOUT_LOOP(d);
goto retry;
}
}
ntstat is 0 again... The directory did not grow, the size is constant. But the call does not return STATUS_NO_MORE_FILES. This happens in a loop. ntstat is always 0 for both calls. After the second call, buffer_ contains a single file name: st16.ogg. It is indeed a file from that folder. If I remove it, the buffer contains st17.ogg instead. Both files are present in the original buffer from the first snippet.
What is happening? I don't even care about concurrency for my use case. Can I do anything about it?
My code (simplified):
auto p = llfio::path(directory).value();
auto dirHandle= llfio::directory(p, "").value();
std::vector<llfio::directory_handle::buffer_type> buffer(8000);
llfio::directory_handle::buffers_type entries(buffer);
entries = dirHandle.read(
{std::move(entries)}
).value();
Perhaps if it's not adding more files, and not returning no-more-files, it should double the buffer and retry?
I suspect it could be a bug in the NT kernel code. Not a lot of non-Microsoft userspace code uses the NT API directly.
Perhaps if it's not adding more files, and not returning no-more-files, it should double the buffer and retry?
The buffer does not seem to be full. I made it very big on purpose. I've just tried increasing the size from 8'000 to 20'000. It makes no difference.
Here is my test folder. Let me know if you can reproduce the issue. https://mega.nz/file/tRc3HBKZ#iIHjyBU3fzdeLfc2hPx9n4qxuH68jKD33Pn6tBUSv20
Oh okay.
Just so I fully understand the problem here, for your particular directory of 733 files, NtQueryDirectoryFile even when fed an oversized buffer returns the full correct set first call, but does not return STATUS_NO_MORE_FILES to say it is done. Instead second iteration gives a single file entry, and then the third iteration gives a single file entry, and both those file entries were in the original buffer fill.
If that is correct, I have memories of encountering this before, but the way in which LLFIO calls NtQueryDirectoryFile should not produce this behaviour.
Can I check a few things?
- Is your kernel 64 bit and your userspace 32 bit?
- Is your underlying filesystem NTFS?
- Which exact Windows version are you on?
- Are you on x86, x64 or ARM?
- Did you open the handle overlapped or do anything which might make it overlapped?
System type 64-bit operating system, x64-based processor
Edition Windows 10 Education
Version 22H2
Installed on 26/10/2020
OS build 19045.4651
Experience Windows Feature Experience Pack 1000.19060.1000.0
The filesystem is NTFS on an M.2 SSD drive.
You can see how I opened the handle in the last snippet of my first post. In that snippet, directory is an absolute std::filesystem::path to the directory (including the backslash at the end).
Here is my full code: https://github.com/Bobini1/RhythmGame/blob/b3adb12b0dc5656db21a570d26fd9df8bec7d5bc/src/resource_managers/SongDbScanner.cpp
Upon further investigation:
total_itemsis 735, despite the folder containing 733 files (edit: it contains.and..). The required buffer size calculated by llfio is 83880.- The first read does actually cut out at
st15.ogg, beforest16.oggandst17.ogg. I was wrong, sorry. So the first call does not get all data from the directory even though the buffer is big enough.
I invoked this code:
static HMODULE ntdllh = GetModuleHandleA("NTDLL.DLL");
ULONG max_bytes, bytes;
char buffer[180000];
max_bytes = 180000;
IO_STATUS_BLOCK isb{};
memset(&isb, 0, sizeof(isb));
isb.Status = -1;
HANDLE dirHandle;
auto path = llfio::path("C:\\Users\\PC\\Desktop\\2007\\cutwork_lament\\").value();
auto dir = llfio::directory(path, "").value();
auto NtQueryDirectoryFile = reinterpret_cast<NtQueryDirectoryFile_t>(GetProcAddress(ntdllh, "NtQueryDirectoryFile"));
auto ntstat = NtQueryDirectoryFile(dir.native_handle().h, nullptr, nullptr, nullptr, &isb, buffer, max_bytes,
static_cast<FILE_INFORMATION_CLASS>(2), FALSE, nullptr, TRUE);
auto ntstat2 = NtQueryDirectoryFile(dir.native_handle().h, nullptr, nullptr, nullptr, &isb, buffer, max_bytes,
static_cast<FILE_INFORMATION_CLASS>(2), TRUE, nullptr, FALSE);
The result was the same as in llfio, both ntstats are 0 and the contents of buffers are the same. So I am quite sure that it is not a bug in llfio (unless the handles get created wrong).
For a different directory, I got the expected STATUS_NO_MORE_FILES from the second call.
It looks like this part of the documentation just isn't correct:
Assuming that at least one matching directory entry is found, the number of entries for which information is returned is the *smallest of the following:
One entry, if ReturnSingleEntry is TRUE and FileName is NULL. The number of entries that match the FileName string, if FileName is not NULL. (Note that if the string contains no wildcards, there can be at most one matching entry.) The number of entries whose information fits into the specified buffer. The number of entries contained in the directory.
Is it reasonable to expect to always be able fetch all directory entries at once? NTFS supports up to UINT_MAX in a folder, right? You wouldn't be able to read that much with llfio's approach anyway.
Ok, thanks very much for the additional context. You are definitely 100% sure you are compiling a 64 bit program and not a 32 bit program? I ask because the last time I saw this, it was due to the 32-to-64 bit translation layer being borked.
In any case, a solution will be needed which emits that the entries read is not an atomic snapshot. I'll put on my thinking cap.
Yes, it is a 64-bit exe:
PE signature found
File Type: EXECUTABLE IMAGE
FILE HEADER VALUES
8664 machine (x64)
8 number of sections
66A7B04C time date stamp Mon Jul 29 17:07:56 2024
0 file pointer to symbol table
0 number of symbols
F0 size of optional header
22 characteristics
Executable
Application can handle large (>2GB) addresses
Good luck with thinking! (^^) Let me know if I can help in any way.
Can I prevail on your to test this commit? I don't have access to Windows. Well, strictly speaking, I currently only have access to ARM Windows, not Intel Windows, and I'm not sure if the behaviour will be the same.
Alright, I'll try it out today. 👍
Tested! Unfortunately, the problem persists.
The main change in read seems to be this:
if((NTSTATUS) 0x80000006l /*STATUS_NO_MORE_FILES*/ == ntstat || ntstat < 0 || (buffer_->FileNameLength == 0 && buffer_->NextEntryOffset == 0)), with || (buffer_->FileNameLength == 0 && buffer_->NextEntryOffset == 0)) being the new addition.
On my test folder, this if condition becomes true when (NTSTATUS) 0x80000006l /*STATUS_NO_MORE_FILES*/ == ntstat. So the new code in the if condition is not triggered. The output buffer contains all filenames at this point.
Then in this part of code, ntstat is 0 and we go to retry again.
if(ntstat != (NTSTATUS) 0x80000006 /*STATUS_NO_MORE_FILES*/)
{
// The directory grew between first enumeration and second
LLFIO_DEADLINE_TO_TIMEOUT_LOOP(d);
goto retry;
}
Ok, thanks for testing.
What I was hoping for was that the first enumeration dragged everything into kernel cache, so the second enumeration could do an atomic snapshot. If the kernel API really is truly borked, then the additional stuff I added into the header will be needed after all.
I should have another commit for you later today.
Ok give that a try now. I now accept no more fill in the second API call as a terminating condition. Make sure it does actually fill everything in your directory.
OK, I was a bit confused before, I wasn't exactly right when I said everything was filled.
_bufferfrom line 337 contains all filenames.bufferfrom line 388 contains filenames up tost15.ogg_bufferfrom line 404 containsst16.ogg
I looked at the wrong buffer when I said everything gets filled in, sorry.
The issue persists on the latest commit. In this part, status is 0, buffer_->FileNameLength is 16 and buffer_->NextEntryOffset is 0.
if(ntstat != (NTSTATUS) 0x80000006 /*STATUS_NO_MORE_FILES*/ && !(buffer_->FileNameLength == 0 && buffer_->NextEntryOffset == 0))
{
LLFIO_DEADLINE_TO_TIMEOUT_LOOP(d);
goto retry;
}
I took a punt at writing a partially filling refactor, but it is failing unit tests so I clearly got something wrong and I'll need to use Windows to fix it.
I am travelling next few days, so it may be a while before I can look into this again. If it's urgent, feel free to tinker with the code, it should be fairly self explanatory what I was attempting.
Alright, I can try playing with it if I find some time!
In the latest commit, reading doesn't get stuck anymore but I started getting STATUS_ACCESS_VIOLATION randomly at line 410. : (
"The instruction at 0x%p referenced memory at 0x%p. The memory could not be %s."
It must be something wrong with:
if(done)
{
auto whattoadd = (7 + offsetof(what_to_enumerate_type, FileName) + ffdi->FileNameLength) & ~7;
buffer_ = reinterpret_cast<what_to_enumerate_type *>(reinterpret_cast<uintptr_t>(ffdi) + whattoadd);
bytes -= whattoadd;
}
But I see nothing obvious.
if(done)
{
auto whattoadd = (7 + offsetof(what_to_enumerate_type, FileName) + ffdi->FileNameLength) & ~7;
auto new_buffer_start = reinterpret_cast<what_to_enumerate_type *>(reinterpret_cast<uintptr_t>(ffdi) + whattoadd);
bytes -= reinterpret_cast<uintptr_t>(new_buffer_start) - reinterpret_cast<uintptr_t>(buffer_);
buffer_ = new_buffer_start;
}
It took me 2 hours to spot this. :^)
I made a PR with this change.
Many thanks for the PR, which is now merged.