
directory_handle::read is stuck

Open · Bobini1 opened this issue Jul 26 '24 · 8 comments

I keep running into problems with this function. It doesn't happen for all folders, but for some it keeps retrying the read until the timeout is reached. The folder I'm testing on has 733 files. I'm using static llfio from vcpkg, version 2023-11-06. I based my code on this example: https://github.com/ned14/llfio/blob/develop/example/read_directory.cpp.

Everything goes well here: once a buffer of the right size is allocated, the data gets read into it. ntstat is 0.

ULONG max_bytes, bytes;
buffer = req.kernelbuffer.empty() ? reinterpret_cast<what_to_enumerate_type *>(req.buffers._kernel_buffer.get()) : reinterpret_cast<what_to_enumerate_type *>(req.kernelbuffer.data());
max_bytes = req.kernelbuffer.empty() ? static_cast<ULONG>(req.buffers._kernel_buffer_size) : static_cast<ULONG>(req.kernelbuffer.size());
bytes = std::min(max_bytes, (ULONG) kernelbuffertoallocate);
IO_STATUS_BLOCK isb = make_iostatus();
NTSTATUS ntstat = NtQueryDirectoryFile(_v.h, nullptr, nullptr, nullptr, &isb, buffer, bytes, what_to_enumerate, FALSE, req.glob.empty() ? nullptr : &_glob, TRUE);

However, this part comes next:

{
  alignas(8) char _buffer[4096];
  auto *buffer_ = (what_to_enumerate_type *) _buffer;
  isb = make_iostatus();
  ntstat = NtQueryDirectoryFile(_v.h, nullptr, nullptr, nullptr, &isb, buffer_, sizeof(_buffer), what_to_enumerate, TRUE, req.glob.empty() ? nullptr : &_glob, FALSE);
  if(ntstat != 0x80000006 /*STATUS_NO_MORE_FILES*/)
  {
    // The directory grew between first enumeration and second
    LLFIO_DEADLINE_TO_TIMEOUT_LOOP(d);
    goto retry;
  }
}

ntstat is 0 again... The directory did not grow, the size is constant. But the call does not return STATUS_NO_MORE_FILES. This happens in a loop. ntstat is always 0 for both calls. After the second call, buffer_ contains a single file name: st16.ogg. It is indeed a file from that folder. If I remove it, the buffer contains st17.ogg instead. Both files are present in the original buffer from the first snippet.

What is happening? I don't even care about concurrency for my use case. Can I do anything about it?

My code (simplified):

auto p = llfio::path(directory).value();
auto dirHandle = llfio::directory(p, "").value();
std::vector<llfio::directory_handle::buffer_type> buffer(8000);
llfio::directory_handle::buffers_type entries(buffer);
entries = dirHandle.read({std::move(entries)}).value();
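
For completeness, the example I linked drives this same call in a grow-and-retry loop. My reading of it is roughly the sketch below; the done() check and the re-binding of buffers_type are taken from that example, so treat them as my interpretation rather than gospel, and error handling is omitted:

// Roughly the usage pattern from example/read_directory.cpp (my reading of it);
// not a fix for this issue, just how the outer loop is meant to look.
namespace llfio = LLFIO_V2_NAMESPACE;

auto p = llfio::path(directory).value();
auto dirHandle = llfio::directory(p, "").value();
std::vector<llfio::directory_handle::buffer_type> buffer(64);
llfio::directory_handle::buffers_type entries(buffer);
for(;;)
{
  entries = dirHandle.read({std::move(entries)}).value();
  if(entries.done())
  {
    break;  // the buffer was large enough to hold every entry
  }
  // Not everything fitted: double the buffer and enumerate again
  buffer.resize(buffer.size() * 2);
  entries = llfio::directory_handle::buffers_type(buffer);
}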

Bobini1 · Jul 26 '24 04:07

Perhaps if it's not adding more files, and not returning no-more-files, it should double the buffer and retry?

I suspect it could be a bug in the NT kernel code. Not a lot of non-Microsoft userspace code uses the NT API directly.

ned14 · Jul 26 '24 12:07

> Perhaps if it's not adding more files, and not returning no-more-files, it should double the buffer and retry?

The buffer does not seem to be full. I made it very big on purpose. I've just tried increasing the size from 8'000 to 20'000. It makes no difference.

Bobini1 · Jul 26 '24 12:07

Here is my test folder. Let me know if you can reproduce the issue. https://mega.nz/file/tRc3HBKZ#iIHjyBU3fzdeLfc2hPx9n4qxuH68jKD33Pn6tBUSv20

Bobini1 · Jul 26 '24 12:07

Oh okay.

Just so I fully understand the problem here: for your particular directory of 733 files, NtQueryDirectoryFile, even when fed an oversized buffer, returns the full correct set on the first call, but does not return STATUS_NO_MORE_FILES to say it is done. Instead, the second iteration gives a single file entry, then the third iteration gives another single file entry, and both of those file entries were in the original buffer fill.

If that is correct, I have memories of encountering this before, but the way in which LLFIO calls NtQueryDirectoryFile should not produce this behaviour.

Can I check a few things?

  • Is your kernel 64 bit and your userspace 32 bit?
  • Is your underlying filesystem NTFS?
  • Which exact Windows version are you on?
  • Are you on x86, x64 or ARM?
  • Did you open the handle overlapped or do anything which might make it overlapped?

ned14 · Jul 26 '24 14:07

System type: 64-bit operating system, x64-based processor
Edition: Windows 10 Education
Version: 22H2
Installed on: 26/10/2020
OS build: 19045.4651
Experience: Windows Feature Experience Pack 1000.19060.1000.0

The filesystem is NTFS on an M.2 SSD drive. You can see how I opened the handle in the last snippet of my first post. In that snippet, directory is an absolute std::filesystem::path to the directory (including the backslash at the end).

Bobini1 · Jul 26 '24 14:07

Here is my full code: https://github.com/Bobini1/RhythmGame/blob/b3adb12b0dc5656db21a570d26fd9df8bec7d5bc/src/resource_managers/SongDbScanner.cpp

Bobini1 · Jul 26 '24 14:07

Upon further investigation:

  • total_items is 735, despite the folder containing 733 files (edit: the extra two entries are . and ..). The required buffer size calculated by llfio is 83880.
  • The first read actually cuts off at st15.ogg, before st16.ogg and st17.ogg. I was wrong, sorry. So the first call does not get all the data from the directory even though the buffer is big enough.

I invoked this code:

static HMODULE ntdllh = GetModuleHandleA("NTDLL.DLL");
ULONG max_bytes, bytes;
char buffer[180000];
max_bytes = 180000;
IO_STATUS_BLOCK isb{};
memset(&isb, 0, sizeof(isb));
isb.Status = -1;
HANDLE dirHandle;
auto path = llfio::path("C:\\Users\\PC\\Desktop\\2007\\cutwork_lament\\").value();
auto dir = llfio::directory(path, "").value();
auto NtQueryDirectoryFile = reinterpret_cast<NtQueryDirectoryFile_t>(GetProcAddress(ntdllh, "NtQueryDirectoryFile"));
// First call: information class 2 (FileFullDirectoryInformation), ReturnSingleEntry = FALSE, RestartScan = TRUE
auto ntstat = NtQueryDirectoryFile(dir.native_handle().h, nullptr, nullptr, nullptr, &isb, buffer, max_bytes,
                                   static_cast<FILE_INFORMATION_CLASS>(2), FALSE, nullptr, TRUE);
// Second call: ReturnSingleEntry = TRUE, RestartScan = FALSE (continue the scan)
auto ntstat2 = NtQueryDirectoryFile(dir.native_handle().h, nullptr, nullptr, nullptr, &isb, buffer, max_bytes,
                                    static_cast<FILE_INFORMATION_CLASS>(2), TRUE, nullptr, FALSE);

The result was the same as in llfio: both ntstats are 0 and the contents of the buffers are the same. So I am quite sure that it is not a bug in llfio (unless the handles get created wrong).

For a different directory, I got the expected STATUS_NO_MORE_FILES from the second call.

It looks like this part of the documentation just isn't correct:

> Assuming that at least one matching directory entry is found, the number of entries for which information is returned is the smallest of the following:
>
>   • One entry, if ReturnSingleEntry is TRUE and FileName is NULL.
>   • The number of entries that match the FileName string, if FileName is not NULL. (Note that if the string contains no wildcards, there can be at most one matching entry.)
>   • The number of entries whose information fits into the specified buffer.
>   • The number of entries contained in the directory.
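
For reference, here is the loop shape that contract implies, as a minimal sketch using the same raw calls as my reproduction above. The function-pointer type and the FILE_FULL_DIR_INFORMATION layout (information class 2) are written out here as assumptions rather than taken from any header, and error handling is omitted:

#include <windows.h>
#include <winternl.h>
#include <cwchar>

// ntdll prototype, resolved at runtime as in the reproduction above
using NtQueryDirectoryFile_t = NTSTATUS(NTAPI *)(HANDLE, HANDLE, PVOID, PVOID, PIO_STATUS_BLOCK, PVOID, ULONG, FILE_INFORMATION_CLASS, BOOLEAN, PUNICODE_STRING, BOOLEAN);

// DDK layout for information class 2 (FileFullDirectoryInformation);
// trailing underscore to avoid clashing with any SDK definition
struct FILE_FULL_DIR_INFORMATION_
{
  ULONG NextEntryOffset;
  ULONG FileIndex;
  LARGE_INTEGER CreationTime, LastAccessTime, LastWriteTime, ChangeTime;
  LARGE_INTEGER EndOfFile, AllocationSize;
  ULONG FileAttributes;
  ULONG FileNameLength;  // in bytes
  ULONG EaSize;
  WCHAR FileName[1];
};

void list_directory(HANDLE dir)  // handle opened with FILE_LIST_DIRECTORY
{
  auto NtQueryDirectoryFile_ = reinterpret_cast<NtQueryDirectoryFile_t>(GetProcAddress(GetModuleHandleA("NTDLL.DLL"), "NtQueryDirectoryFile"));
  alignas(8) static char buffer[65536];
  for(;;)
  {
    IO_STATUS_BLOCK isb{};
    // RestartScan = FALSE: keep continuing the scan (a fresh handle starts at the beginning anyway)
    NTSTATUS ntstat = NtQueryDirectoryFile_(dir, nullptr, nullptr, nullptr, &isb, buffer, sizeof(buffer), static_cast<FILE_INFORMATION_CLASS>(2), FALSE, nullptr, FALSE);
    if((NTSTATUS) 0x80000006L /*STATUS_NO_MORE_FILES*/ == ntstat || ntstat < 0)
    {
      break;  // enumeration finished (or failed)
    }
    // Walk the NextEntryOffset chain within this fill
    for(auto *e = reinterpret_cast<FILE_FULL_DIR_INFORMATION_ *>(buffer);;)
    {
      wprintf(L"%.*s\n", (int) (e->FileNameLength / sizeof(WCHAR)), e->FileName);
      if(e->NextEntryOffset == 0)
      {
        break;
      }
      e = reinterpret_cast<FILE_FULL_DIR_INFORMATION_ *>(reinterpret_cast<char *>(e) + e->NextEntryOffset);
    }
  }
}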

Bobini1 · Jul 26 '24 23:07

Is it reasonable to expect to always be able to fetch all directory entries at once? NTFS supports up to UINT_MAX files in a folder, right? You wouldn't be able to read that much with llfio's approach anyway.

Bobini1 · Jul 27 '24 09:07

Ok, thanks very much for the additional context. You are definitely 100% sure you are compiling a 64 bit program and not a 32 bit program? I ask because the last time I saw this, it was due to the 32-to-64 bit translation layer being borked.

In any case, a solution will be needed which indicates that the entries read are not an atomic snapshot. I'll put on my thinking cap.

ned14 · Jul 29 '24 13:07

Yes, it is a 64-bit exe:

PE signature found

File Type: EXECUTABLE IMAGE

FILE HEADER VALUES
            8664 machine (x64)
               8 number of sections
        66A7B04C time date stamp Mon Jul 29 17:07:56 2024
               0 file pointer to symbol table
               0 number of symbols
              F0 size of optional header
              22 characteristics
                   Executable
                   Application can handle large (>2GB) addresses

Good luck with thinking! (^^) Let me know if I can help in any way.

Bobini1 · Jul 29 '24 16:07

Can I prevail on you to test this commit? I don't have access to Windows. Well, strictly speaking, I currently only have access to ARM Windows, not Intel Windows, and I'm not sure if the behaviour will be the same.

ned14 · Jul 30 '24 07:07

Alright, I'll try it out today. 👍

Bobini1 · Jul 30 '24 08:07

Tested! Unfortunately, the problem persists.

The main change in read seems to be this: if((NTSTATUS) 0x80000006l /*STATUS_NO_MORE_FILES*/ == ntstat || ntstat < 0 || (buffer_->FileNameLength == 0 && buffer_->NextEntryOffset == 0)), with (buffer_->FileNameLength == 0 && buffer_->NextEntryOffset == 0) being the new addition.

On my test folder, this if condition becomes true when (NTSTATUS) 0x80000006l /*STATUS_NO_MORE_FILES*/ == ntstat. So the new code in the if condition is not triggered. The output buffer contains all filenames at this point.

Then in this part of the code, ntstat is 0 and we go to retry again:

if(ntstat != (NTSTATUS) 0x80000006 /*STATUS_NO_MORE_FILES*/)
{
  // The directory grew between first enumeration and second
  LLFIO_DEADLINE_TO_TIMEOUT_LOOP(d);
  goto retry;
}

Bobini1 · Jul 30 '24 12:07

Ok, thanks for testing.

What I was hoping for was that the first enumeration dragged everything into kernel cache, so the second enumeration could do an atomic snapshot. If the kernel API really is truly borked, then the additional stuff I added into the header will be needed after all.

I should have another commit for you later today.

ned14 · Jul 30 '24 12:07

Ok give that a try now. I now accept no more fill in the second API call as a terminating condition. Make sure it does actually fill everything in your directory.

ned14 · Jul 30 '24 14:07

OK, I was a bit confused before; I wasn't exactly right when I said everything was filled.

  • _buffer from line 337 contains all filenames.
  • buffer from line 388 contains filenames up to st15.ogg
  • _buffer from line 404 contains st16.ogg

I looked at the wrong buffer when I said everything gets filled in, sorry. The issue persists on the latest commit. In this part, ntstat is 0, buffer_->FileNameLength is 16 and buffer_->NextEntryOffset is 0:

if(ntstat != (NTSTATUS) 0x80000006 /*STATUS_NO_MORE_FILES*/ && !(buffer_->FileNameLength == 0 && buffer_->NextEntryOffset == 0))
{
  LLFIO_DEADLINE_TO_TIMEOUT_LOOP(d);
  goto retry;
}

Bobini1 · Jul 30 '24 19:07

I took a punt at writing a partially filling refactor, but it is failing unit tests so I clearly got something wrong and I'll need to use Windows to fix it.

I am travelling for the next few days, so it may be a while before I can look into this again. If it's urgent, feel free to tinker with the code; it should be fairly self-explanatory what I was attempting.

ned14 · Jul 30 '24 22:07

Alright, I can try playing with it if I find some time! With the latest commit, reading doesn't get stuck anymore, but I started getting STATUS_ACCESS_VIOLATION randomly at line 410. :(

"The instruction at 0x%p referenced memory at 0x%p. The memory could not be %s."

Bobini1 · Jul 30 '24 22:07

It must be something wrong with:

    if(done)
    {
      auto whattoadd = (7 + offsetof(what_to_enumerate_type, FileName) + ffdi->FileNameLength) & ~7;
      buffer_ = reinterpret_cast<what_to_enumerate_type *>(reinterpret_cast<uintptr_t>(ffdi) + whattoadd);
      bytes -= whattoadd;
    }

But I see nothing obvious.

ned14 · Jul 30 '24 22:07

if(done)
{
  auto whattoadd = (7 + offsetof(what_to_enumerate_type, FileName) + ffdi->FileNameLength) & ~7;
  auto new_buffer_start = reinterpret_cast<what_to_enumerate_type *>(reinterpret_cast<uintptr_t>(ffdi) + whattoadd);
  bytes -= reinterpret_cast<uintptr_t>(new_buffer_start) - reinterpret_cast<uintptr_t>(buffer_);
  buffer_ = new_buffer_start;
}

It took me 2 hours to spot this. :^)
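
For anyone reading along, here is a concrete illustration of why the original bytes -= whattoadd under-counts (the offsets are made up, and this is my reading of the refactor):

// Suppose buffer_ points at offset 0 of the fill buffer, and this iteration
// consumed several entries, so ffdi (the last entry) sits at offset 800 and
// its aligned size whattoadd is 96.
//   new_buffer_start = ffdi + whattoadd                  -> offset 896
//   space actually consumed = new_buffer_start - buffer_ =  896 bytes
// The old code only did bytes -= 96, leaving bytes 800 bytes too large, so a
// later fill could write past the end of the buffer, which would explain the
// random STATUS_ACCESS_VIOLATION.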

Bobini1 · Jul 31 '24 03:07

I made a PR with this change.

Bobini1 · Jul 31 '24 03:07

Many thanks for the PR, which is now merged.

ned14 · Aug 13 '24 11:08