conduit icon indicating copy to clipboard operation
conduit copied to clipboard

sourceDirectoryDeep False could be much faster on Linux for large directories

Open nh2 opened this issue 5 years ago • 20 comments

Just writing this down somewhere, with a suggestion for the solution:

sourceDirectoryDeep unnecessarily runs the stat() syscall on every file, thus making it slow on large directories or directory trees. For example, it takes 30 minutes on my server to run it on a directory tree with 8M files. find does it in 30 seconds. sourceDirectory does not have this problem, but cannot traverse into subdirs. On Linux, the problem can be observed with strace.

If it didn't stat(), only using getdents64() to list the directory contents (which retrieves multiple file names in a batch), it would be at least 60 times faster.

The stat() happens in the call to getFileType here:

https://github.com/snoyberg/conduit/blob/0bf72ae4c5a765517ef69df650456c11b1e8fbf1/conduit/src/Data/Conduit/Combinators.hs#L737

It does that to distinguish normal files from symlinks to implement the followSymlinks :: Bool feature.

However the find UNIX utility also doesn't stat() (unless the -L option is given to follow symlinks); yet it can traverse into subdirectories. How does it do it?

First the call graph:

Now man 3 readdir shows that for one readdir() call, the following information about a single directory entry dirent is returned (behind the scenes, the libc usually requests multiple -- usually 32 KiB -- of those in a batch via the getdents() syscall to be efficient):

struct dirent {
    ...
    unsigned char  d_type;      /* Type of file; not supported
                                   by all filesystem types */
    ...
};

The char d_type already tells us if the directory entry is a subdirectory or a normal file. We do not need to call stat() (via getFileType) to determine that. We only need to do that when we want to know if something is a symlink.

That means sourceDirectoryDeep could be optimised to work like find: Only if followSymlinks = True do we need to getFileType.

Unfortunately, streaming-commons's readDirStream is too simplistic for this purpose: It only returns Maybe FilePath, not propagating whether that FilePath is a directory.

This means that either sourceDirectoryDeep cannot use streaming-commons for that purpose, or streaming-commons needs to add a function that forwards that information when present.

CC @Gabriel439 as you may also have callers to readDirStream. However, I checked some pipes packages, and found that pipes-files by @jwiegley has functionality to skip the stat() mentioned in the Performance section, and pipes-filesystem by @jonathanknowles also has a Performance section that addresses the exact problem I'm mentioning here.

Other new IO packages also have addressed this issue: stdio here and Z-IO here.

Those also have the benefit of not using type FilePath = String, but ByteString based/equivalent IO, which is naturally much lighter on the CPU.

Another small complication is that this works only on Linux; a POSIX dirent does not have the file type, and some more exotic Linux file systems may also not retrieve this information. This can be addressed as the man 3 readdir explains for d_type, checking for DT_UNKNOWN and doing a stat() if encountered:

DT_UNKNOWN  The file type could not be determined.

Currently,  only some filesystems (among them: Btrfs, ext2, ext3, and ext4) have full support for returning the file type in d_type.
All applications must properly handle a return of DT_UNKNOWN.

An implementation of that can be seen in the posix-paths package that I wrote with @JohnLato in 2013:

        isDir <- case () of
            () | typ == dtDir     -> return True
               | typ == dtUnknown -> isDirectory <$> getFileStatus path
               | otherwise        -> return False

@snoyberg Do you have a preference if/how this should be done? Should we make streaming-commons smarter by adding a function there?


Side note: One can be even faster by:

  • not using the default 32 KiB buffer that readdir() gives to getdents(), but using a larger, or configurable buffer. That is described in the old post http://be-n.com/spw/you-can-list-a-million-files-in-a-directory-but-not-with-ls.html. Note that post is slightly confusing because:

    • It talks about 32 K entries, when it's really Bytes.
    • It claims that "16416 system calls of getdents()" is somehow much or a problem ("That is a lot of calls, especially on a slow virtualized disk"), but ~10k syscalls it not a lot and should rarely be a problem.
    • It is generally not a problem to list directories with 100s of millions of files using find or ls -1U.

    The mentioned optimisation (of giving getdents64() larger buffers) is mainly useful for networked file systems though, where larger batching reduces roundtrips further. (For medium-long paths one 32 KiB getdents() fits >= 500 files, by the way.)

  • using ByteStrings to save CPU.

nh2 avatar Feb 19 '21 21:02 nh2

I just opened https://github.com/haskell/unix/issues/345 without having checked here first.

sjshuck avatar Jun 29 '25 20:06 sjshuck

Okay, I think I have a pretty good idea of where we're at.

  1. streaming-commons's Data.Streaming.Filesystem. I agree it's not powerful enough. See directory-ospath-streaming's System.Directory.OsPath.Streaming here, which has a nearly identical interface, except: a. implements fast recursive directory walking, courtesy d_type, b. exports a slightly different datatype representing file types, and c. does not use FilePaths but rather OsPaths which are ShortByteStrings under the hood. Thanks @Bodigrim for the tip.
  2. d_type not existing on all platforms. I went through all platforms GHC can be downloaded for and looked at their libc's dirent.h as well as musl's. All define d_type despite that field not being specified by POSIX, as you pointed out @nh2. So in practice I think we can assume it exists.
  3. d_type always equalling DT_UNKNOWN for some filesystems. Fine, fall back on stat().
  4. d_type access not being part of a stable API. The note in the unix module states "internal module, no PVP guarantees". Normally I would worry about this, but it doesn't strike me as internal in character. Maybe they're cautious because d_type is technically non-standard?

The more that I look at directory-ospath-streaming, the more I think it would be a good replacement for the guts of streaming-commons Data.Streaming.Filesystem. I don't think it requires any additional transitive dependencies besides atomic-counter. streaming-commons itself could retain the same exact public API, but benefit from reduced maintenance burden of a smaller code base as well as expose some additional function like

readDirStreamTyped :: DirStream -> IO (Maybe (FilePath, FileType))

that conduit could use for an optimized sourceDirectoryDeep.

And I think relying on unix's System.Posix.Directory.Internals is a small risk/cost for a potentially big reward.

sjshuck avatar Jul 01 '25 02:07 sjshuck

On further inspection I think that library is different enough from this one that it wouldn't be a good choice. The fundamental datatype of both libraries is DirStream, and they are incompatible. streaming-commons re-exports unix's, which is a newtype wrapping a Ptr CDir. directory-ospath-streaming's is a record that contains housekeeping to allow the GC to auto-close it, which is fundamentally at odds with conduit's philosophy of prompt resource deallocation.

I'll look into making streaming-commons smarter.

sjshuck avatar Jul 01 '25 17:07 sjshuck

@sjshuck GHCup has its own in-house sourceDirectoryDeep variant (called sourceDirectoryDeep') that I think is essentially what you are looking for. You can see that instead of readDirStream used by sourceDirectoryDeep, sourceDirectoryDeep' instead makes use of readDirEntPortable, also defined within GHCup. That one returns IO (DirType, FilePath), thereby avoiding the problem arising from the use of readDirStream mentioned in the original post in this thread:

Unfortunately, streaming-commons's readDirStream is too simplistic for this purpose: It only returns Maybe FilePath, not propagating whether that FilePath is a directory.

All in all, I think the full solution is there, so it is "just" a matter of extracting the code to where they belong (whether it be conduit, streaming-commons, or someplace else). For the last few weeks I have had the idea of tackling this in the back of my mind, but couldn't really get to it; so maybe it will be useful to you.

ozkutuk avatar Jul 01 '25 20:07 ozkutuk

I'll look into making streaming-commons smarter.

MInd you that directory-ospath-streaming was born out of https://github.com/fpco/streaming-commons/pull/75

directory-ospath-streaming's is a record that contains housekeeping to allow the GC to auto-close it, which is fundamentally at odds with conduit's philosophy of prompt resource deallocation.

I don't think so, directory-ospath-streaming makes every effort to deallocate resources promptly. The weak finalizer is only for exceptional cases, as the last resort. It's pretty much the same as a usual Handle works: the underlying MVar Handle__ has a weak finalizer.

Bodigrim avatar Jul 01 '25 20:07 Bodigrim

sourceDirectoryDeep'

Yep, I think that's the goal. Thanks @ozkutuk.

makes every effort to deallocate resources promptly

I see that now. I guess I saw the lazy IO [a] and the finalizers and assumed.

Partly due to not having nearly the appreciation for performance issues that @sergv has and partly because of hitting my own roadblock with OsString-related package version issues today, I'd like to limit the scope of my work to just reducing stat() calls and changing the APIs of ~streaming-commons and~ conduit as little as possible. I suspect that'd be the bulk of the speed improvements anyway.

CORRECTION: just noticed streaming-commons is no longer a dependency of conduit. The former's code relevant to this discussion looks copy-pasted into the latter.

sjshuck avatar Jul 01 '25 23:07 sjshuck

@sjshuck I was mostly interested in having streaming-commons-compatible interface that works with OsPaths. 8 million files were not a target. Apparently to support them one has to make getdents syscall by hand which has not been done.

All optimizations described by OP, except getdents syscall, are more or less present in the directory-ospath-streaming package mainly thanks to advances in the unix package which provides all the necessary interfaces.

sergv avatar Jul 02 '25 10:07 sergv

8 million files were not a target. Apparently to support them one has to make getdents syscall by hand which has not been done.

Note that larger getdents buffers are not needed to list larger directories. I've updated the issue description to include some more detail:

  • It claims that "16416 system calls of getdents()" is somehow much or a problem ("That is a lot of calls, especially on a slow virtualized disk"), but ~10k syscalls it not a lot and should rarely be a problem.
  • It is generally not a problem to list directories with 100s of millions of files using find or ls -1U.

nh2 avatar Jul 02 '25 17:07 nh2

https://github.com/sjshuck/conduit/commit/2c401a5c9d34d43529369ad0a90dea3221d56851 This commit uses d_type to eliminate as few calls to stat() and lstat() as possible, and resulted in absolutely no performance improvement. Debugging shows me I'm always getting d_type == DT_UNKNOWN for it. 😢 btrfs, Linux 6.15.3. Probably my fault.

sjshuck avatar Jul 02 '25 19:07 sjshuck

Debugging shows me I'm always getting d_type == DT_UNKNOWN for it. 😢 btrfs, Linux 6.15.3.

Interesting, are you sure that is due to Btrfs? readdir manpage states that Btrfs has full support for d_type.

ozkutuk avatar Jul 02 '25 20:07 ozkutuk

My buddy did a test in C on his machine and it's fine. I'm for sure doing some dumb with Ptrs; wouldn't be the first time. Although I can't see what it is. This is the callback being passed to readDirStreamWith. Oddly, the left side of the tuple fetches the correct file name; all the tests pass. It's only the right side that when printed always says

DirType 0

sjshuck avatar Jul 02 '25 20:07 sjshuck

I have a feeling that this might be due to laziness. By the time the components of the (FilePath, Maybe FileType) tuple is forced, the underlying dirent struct is already gone (probably). You can try forcing the components of the tuple within peekFpAndType. Something like this should get the job done:

peekFpAndType :: I.DirEnt -> IO (FilePath, Maybe FileType)
peekFpAndType de = do
  !fp <- I.dirEntName de >>= I.peekFilePath
  !ft <- dTypeToFileType <$> I.dirEntType de
  pure (fp, ft)

ozkutuk avatar Jul 02 '25 20:07 ozkutuk

I tried that and it didn't work. I could see some post-processing being lazy (like fmap dTypeToFileType) but fundamentally the IO monad has strict >>= so the underlying call to the FFI-imported d_type :: _ -> IO _ will take place before the struct is freed.

sjshuck avatar Jul 02 '25 21:07 sjshuck

Debugging shows me I'm always getting d_type == DT_UNKNOWN

Use strace on find to determine the ultimate truth of what the syscall returns on your filesystem:

mkdir -p testdir && touch testdir/myfile && mkdir -p testdir/mysubdir
strace -fyyve getdents64 find testdir > /dev/null

Example output on ext4:

getdents64(4</home/niklas/tmp/testdir>, [{d_ino=7091943, d_off=4294961534192608376, d_reclen=24, d_type=DT_DIR, d_name="."}, {d_ino=7092838, d_off=5953854646970289146, d_reclen=32, d_type=DT_DIR, d_name="mysubdir"}, {d_ino=7092829, d_off=7828712696275270161, d_reclen=32, d_type=DT_REG, d_name="myfile"}, {d_ino=3810163, d_off=9223372036854775807, d_reclen=24, d_type=DT_DIR, d_name=".."}], 32768) = 112

Note d_type=DT_DIR and d_type=DT_REG in the output.

nh2 avatar Jul 02 '25 21:07 nh2

Mystery solved. Huge credit to @chenxiaolong for diving into C code and autotools.

The following code works. My runtime has gone down by a factor of 4.

import Foreign (plusPtr, peek)

peekFpAndType :: I.DirEnt -> IO (FilePath, Maybe FileType)
peekFpAndType de@(I.DirEnt cDirentPtr) = liftA2 (,)
    (I.dirEntName de >>= I.peekFilePath)
    (fmap (dTypeToFileType . I.DirType) $ peek $ cDirentPtr `plusPtr` 18)

I am not relying on the dirEntType getter from System.Posix.Directory.Internals, but instead am hard-coding the struct field offset. When I use dirEntType, it's always returning 0, because the #else is firing in this code: https://github.com/haskell/unix/blob/eced23b356ed00510a7e6735b3f093f0c7f94081/cbits/HsUnix.c#L109-L113

That macro test shouldn't be HAVE_DIRENT_D_TYPE. It should be HAVE_STRUCT_DIRENT_D_TYPE, which is what it is here. See this doc.

I'll open a bug report over there. Meanwhile, how did you get your d_type working @sergv? I'm impressed (and blinded) by all the unboxed IO going on there, but I also see references to DirInternals.dirEntType.

sjshuck avatar Jul 02 '25 22:07 sjshuck

Oh no. Good job finding that bug!

A lot of code on the Internet seems to use HAVE_DIRENT_D_TYPE: https://grep.app/search?q=HAVE_DIRENT_D_TYPE

That includes cpython -- why does it work there? Looks like because they test and define it themselves.

nh2 avatar Jul 02 '25 23:07 nh2

@sjshuck I measured about 5% improvement thanks to reusing dirent struct between the calls and decided that's good enough. I haven't checked whether the accessor always returns 0, I just assumed it would work which seems to have been putting too much faith in it.

sergv avatar Jul 03 '25 10:07 sergv

Off-topic but I don't think

reusing dirent struct between the calls

ever actually occurs. As written the unix library unconditionally allocates a new one on each call. Ironically, if we don't have HAVE_READDIR_R and are forced to use readdir(), we might reuse the struct, since its manpage states

On success, readdir() returns a pointer to a dirent structure. (This structure may be statically allocated; do not attempt to free(3) it.)

sjshuck avatar Jul 03 '25 12:07 sjshuck

@sjshuck Really don't want to derail discussion any further, but there's https://hackage-content.haskell.org/package/unix-2.8.7.0/docs/System-Posix-Directory-Internals.html#v:readDirStreamWithPtr which did produce some benchmark improvements, however slight, for me.

sergv avatar Jul 03 '25 13:07 sergv

unix-2.8.8.0 was released today which fixes dirEntType. Waiting for review of the above PR.

sjshuck avatar Sep 23 '25 11:09 sjshuck