filepath
filepath copied to clipboard
Null-terminated PosixString ByteArray# for performance
In GitLab by @alt-romes on Sep 29, 2022, 24:12
Hi!
This is issue is meant to discuss some performance tuning and ideas.
I was recently trying to write a programming using directory and filepath for filesystem manipulation.
Something initially in the spirit of du from coreutils.
The problem is the performance is not great.
In writing a recursive directory traversal I must check constantly whether a path is a directory or not to know whether to recurse into it. The function doesDirectoryExist takes 80% of runtime -- which might be right due to having to read the metadata.
I attempted to write a faster fastIsDir by addressing benchmark results.
Here's an attempt:
fastIsDir :: OsPath -> IO Bool
fastIsDir (getPosixString . getOsString -> SBS path) = do
fp <- mallocForeignPtrBytes (144)
withForeignPtr fp $ \p -> useAsCString (SBS path) $ \s -> c_stat s p
return (fileTypeIsDirectory $ fileTypeFromMetadata $ FileStatus fp)
This version ignores errno to go faster, but it clearly shows the use of useAsCString, which copies O(n) the path into a null-terminated bytearray. Since I want to run this on all the files in the filesystem, I think the cost has some impact.
Now, I think that c_stat doesn't need a copy of the string (it's read only, right?), so ideally we would pass path directly to c_stat:
fastIsDir :: OsPath -> IO Bool
fastIsDir (getPosixString . getOsString -> SBS path) = do
fp <- mallocForeignPtrBytes (144)
let ptr = Ptr (byteArrayContents# path)
withForeignPtr fp $ \p -> c_stat ptr p
return (fileTypeIsDirectory $ fileTypeFromMetadata $ FileStatus fp)
The issue is that path, I believe, is not null-terminated, which makes this code wrong. The only way I see to add a NULL at the end of the string is copying it into a length+1 array and terminate it manually (which is what useAsCString does).
What I want to discuss in this issue is:
Is it possible, since PosixString is opaque up to Internal, to have PosixString be under the hood a null-terminated ByteArray# such that libraries like directory can directly pass it to read-only functions without the need for copying it into the stack to add a null terminator?
I might be able to attempt an implementation, given some pointers.
Thanks, Romes
In GitLab by @maerwald on Oct 28, 2022, 10:41
I'm not involved in directory package and try to avoid it usually.
You might be interested in:
- https://github.com/haskell/unix/pull/251
- https://hackage.haskell.org/package/hpath-posix-0.13.3/docs/System-Posix-RawFilePath-Directory-Traversals.html#v:readDirEnt
Wrt filepaths itself: if you read thousands of files via pinned memory, you will possibly cause very high memory fragmentation. The point of the new type is to avoid memory fragmentation.
See:
- https://hasufell.github.io/posts/2022-06-29-fixing-haskell-filepaths.html
- https://github.com/hasufell/filepath-debug/blob/master/result.txt
Adding null termination ad-hoc doesn't seem like a good idea for such a specific use case. It will complicate all the code.
I'm also not convinced that you will have better performance with pinned ByteString. Do you have proof?
In GitLab by @alt-romes on Nov 3, 2022, 20:31
Okay, I'm in agreement, and indeed, it is a very specific use case. The links you provided are helpful and might be a better alternative.
I don't have concrete proof, I thought it might be a cool idea and hence the issue for discussion.
I think meanwhile we can close this and if I take the chance to think about this again I can reopen it.
Thanks!