filepath icon indicating copy to clipboard operation
filepath copied to clipboard

Null-terminated PosixString ByteArray# for performance

Open hasufell opened this issue 3 years ago • 2 comments

In GitLab by @alt-romes on Sep 29, 2022, 24:12

Hi!

This is issue is meant to discuss some performance tuning and ideas.

I was recently trying to write a programming using directory and filepath for filesystem manipulation. Something initially in the spirit of du from coreutils.

The problem is the performance is not great.

In writing a recursive directory traversal I must check constantly whether a path is a directory or not to know whether to recurse into it. The function doesDirectoryExist takes 80% of runtime -- which might be right due to having to read the metadata.

I attempted to write a faster fastIsDir by addressing benchmark results.

Here's an attempt:

fastIsDir :: OsPath -> IO Bool
fastIsDir (getPosixString . getOsString -> SBS path) = do
  fp <- mallocForeignPtrBytes (144)
  withForeignPtr fp $ \p -> useAsCString (SBS path) $ \s -> c_stat s p
  return (fileTypeIsDirectory $ fileTypeFromMetadata $ FileStatus fp)

This version ignores errno to go faster, but it clearly shows the use of useAsCString, which copies O(n) the path into a null-terminated bytearray. Since I want to run this on all the files in the filesystem, I think the cost has some impact.

Now, I think that c_stat doesn't need a copy of the string (it's read only, right?), so ideally we would pass path directly to c_stat:

fastIsDir :: OsPath -> IO Bool
fastIsDir (getPosixString . getOsString -> SBS path) = do
  fp <- mallocForeignPtrBytes (144)
  let ptr = Ptr (byteArrayContents# path)
  withForeignPtr fp $ \p -> c_stat ptr p
  return (fileTypeIsDirectory $ fileTypeFromMetadata $ FileStatus fp)

The issue is that path, I believe, is not null-terminated, which makes this code wrong. The only way I see to add a NULL at the end of the string is copying it into a length+1 array and terminate it manually (which is what useAsCString does).

What I want to discuss in this issue is:

Is it possible, since PosixString is opaque up to Internal, to have PosixString be under the hood a null-terminated ByteArray# such that libraries like directory can directly pass it to read-only functions without the need for copying it into the stack to add a null terminator?

I might be able to attempt an implementation, given some pointers.

Thanks, Romes

hasufell avatar Sep 28 '22 16:09 hasufell

In GitLab by @maerwald on Oct 28, 2022, 10:41

I'm not involved in directory package and try to avoid it usually.

You might be interested in:

  • https://github.com/haskell/unix/pull/251
  • https://hackage.haskell.org/package/hpath-posix-0.13.3/docs/System-Posix-RawFilePath-Directory-Traversals.html#v:readDirEnt

Wrt filepaths itself: if you read thousands of files via pinned memory, you will possibly cause very high memory fragmentation. The point of the new type is to avoid memory fragmentation.

See:

  1. https://hasufell.github.io/posts/2022-06-29-fixing-haskell-filepaths.html
  2. https://github.com/hasufell/filepath-debug/blob/master/result.txt

Adding null termination ad-hoc doesn't seem like a good idea for such a specific use case. It will complicate all the code.

I'm also not convinced that you will have better performance with pinned ByteString. Do you have proof?

hasufell avatar Oct 28 '22 02:10 hasufell

In GitLab by @alt-romes on Nov 3, 2022, 20:31

Okay, I'm in agreement, and indeed, it is a very specific use case. The links you provided are helpful and might be a better alternative.

I don't have concrete proof, I thought it might be a cool idea and hence the issue for discussion.

I think meanwhile we can close this and if I take the chance to think about this again I can reopen it.

Thanks!

hasufell avatar Nov 03 '22 12:11 hasufell