unix
unix copied to clipboard
Expose dirent pointer and added readDirStreamWith
This PR adds two new functions readDirStreamWith and readDirStreamWithPtr:
readDirStreamWithis a version ofreadDirStreamMaybethat takes a callback that is used to obtain the result from the pointer to the directory entry. This directory pointer is wrapped in a newtype calledDirEnt. The function is intended to be used in cases where we know about the features of the installed libc: For example, glibc provides the information about the type of the directory entry directly in the dirent struct. Therefore it can be used directly without calling further functions and additional calls to stat(2). Since those extensions to dirent are not covered by POSIX but widely used, we provide only the possibility to make use of those and do not add this functionality tounix.readDirStreamWithPtrtakes a pre-allocated pointer in addition to the other arguments ofreadDirStreamWith. This pointer is used to store the pointer of the dirent struct (if there is any) which can safe some allocations when you want to read a lot of directory entries as this pre-allocated pointer can be shared between the calls to the function.
In addition to that a small CPP bug was fixed which prevented compilation on my system. Also, some tests for the readDirStream functions were added.
Why not conditionally provide a new high-level Linux/glibc-specific API that returns a directory entry and the file type. The conditional test here would be the availability of the DT_REG and DT_UNKNOWN entry types. The module would then also expose pattern synonyms for the expected types.
I think this would be cleaner/safer than exposing the proposed API. As for avoiding allocation of a dirent structure, with all the attendant syscall overhead, the memory GC cost should be fairly modest.
So I am not entirely enthused by this. Which is not a hard no, but significant scepticism.
Am more in favor of https://hackage.haskell.org/package/hpath-posix-0.13.3/docs/System-Posix-RawFilePath-Directory-Traversals.html#v:readDirEnt
@vdukhovni
Why not conditionally provide a new high-level Linux/glibc-specific API that returns a directory entry and the file type. The conditional test here would be the availability of the
DT_REGandDT_UNKNOWNentry types. The module would then also expose pattern synonyms for the expected types.
It was my understanding that the unix package is about POSIX-compliance and additional features should be implemented in upstream libraries like hpath-path. This patch aims to make it easier for upstream libraries to do that as they don't have to copy/replicate the work done here. If the conditional high-level API you proposed would be accepted as a contribution I'd be more than happy to implement those.
I think this would be cleaner/safer than exposing the proposed API. As for avoiding allocation of a dirent structure, with all the attendant syscall overhead, the memory GC cost should be fairly modest.
Ok, that's fair. I consider readDirStreamWithPtr as optional and I am willing to remove this function if requested
@hasufell
Am more in favor of https://hackage.haskell.org/package/hpath-posix-0.13.3/docs/System-Posix-RawFilePath-Directory-Traversals.html#v:readDirEnt
I am aware of hpath-posix (Great library!) and the purpose of this patch is precisely to make libraries like this one easier to implement. For example, I noticed that the function linked by you is essentially a copy of the readDirStream + d_type. My hope is that this contribution will reduce the need for code duplication like this.
I'm not opposed to this. Can we vote on this @hs-viktor @vdukhovni @Bodigrim
:+1: from me
I'm not opposed to this. Can we vote on this @hs-viktor @vdukhovni @Bodigrim
👍 from me
Which this are you in favour of? As I said, I'd rather not expose the proposed API and instead support the higher-level interface.
I'm not opposed to this. Can we vote on this @hs-viktor @vdukhovni @Bodigrim
👍 from me
Which
thisare you in favour of? As I said, I'd rather not expose the proposed API and instead support the higher-level interface.
The changes in this PR. I'm fine with these low-level interfaces. Alternatively, we can also put them into Internal module and expose it to signal "you're on your own".
If no other maintainer chimes in, then I'll merge this in 2 weeks.
@hs-viktor @vdukhovni just a gentle reminder.
I still mostly stand by my original comment, and perhaps expose this only as an Internal API with no stability guarantees. Maintainers of dependent packages would have to be willing to adapt to future changes...
Why not just support the higher-level (system-specific) API?
I'm fine with exposing this as Internal.
https://github.com/haskell/ghcup-hs/issues/415#issuecomment-1344184716
One more reason to merge this and add tests.
This is way out of my area of expertise, but I'm inclined to agree with Viktor.
It was my understanding that the unix package is about POSIX-compliance and additional features should be implemented in upstream libraries like hpath-path. This patch aims to make it easier for upstream libraries to do that as they don't have to copy/replicate the work done here.
Practically unix is a package for everything which is not Windows. E. g., WASM is not a POSIX-compliant platform, and almost every OS has its own set of quirks violating POSIX.
If the conditional high-level API you proposed would be accepted as a contribution I'd be more than happy to implement those.
I'd accept such contribution.
@mmhat are you interested to work on a high-level API here?
I don't think it's appropriate for this PR to provide the high-level API. We're going to burn out the contributor this way. We should decide whether we merge this PR mostly as-is and expose it as internal API.
This has a :+1: from me.
We're going to burn out the contributor this way.
That's a very sensible concern, but @mmhat originally said that "If the conditional high-level API you proposed would be accepted as a contribution I'd be more than happy to implement those". It seems a win-win to me.
We're going to burn out the contributor this way.
That's a very sensible concern, but @mmhat originally said that "If the conditional high-level API you proposed would be accepted as a contribution I'd be more than happy to implement those". It seems a win-win to me.
@Bodigrim Luckly I have a bit more time to work on this until the end of the year so I could do that.
Why not conditionally provide a new high-level Linux/glibc-specific API that returns a directory entry and the file type. The conditional test here would be the availability of the
DT_REGandDT_UNKNOWNentry types. The module would then also expose pattern synonyms for the expected types.I think this would be cleaner/safer than exposing the proposed API. As for avoiding allocation of a dirent structure, with all the attendant syscall overhead, the memory GC cost should be fairly modest.
So I am not entirely enthused by this. Which is not a hard no, but significant scepticism.
@Bodigrim @hasufell I added some high-level API as outlined in the comment by @vdukhovni. I also moved readDirStreamWith and readDirStreamPtr to the internal API. Can you have a quick look if this first draft matches your expectations?
I just stumbled over this:
- https://github.com/haskell/ghcup-hs/issues/766
- https://stackoverflow.com/a/29628894
So... DirType will not work as expected on some filesystem. How do we handle that? It is as such ~~unsafe~~ non-portable.
Another reason to not expose the low-level API.
In ghcup I've written a portable version of readDirStream: https://github.com/haskell/ghcup-hs/pull/767/commits/53e324bfee681290c87d467d54142a9c15bba387#diff-2d9015010bcf8511a8be4190d7203f50fea2ce7f5b04d3f2d44bea2b93914ce1R116
This is similar to what other projects have been doing (falling back to 'stat' on 'DT_UNKNOWN'): https://github.com/ggreer/the_silver_searcher/pull/37
Another reason to not expose the low-level API.
I kindly disagree: IMHO, when it comes to directory tree traversal, the functions we provide in this package are already low-level. The user needs to understand concepts like a directory stream and we throw a bunch of low-level technical terms at them in the Haddocks (like dirent, d_name, ...). I think it is reasonable to assume that they are familiar with all that stuff in general and that exposing DirType does not add another level of "low-leveliness" that should be hidden.
To resolve this issue with unsupported filesystems I see the following options:
- We decide that exposing
DirTypecomes indeed with too much limitations and we do not want to support it in this library. Hence this contribution here is rejected. - We expose the
readDirStreamWithTypefunctions andDirTypeas part of an internal API. That would probably mean to create modules likeSystem.Posix.Directory.Internals.ByteStringand move those functions there. - We expose the
readDirStreamWithTypefunctions andDirTypeas part of an public API as it is currently done in this PR.
If we go with (2) or (3) I'd like to add some documentation pointing out what the limitations of DirType are. In addition to that I could add a portable version of readDirStreamWithType like @hasufell did for ghcup-hs or change it to readDirStreamWithType :: Bool -> ... where the Bool indicates whether to use stat on DT_UKNOWN or not. Another option is to return a Maybe DirType where Nothing means DT_UNKNOWN.
Personally, I'd prefer option (3) with readDirStreamWithType as it is. Also, I'd rather add the Bool argument than introduce another (portable) version of readDirStreamWithType.
where the
Boolindicates whether to usestatonDT_UKNOWN
When would someone pass False here? I believe we can expose the non-portable version in a module that's clearly marked non-portable. I think otherwise the unix package really strives to be... well, portable.
No need for a Bool option imo.
where the
Boolindicates whether to usestatonDT_UKNOWNWhen would someone pass
Falsehere?
Good point; I always thought of the readDirStream functions as wrappers around readdir, hence being unopinionated about fallback behavior and leave it up to the user to decide what to do if a filesystem doesn't support certain functionality. So far for the theory, I admit I don't have a use case in mind where somebody would pass False.
There is however a corner case I am not sure how to deal with in a fallback implementation: Suppose the underlying platform does not properly support d_type and some of DT_UNKNOWN, DT_REG, ... are not defined in dirent.h. Then the corresponding constants in this package will be all be -1 due to FP_CHECK_CONSTS in the configure.ac and a fallback implementation like the one linked in https://github.com/haskell/unix/pull/251#issuecomment-1416136882 won't save us as the constructed DirType value is meaningless.
There is however a corner case I am not sure how to deal with in a fallback implementation: Suppose the underlying platform does not properly support d_type and some of DT_UNKNOWN, DT_REG, ... are not defined in dirent.h. Then the corresponding constants in this package will be all be -1 due to FP_CHECK_CONSTS in the configure.ac and a fallback implementation like the one linked in #251 (comment) won't save us as the constructed DirType value is meaningless.
We should be able to check for DT_UNKNOWN or -1 in the stat-fallback, no?
@hasufell I guess I didn't explain my concerns very well; What I meant was: Suppose neither DT_REG nor DT_DIR are defined, hence pattern RegularFileType = DirType (-1) and pattern DirectoryType = DirType (-1). We encounter a regular file in the directory stream and since no known DirType pattern matches we decide to stat it. Since it is a regular file we return RegularFileType to the user. But when the user checks if the entry is a directory, i.e. case dtype of { DirectoryType -> ...; RegularFileType -> ...}, this check will be successful as RegularFileType == DirectoryType.
The (well, obvious) solution to that is to assign different (negative) values to CONST_DT_* if those are missing on the underlying platform. I implemented that and the desired fallback behavior and added a new 'DirStreamWithPath` type that stores the filepath that corresponds to the directory stream. Please let me know what you think of it.
BTW: The CI is currently failing with "Error: All install methods for ghc 8.10.7 failed". I didn't look into it yet, but I suspect https://github.com/haskell/unix/pull/270 will resolve this issue?
@mmhat Seems intermittent, CI is functional again. Please fix a build error in the GHC 8.2 build.
@hasufell @vdukhovni I'll leave this for you to review, I'm absolutely illiterate in this area.
For now I'd be inclined to NOT automatically simulate availability of file types in the directory entry. Instead, just return unknown/unavailable, and let the caller decide whether and how to perform the additional stat.
That's not a high-level API anymore, which is what you were asking for if I remember correctly.
I don't like APIs that are in-between.
Either expose the syscalls with minimal wrapper and all the caveats or provide a high-level API that is as cross platform as it can be.