fusion
fusion copied to clipboard
add filewalks
migrated from https://github.com/nim-lang/Nim/pull/15598 (refs https://github.com/nim-lang/Nim/pull/15598#issuecomment-714356421)
- works at CT pending https://github.com/nim-lang/Nim/pull/15705, hence the NimPatch bump to facilitate this (and yes, people care about walking at CT, eg see recent https://forum.nim-lang.org/t/6984)
links
- https://stackoverflow.com/questions/466521/how-many-files-can-i-put-in-a-directory/466596#466596
But you already have globs on stdlib, I think we should stop adding 10 walks to Nim,
try to fix stdlib globs instead please and do not add even more, DRY,
ideally we should have 2 or 3 walks max but that work properly, even if it needs to have when defined(): blocks.
Code looks Ok but still...
@juancarlospaco no, you missed the discussion, addition of globs to the stdlib was rejected because for this module it makes more sense to put it into Fusion. See https://github.com/nim-lang/Nim/pull/15598
@Yardanico https://nim-lang.github.io/Nim/globs.html
@juancarlospaco seems like a bug though, it wasn't merged. You can see that it's not in https://github.com/nim-lang/Nim/tree/devel/lib/std
@Yardanico nim-lang.github.io/Nim/globs.html
this is from lib/std/private/globs.nim which I had added in https://github.com/nim-lang/Nim/pull/14501 for kochdocs, it's in private hence not intended for public use, just like since, etc. nim docs should make that clear (still useful to show docs for those, just like we have docs for compiler).
But you already have globs on stdlib, I think we should stop adding 10 walks to Nim,
since it's not public and I started relying on it from outside (in fusion in https://github.com/nim-lang/fusion/pull/24 do add doc generation to fusion) as well as for future intended use, it was time to make it a proper module, hence my original attempt to add it to stdlib here https://github.com/nim-lang/Nim/pull/15598. Note that making it a separate nimble project (or using pkg/globs) would not allow using it in either kochdocs nor from within fusion.
I wrote this because the existing walkDirs, walkDir, walkDirRec are not flexible enough, in particular walkDirRec doesn't allow efficiently traversing large directory structures with a filter (a very common use case), and lack important options commonly found in other libraries or tools (eg linux find program) and its API is lacking (eg yielded paths don't specify their kind). It would also allow simplifying code and avoiding common pitfalls when re-implementing such functionality in other tools eg see nimgrep PR https://github.com/nim-lang/Nim/pull/15612#discussion_r509826384
See also design discussion and comparison against pkg/globs in https://github.com/nim-lang/Nim/pull/15598, it was designed with pkg/globs in hindsight.
walkFiles is designed to be both simple to use yet general and flexible enough to fit most use cases thanks to optional callbacks.
ping @Araq ; generally useful and would help with things like nimgrep (eg https://github.com/nim-lang/Nim/pull/15962#issuecomment-727086184) which now has to change to slightly slower closure iterators to avoid code bloat from multiple yields
as well as in other places, eg would avoid the ugly import std/private/globs from within fusion which i did as a least worst option in https://github.com/nim-lang/fusion/pull/24
LGTM :+1:
PTAL, rebased for conflict bitrot
replying to @araq https://github.com/nim-lang/Nim/pull/16709#issuecomment-762416829
What I dislike about your walkPathsOpt is the callbacks, iterators that require callbacks are alien.
there is precedent in stdlib, eg sequtils.filter: iterator filter*[T](s: openArray[T], pred: proc(x: T): bool {.closure.}): T =, and I don't see what's wrong with callbacks in iterators.
Filtering should always be an if inside the iteration
that's not what follow: FollowCallback is about. follow controls whether to recurse down a directory, not whether to yield an entry. You cannot emulate this in user code in iterator body, eg:
for e in walkPaths(src):
if e.path.lastPathPart != ".git": doSomeWork(e)
this would be very inefficient as it'd recurse down ".git" only to be filtered out later by the iterator body.
SortCmpCallback also cannot be emulated in user code in the iteration body, nor can the BFS vs DFS iteration order.
I intentionally did not include a yieldFilter because leaf-level filtering should be in iteration body, see this comment in the PR:
* a yieldFilter, regex match etc isn't needed because caller can filter at
call site, without loss of generality, unlike `follow`; this simplifies the API.
, else you might as well make the iterator a proc that takes callbacks.
that would give a crippled, foreign-looking API, that doesn't interoperate with stdlib, eg: with this PR, this works:
import std/[sequtils,sugar]
let s1 = toSeq(walkPaths(dir))
let s2 = collect: (for e in walkPaths(dir): e.path)
with a proc, this wouldn't work. Likewise, iterator allows the usual break/continue/return/yield control flow, a proc doesn't (https://github.com/nim-lang/Nim/pull/15655 is a separate topic).
Also, the new better walkPathsOpt should be in a module of its own.
that's already the case in this PR (and was also the case in the stdlib version of this PR https://github.com/nim-lang/Nim/pull/15598)
So... how this is going ?.
So... how this is going ?.
@Araq this should be moved back to a PR against stdlib (as was intially the case refs https://github.com/nim-lang/Nim/pull/15598). This is 100% stdlib territory, general purpose enough, and would benefit existing code in tools/ and elsewhere that reimplement this (poorly, with duplicate yield and worse API)
The main thing missing in this API is resumable error handling (via a callback to handle IO errors), but this can be done in followup work.
The API is alien, we use named parameters instead of the "object builder" pattern. It's also not clear to me why "sorting" is important, looking at https://docs.python.org/3/library/glob.html for a comparison I see none of these shenanigans. Also, ideally new APIs embrace the AbsoluteFile/Dir/etc design, I'm tired of type-unsafe path handling code.
The main thing missing in this API is resumable error handling (via a callback to handle IO errors), but this can be done in followup work.
For me it's the one problem of the old API that is worth solving.
just replying on this point for now, will followup with other points later:
It's also not clear to me why "sorting" is important
because it's easy to support efficiently in the iterator (as i did) with O(N) space where N=max entries per dir, impossible to support efficiently in the callee (it'd require O(P) where P = total number of files/dirs listed recursively), and most of all because it's useful; a quick google search reveals people keep searching for this (with bad solutions, involving having to first get all results and then sort)
here's a post from today: https://discord.com/channels/371759389889003530/371759389889003532/844205260049612820
Clonkk[Matrix]
BOT
— Today at 6:30 AM
Is it possible to order walkDir alphabetically ?
Vindaar — Today at 6:32 AM
I suppose no, because it's probably yields by inode number or something like this. I guess you will have to store the files in a temp seq and sort that instead
(although that person talks about walkDir, where caller can easily solve this with a toSeq.sorted, the main problem is recursive listing)
because it's easy to support efficiently in the iterator (as i did) with O(N) space where N=max entries per dir, impossible to support efficiently in the callee
That's convincing, thanks. However, we could simply always enable it and document that it does sort.