fusion icon indicating copy to clipboard operation
fusion copied to clipboard

add filewalks

Open timotheecour opened this issue 5 years ago • 14 comments
trafficstars

migrated from https://github.com/nim-lang/Nim/pull/15598 (refs https://github.com/nim-lang/Nim/pull/15598#issuecomment-714356421)

  • works at CT pending https://github.com/nim-lang/Nim/pull/15705, hence the NimPatch bump to facilitate this (and yes, people care about walking at CT, eg see recent https://forum.nim-lang.org/t/6984)

links

  • https://stackoverflow.com/questions/466521/how-many-files-can-i-put-in-a-directory/466596#466596

timotheecour avatar Oct 24 '20 06:10 timotheecour

But you already have globs on stdlib, I think we should stop adding 10 walks to Nim, try to fix stdlib globs instead please and do not add even more, DRY, ideally we should have 2 or 3 walks max but that work properly, even if it needs to have when defined(): blocks.

Code looks Ok but still...

juancarlospaco avatar Oct 27 '20 14:10 juancarlospaco

@juancarlospaco no, you missed the discussion, addition of globs to the stdlib was rejected because for this module it makes more sense to put it into Fusion. See https://github.com/nim-lang/Nim/pull/15598

Yardanico avatar Oct 27 '20 15:10 Yardanico

@Yardanico https://nim-lang.github.io/Nim/globs.html

juancarlospaco avatar Oct 27 '20 15:10 juancarlospaco

@juancarlospaco seems like a bug though, it wasn't merged. You can see that it's not in https://github.com/nim-lang/Nim/tree/devel/lib/std

Yardanico avatar Oct 27 '20 15:10 Yardanico

@Yardanico nim-lang.github.io/Nim/globs.html

this is from lib/std/private/globs.nim which I had added in https://github.com/nim-lang/Nim/pull/14501 for kochdocs, it's in private hence not intended for public use, just like since, etc. nim docs should make that clear (still useful to show docs for those, just like we have docs for compiler).

But you already have globs on stdlib, I think we should stop adding 10 walks to Nim,

since it's not public and I started relying on it from outside (in fusion in https://github.com/nim-lang/fusion/pull/24 do add doc generation to fusion) as well as for future intended use, it was time to make it a proper module, hence my original attempt to add it to stdlib here https://github.com/nim-lang/Nim/pull/15598. Note that making it a separate nimble project (or using pkg/globs) would not allow using it in either kochdocs nor from within fusion.

I wrote this because the existing walkDirs, walkDir, walkDirRec are not flexible enough, in particular walkDirRec doesn't allow efficiently traversing large directory structures with a filter (a very common use case), and lack important options commonly found in other libraries or tools (eg linux find program) and its API is lacking (eg yielded paths don't specify their kind). It would also allow simplifying code and avoiding common pitfalls when re-implementing such functionality in other tools eg see nimgrep PR https://github.com/nim-lang/Nim/pull/15612#discussion_r509826384

See also design discussion and comparison against pkg/globs in https://github.com/nim-lang/Nim/pull/15598, it was designed with pkg/globs in hindsight.

walkFiles is designed to be both simple to use yet general and flexible enough to fit most use cases thanks to optional callbacks.

timotheecour avatar Oct 27 '20 17:10 timotheecour

ping @Araq ; generally useful and would help with things like nimgrep (eg https://github.com/nim-lang/Nim/pull/15962#issuecomment-727086184) which now has to change to slightly slower closure iterators to avoid code bloat from multiple yields

as well as in other places, eg would avoid the ugly import std/private/globs from within fusion which i did as a least worst option in https://github.com/nim-lang/fusion/pull/24

timotheecour avatar Nov 13 '20 23:11 timotheecour

LGTM :+1:

juancarlospaco avatar Dec 10 '20 00:12 juancarlospaco

PTAL, rebased for conflict bitrot

timotheecour avatar Jan 15 '21 20:01 timotheecour

replying to @araq https://github.com/nim-lang/Nim/pull/16709#issuecomment-762416829

What I dislike about your walkPathsOpt is the callbacks, iterators that require callbacks are alien.

there is precedent in stdlib, eg sequtils.filter: iterator filter*[T](s: openArray[T], pred: proc(x: T): bool {.closure.}): T =, and I don't see what's wrong with callbacks in iterators.

Filtering should always be an if inside the iteration

that's not what follow: FollowCallback is about. follow controls whether to recurse down a directory, not whether to yield an entry. You cannot emulate this in user code in iterator body, eg:

for e in walkPaths(src):
  if e.path.lastPathPart != ".git": doSomeWork(e)

this would be very inefficient as it'd recurse down ".git" only to be filtered out later by the iterator body.

SortCmpCallback also cannot be emulated in user code in the iteration body, nor can the BFS vs DFS iteration order.

I intentionally did not include a yieldFilter because leaf-level filtering should be in iteration body, see this comment in the PR:

* a yieldFilter, regex match etc isn't needed because caller can filter at
  call site, without loss of generality, unlike `follow`; this simplifies the API.

, else you might as well make the iterator a proc that takes callbacks.

that would give a crippled, foreign-looking API, that doesn't interoperate with stdlib, eg: with this PR, this works:

import std/[sequtils,sugar]
let s1 = toSeq(walkPaths(dir))
let s2 = collect: (for e in walkPaths(dir): e.path)

with a proc, this wouldn't work. Likewise, iterator allows the usual break/continue/return/yield control flow, a proc doesn't (https://github.com/nim-lang/Nim/pull/15655 is a separate topic).

Also, the new better walkPathsOpt should be in a module of its own.

that's already the case in this PR (and was also the case in the stdlib version of this PR https://github.com/nim-lang/Nim/pull/15598)

timotheecour avatar Jan 18 '21 20:01 timotheecour

So... how this is going ?.

juancarlospaco avatar Apr 29 '21 21:04 juancarlospaco

So... how this is going ?.

@Araq this should be moved back to a PR against stdlib (as was intially the case refs https://github.com/nim-lang/Nim/pull/15598). This is 100% stdlib territory, general purpose enough, and would benefit existing code in tools/ and elsewhere that reimplement this (poorly, with duplicate yield and worse API)

The main thing missing in this API is resumable error handling (via a callback to handle IO errors), but this can be done in followup work.

timotheecour avatar Apr 29 '21 22:04 timotheecour

The API is alien, we use named parameters instead of the "object builder" pattern. It's also not clear to me why "sorting" is important, looking at https://docs.python.org/3/library/glob.html for a comparison I see none of these shenanigans. Also, ideally new APIs embrace the AbsoluteFile/Dir/etc design, I'm tired of type-unsafe path handling code.

The main thing missing in this API is resumable error handling (via a callback to handle IO errors), but this can be done in followup work.

For me it's the one problem of the old API that is worth solving.

Araq avatar Apr 30 '21 11:04 Araq

just replying on this point for now, will followup with other points later:

It's also not clear to me why "sorting" is important

because it's easy to support efficiently in the iterator (as i did) with O(N) space where N=max entries per dir, impossible to support efficiently in the callee (it'd require O(P) where P = total number of files/dirs listed recursively), and most of all because it's useful; a quick google search reveals people keep searching for this (with bad solutions, involving having to first get all results and then sort)

here's a post from today: https://discord.com/channels/371759389889003530/371759389889003532/844205260049612820

Clonkk[Matrix]
BOT
 — Today at 6:30 AM
Is it possible to order walkDir alphabetically ?
Vindaar — Today at 6:32 AM
I suppose no, because it's probably yields by inode number or something like this. I guess you will have to store the files in a temp seq and sort that instead

(although that person talks about walkDir, where caller can easily solve this with a toSeq.sorted, the main problem is recursive listing)

timotheecour avatar May 18 '21 16:05 timotheecour

because it's easy to support efficiently in the iterator (as i did) with O(N) space where N=max entries per dir, impossible to support efficiently in the callee

That's convincing, thanks. However, we could simply always enable it and document that it does sort.

Araq avatar May 19 '21 05:05 Araq