fdir icon indicating copy to clipboard operation
fdir copied to clipboard

Compare to native options

Open vjpr opened this issue 2 years ago • 22 comments

I am interested in how much faster it could be using native code. Via child_process or N-API.

Someone already mentioned Rust's https://github.com/jessegrosjean/jwalk. Some other Rust discussion here. Would be cool to have a napi-rs wrapper on it and see the speed diff.

Rust

What’s the fastest way to read a lot of files?

C

JS Glob Libs

  • https://github.com/terkelg/tiny-glob
    • Uses globrex

    ~350% faster than node-glob and ~230% faster than fast-glob

  • https://github.com/mrmlnc/fast-glob
    • Uses micromatch, can also use picomatch

JS Matcher Libs

  • isaacs/minimatch
  • micromatch/micromatch
  • micromatch/picomatch
    • Comparison - https://github.com/micromatch/picomatch#library-comparisons
  • globrex

WASI

No implementation exists yet.

Would be good to try with AssemblyScript.

vjpr avatar Oct 15 '21 20:10 vjpr

I considered going native before but the benefit of a 0 dependency purely Javascript library is far greater — I don't have to deal with platform differences or worry about building the module in all platforms. NAPI is a nightmare. Right now, fdir just works and it is fast. Aside from that, I am also not proficient enough in Rust or C or C++ to write high performance code. I am sure native would be faster.

I would love to benchmark of fdir against other native options. That'd be actually pretty awesome. PRs are welcome :smile:

thecodrr avatar Oct 16 '21 03:10 thecodrr

Alright, curiosity got the better of me and I benchmarked jwalk with fdir using napi-rs.

Code

I am no expert in Rust so this code is probably sub-par.

#![deny(clippy::all)]

#[macro_use]
extern crate napi_derive;

use napi::{CallContext, JsObject, JsUndefined, Result};

use jwalk::{Parallelism, WalkDir};

#[module_exports]
fn init(mut exports: JsObject) -> Result<()> {
    exports.create_named_method("jwalk", jwalk)?;
    Ok(())
}

#[js_function(1)]
fn jwalk(ctx: CallContext) -> Result<JsUndefined> {
    // let dir = ctx.get::<JsString>(0)?.into_utf8()?;

    let mut entries: Vec<String> = vec![];

    for entry in WalkDir::new("/")
        .parallelism(Parallelism::RayonNewPool(8))
        .skip_hidden(false)
    {
        match entry {
            Ok(entry) => entries.insert(entries.len(), entry.path().display().to_string()),
            Err(_err) => {
                println!("failed to access entry");
            }
        }
    }
    ctx.env.get_undefined()
}

I tried to be as fair as possible but my limited knowledge of Rust probably doesn't help. Maybe @Brooooooklyn can take a look and advise some improvements to the Rust code.

The benchmark code:

const { fdir } = require("./index.js");
const { jwalk } = require("fdir-bench");

(async () => {
  console.time("fdir");
  for (let i = 0; i < 5; ++i) {
    await new fdir()
      .withBasePath()
      .withDirs()
      .crawl("/")
      .withPromise();
  }
  console.timeEnd("fdir");
  console.time("jwalk");
  for (let i = 0; i < 5; ++i) {
    jwalk("/");
  }
  console.timeEnd("jwalk");
})();

Of course, this is very simplistic but it does the job.

Results

fdir: 7.795s
jwalk: 6.958s

Number of entries found:

fdir: 1685964 (dirs & files)
jwalk: 1689508 (dirs & files)

Thoughts

Well, fdir is pretty fast. Only 1.2s slower than jwalk. I am certain after a few optimizations fdir can beat jwalk.

thecodrr avatar Oct 19 '21 14:10 thecodrr

Try this code to avoid re-allocated in Rust

let result: Vec<String> = WalkDir::new("/")
  .parallelism(Parallelism::RayonNewPool(8))
  .skip_hidden(false)
  .filter_map(|dir_entry| dir_entry.ok().map(|entry| entry.path().display().to_string().to_string()))
  .collect();

Brooooooklyn avatar Oct 19 '21 14:10 Brooooooklyn

@Brooooooklyn

Unfortunately getting this:

error[E0599]: the method `filter_map` exists for struct `WalkDirGeneric<((), ())>`, but its trait bounds were not satisfied
   --> src/lib.rs:25:10
    |
25  |         .filter_map(|dir_entry| {
    |          ^^^^^^^^^^ method cannot be called on `WalkDirGeneric<((), ())>` due to unsatisfied trait bounds
    | 
   ::: /home/thecodrr/.cargo/registry/src/github.com-1ecc6299db9ec823/jwalk-0.5.1/src/lib.rs:158:1
    |
158 | pub struct WalkDirGeneric<C: ClientState> {
    | ----------------------------------------- doesn't satisfy `WalkDirGeneric<((), ())>: Iterator`
    |
    = note: the following trait bounds were not satisfied:
            `WalkDirGeneric<((), ())>: Iterator`
            which is required by `&mut WalkDirGeneric<((), ())>: Iterator`

thecodrr avatar Oct 19 '21 14:10 thecodrr

Ah, had to do into_iter.

thecodrr avatar Oct 19 '21 14:10 thecodrr

Any difference in results?

Also might be useful to add a pattern filter such as '**/package.json'.

vjpr avatar Oct 19 '21 20:10 vjpr

Did some testing of my own.

MacBook Pro 16 - 2.4 GHz 8-Core Intel Core i9

jwalk - cargo run (unoptimized + debuginfo)

16 threads = 3738ms

jwalk - cargo run --release

8 threads = 2363ms 16 threads = 1732ms

jwalk - napi (returning results)

817,100 files 16 threads = 2369ms

fdir

773,467 files 2340ms

jwalk - cargo run - filter node_modules 300ms

jwalk - cargo run - filter node_modules + only package.json ...

fdir - filter node_modules 968ms

fdir - filter node_modules + only package.json 1020ms


It's actually very impressive how close they are considering one is running natively...for a large number of files.

Although when globbing is involved, jwalk is winning by quite a bit.

Tried using both .glob and .filter to search for package.json and I couldn't get it below 900ms.

I also found that using the Rust library fd I can get 180ms which is impressive. Seems like all these libraries are adding a lot of overhead somewhere.

time fd package.json > /dev/null
fd package.json > /dev/null  0.17s user 0.18s system 824% cpu 0.043 total

fd seems to use the ignore package to walk dirs (which is actually used in ripgrep) which jwalk is suppose to be faster than.

My use case is to speed up the pnpm package manager reading all package.json files on startup. Currently they use fast-glob which uses micromatch. Because this has to run on startup, 200ms makes a big difference compared 1s.


Going to try out ignore as a native lib. Seeing if there is a good api to use: https://github.com/BurntSushi/ripgrep/discussions/2028

Could also re-use the glob libs from ignore/ripgrep with jwalk and it should be fast enough. jwalk has a simple parallel API.

Maybe worth looking at using tiny-glob (globrex) for fdir as it claims to be a lot faster.

vjpr avatar Oct 20 '21 02:10 vjpr

@vjpr Can you share your benchmark code? I'd like to take a look and see what is causing that huge bottleneck.

thecodrr avatar Oct 20 '21 05:10 thecodrr

After a couple "optimizations", here are the results:

jwalk: 2.6862s
ignore: 5.747s
fdir: 3.2936s

These results are average of 5 iterations on the root directory containing 1.6million files & about 300k directories.


jwalk is very fast. fdir gets really close but not quite there. I haven't tried using worker_threads yet for directory traversal so perhaps there is opportunity for performance there. ignore is slow. I have no idea how fd is getting such performance? They say they are doing it in parallel which is also being done by jwalk (via rayon) & fdir (via libuv).

thecodrr avatar Oct 20 '21 13:10 thecodrr

@vjpr So I went ahead and did a benchmark with fd which you claimed to do everything within 200ms. Here are the results:

time fdfind --hidden --no-ignore package.json / > /dev/null

real    0m1.203s
user    0m4.639s
sys     0m4.378s

And I did the same thing with fdir:

time UV_THREADPOOL_SIZE=2 NODE_ENV=production node test.js

real    0m1.226s
user    0m2.097s
sys     0m1.464s

Both returned around the same amount (26k) of results.

Now, maybe I did it wrong but clearly fdir is faster than or about the same as fd in terms of speed. Of course, fd is doing plenty of other things so it isn't exactly fair but still...

Here's the code for the above fdir script:

  await new fdir()
    .withBasePath()
    .filter((path) => path.indexOf("package.json") > -1)
    .crawl("/")
    .withPromise();

One thing I noticed: using UV_THREADPOOL_SIZE=2 gave the best performance. Although it wasn't that much worse with default threadpool size either but if I increased it to 8, the crawling speed went down 2x.

thecodrr avatar Oct 20 '21 14:10 thecodrr

@thecodrr What repo were you using to test?

Also just to note, default UV_THREADPOOL_SIZE=4.

vjpr avatar Oct 20 '21 14:10 vjpr

I didn't use any repo. I ran it in my root. Yes, that is 1.6 million files & 300k directories — bigger than 99% of the repositories which means more pattern failures. If you run it inside fdir repo, it completes in under 12-13 ms.

Which repo did you test on?

Also just to note, default UV_THREADPOOL_SIZE=4.

Yes, I know. That's what surprised me. No idea why size of 2 gave a slightly better perf. The difference is too negligible on small directories though.

thecodrr avatar Oct 20 '21 14:10 thecodrr

I'm using a large internal monorepo. We should find a nice and large public Node.js monorepo to use for tests.

So for me, increasing UV_THREADPOOL_SIZE=8, fdir starts beating jwalk, which is cool!

What machine are you using?

I'm still yet to test out ignore with crossbeam parallelism which seems to get 200ms.

vjpr avatar Oct 20 '21 14:10 vjpr

I'm using a large internal monorepo. We should find a nice and large public Node.js monorepo to use for tests.

What do you suggest? React? Jest?

So for me, increasing UV_THREADPOOL_SIZE=8, fdir starts beating jwalk, which is cool!

Very interesting.

What machine are you using?

Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz with 16 GB RAM and of course an SSD (a mediocre one). And I am on Linux.

thecodrr avatar Oct 20 '21 14:10 thecodrr

So ignore with parallel is about 200ms faster than fdir. 12 threads.

fn get_overrides(source: String) -> Override {
  let mut override_builder = OverrideBuilder::new(source.clone());

  let exclude_patterns = vec!["**/package.json", "!**/node_modules/**"];

  for pattern in exclude_patterns {
    override_builder
      .add(pattern)
      .map_err(|e| anyhow!("Malformed exclude pattern: {}", e));
  }
  let overrides = override_builder
    .build()
    .map_err(|_| anyhow!("Mismatch in exclude patterns"));

  overrides.unwrap()
}

pub fn walk_parallel(source: String) -> Vec<String> {
  let overrides = get_overrides(source.clone());

  let (tx, rx) = mpsc::channel();
  WalkBuilder::new(source)
    .hidden(false)
    .overrides(overrides)
    .standard_filters(false)
    .threads(cmp::min(12, num_cpus::get()))
    .build_parallel()
    .run(move || {
      let tx = tx.clone();
      Box::new(move |dir_entry_result| {
        if let Ok(dir_entry) = dir_entry_result {
          tx.send(dir_entry.file_name().to_owned()).unwrap();
        }
        ignore::WalkState::Continue
      })
    });
  let mut metadatas: Vec<_> = rx.into_iter().map(|dir| dir.into_string().unwrap()).collect();
  //metadatas.sort_by(|a, b| a.len().cmp(&b.len()))
  metadatas
}

I'm putting together a repo now.

vjpr avatar Oct 20 '21 15:10 vjpr

Linux

There was some discussion here about macOS APFS being slower: https://users.rust-lang.org/t/whats-the-fastest-way-to-read-a-lot-of-files/39743/4?u=vjpr

...leads me to the conclusion that APFS obtains a global kernel lock during read-only operations such as readdir(). In other words, APFS slows down when attempting to perform parallel read-only I/O. ... It is apparent that macOS 10.14 Mojave has received performance work relative to macOS 10.13 https://gregoryszorc.com/blog/2018/10/29/global-kernel-locks-in-apfs/ https://news.ycombinator.com/item?id=18332260

Although could be just per-directory lock for readdir as someone mentions.

vjpr avatar Oct 20 '21 15:10 vjpr

So ignore with parallel is about 200ms faster than fdir. 12 threads.

That's a lot of effort :laughing:. I put together a very, very rough parallel directory crawling in node using clusters to get 1.4s. Without clusters, I was getting around 6 seconds. fdir can touch 1.4s without clusters. Guess where fdir would be after proper parallelism...

thecodrr avatar Oct 20 '21 16:10 thecodrr

Nice. Cluster seems like the way to go to reach the Node.js max perf. Although for CLI cold-start use cases, would depend on how long it takes the process pool to spin up which might negate the benefit.

My particular use case is 'pnpm' package manager. Many commands need to read the entire tree on startup, so if i could get that down to <500ms for large monorepos it would be great.

FYI, not sure workers will help.

Workers (threads) are useful for performing CPU-intensive JavaScript operations. They do not help much with I/O-intensive work. The Node.js built-in asynchronous I/O operations are more efficient than Workers can be. https://nodejs.org/api/worker_threads.html#worker-threads

vjpr avatar Oct 20 '21 17:10 vjpr

so if i could get that down to <500ms for large monorepos it would be great.

You'll have to give an example of a large monorepo so I can test it. To the best of my knowledge, you can easily traverse a fairly large monorepo under 500ms. That's excluding symlink resolution. I haven't tested with symlink resolution but it should be much, much slower.

FYI, not sure workers will help.

Yes, but child_process might. Have to check and see. I don't expect a huge difference mostly because libuv has all the threads maxed out most of the time. But doesn't hurt to try.

Although for CLI cold-start use cases, would depend on how long it takes the process pool to spin up

This is a real concern.


The main bottleneck is readdir and as far as I know, there are no faster alternatives in node yet. At the end of the day, everything is limited by IO so there is a point after which the quest for performance becomes foolish.

thecodrr avatar Oct 20 '21 18:10 thecodrr

The main bottleneck is readdir and as far as I know, there are no faster alternatives in node yet.

fs.readdir is the most direct mapping to the underlying syscall readdir(3), so this will always be a limit in a Node env.

The only potential alternative would be WASI, but the Node.js implementation just uses libuv under the hood anyway. However when using WASI the regex logic could be moved into WASM which might improve things. AssemblyScript could be used to write the WASM in TypeScript rather than needing a syntax change.

If we don't restrict ourselves to cold-start, we could use a persistent daemon that watches the file system for changes. This is a way to get instant speed.

  • a. On file change, re-walk the entire tree. CLI queries daemon for instant response.
    • High background CPU load.
  • b. On file change, run matchers on only the changed files. CLI queries daemon for instant response.
    • Less background CPU load.
  • c. On CLI start, load a cache, query daemon for file changes, and apply. I use FB's watchman package for this at the moment.
    • No background CPU load.

I think that covers the entire solution space for reading dir trees :D

vjpr avatar Oct 20 '21 22:10 vjpr

My benchmarks:

robocopy.exe with /log:output.txt - 7s fdir - 2 minutes

On a volume with around 1,000,000 files.

PlkMarudny avatar Aug 03 '23 08:08 PlkMarudny

@PlkMarudny whats your code for fdir and what are you benchmarking through?

thecodrr avatar Aug 03 '23 20:08 thecodrr