
Lint multiple files in parallel [$500]

Open ilyavolodin opened this issue 9 years ago • 169 comments

This is a discussion issue for adding ability to run eslint in parallel for multiple files.

The idea is that ESLint is mostly CPU bound, not IO bound, so creating multiple threads (for a machine with multiple cores) might (and probably will) increase performance in a meaningful way. The downside is that ESLint's codebase is currently synchronous, so this would require rewriting everything up to and including eslint.js to be asynchronous, which would be a major effort.

I played with this a little while ago and found a few libraries for Node that handle thread pools, including detection of the number of cores available on the machine.

  • Node-threads-a-gogo - seems pretty good, but looks dead.
  • nPool - seems actively in development, but has native components (C++)
  • Node WebWorkers - seems pretty dead too.
  • Parallel - seems dead, and no pool implementation.
  • Node Clusters - not stable yet, and probably isn't going to be available on Node v0.10
  • WebWorkers - seems that they are only implemented in io.js

And there are a ton of other libraries out there for this.

If anyone has experience writing multithreaded applications for Node.js and would like to suggest alternatives or comment on the above list, please feel free.

P.S. https://www.airpair.com/javascript/posts/which-async-javascript-libraries-should-i-use

Want to back this issue? Post a bounty on it! We accept bounties via Bountysource.

ilyavolodin avatar Aug 28 '15 18:08 ilyavolodin

Nice. I'm very interested in trying this myself too.

BYK avatar Aug 28 '15 19:08 BYK

Another question I had in mind: if we need to rewrite everything to be async, should we use the callback pattern? Promises? If so, which library: Q or Bluebird? I personally would prefer promises to callback hell.

ilyavolodin avatar Aug 28 '15 19:08 ilyavolodin

I vote for promises. Bluebird is fast, but makes me nervous because it adds methods to the native Promise, which is likely not a great idea. I think Q might be the best bet. There is also Asequence, but I have no personal experience with it.

IanVS avatar Aug 28 '15 19:08 IanVS

Why not use built in promises? Just a question as I have no experience with promises yet.

gyandeeps avatar Aug 28 '15 19:08 gyandeeps

They're not supported in node 0.10, to my knowledge.

Besides that, the libraries give some nice "sugar" methods when working with Promises.

IanVS avatar Aug 28 '15 19:08 IanVS

I've had plenty of success using native promises (or a polyfill when native promises aren't supported). That seems like a good starting point to me; if we need more than they provide we could probably swap out something that's API-compatible.
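For illustration, here is a minimal sketch of that approach: fs.readFile wrapped in a promise, assuming either native Promise support or an API-compatible polyfill has been loaded (the lintText/handleError names are placeholders, not real ESLint functions).

```js
// Minimal sketch: wrap fs.readFile in a promise, assuming a native
// Promise implementation or an API-compatible polyfill is available.
var fs = require("fs");

function readFileAsync(filePath) {
    return new Promise(function(resolve, reject) {
        fs.readFile(filePath, "utf8", function(err, contents) {
            if (err) {
                reject(err);
            } else {
                resolve(contents);
            }
        });
    });
}

// Hypothetical usage:
// readFileAsync("lib/eslint.js").then(lintText).catch(handleError);
```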

btmills avatar Aug 28 '15 20:08 btmills

I think we're putting the cart before the horse here. Let's hold off on promises vs. callbacks until we're at least ready to prototype. Get something working with callbacks and let's see how bad it is (or not).

nzakas avatar Aug 28 '15 20:08 nzakas

The idea is that ESLint is mostly CPU bound, not IO bound

ESLint also does a lot of IO (directory traversal, reading source files), so I think we would also profit here if we rewrote ESLint to do non-blocking IO.

lo1tuma avatar Aug 28 '15 21:08 lo1tuma

@lo1tuma I haven't profiled it yet, but in my mind, the amount of IO we do is negligible compared to the amount of CPU cycles we eat. I will try to profile it and post results here if I get anything meaningful.

ilyavolodin avatar Aug 28 '15 23:08 ilyavolodin

Using something like NodeClusters - or most other per-process implementations - would avoid the issue of needing to [vastly] rewrite ESLint. (Such implementations are strictly not threading, but allow managed parallel process execution.)

It would mostly just need to IPC-in/out the current ESLint; ESLint parallelism would then be able to work freely over different files in a per-process manner, but it could not (without more rework) run concurrently over a single file.

Thus if the goal is to run ESLint over different files in parallel I would urge such a 'simple' per-process concurrency approach. If the goal is to make ESLint parallel across the same source/AST then .. that is a more complicated can of worms as it changes the 'divergence' point.

If there is a strict v0.10 Node target for ESLint, maybe have this as a feature only when running a compatible Node version.

pnstickne avatar Aug 29 '15 16:08 pnstickne

My idea is:

  • The master process does queue control and collects results.
  • Worker processes have a short queue (the length is about 2-4). A worker does:
    • Sends the result of the last file to the master and requests the path of the next-next-next file.
    • Asynchronously reads the next-next file.
    • Lints the next file.

Source files vary a lot in size, so I think pull-style queue control is important.

Library.... I think child_process.fork is enough.
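A rough sketch of the master side of that pull-style design, using child_process.fork; the ./worker module and the message shape are hypothetical, not anything ESLint ships.

```js
// Pull-style master: one worker per core, each worker is handed its
// next file as soon as it reports the result of the previous one.
var fork = require("child_process").fork;
var os = require("os");

function lintInParallel(files, done) {
    var queue = files.slice();
    var results = [];
    var active = 0;

    os.cpus().forEach(function() {
        var worker = fork(require.resolve("./worker")); // hypothetical worker module
        active++;

        worker.on("message", function(message) {
            results.push(message);
            giveWork(worker);
        });
        giveWork(worker);
    });

    function giveWork(worker) {
        if (queue.length) {
            worker.send({ file: queue.shift() });
        } else {
            // No more files: shut this worker down and finish once the
            // last one has reported in.
            worker.disconnect();
            if (--active === 0) {
                done(results);
            }
        }
    }
}
```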

mysticatea avatar Aug 29 '15 16:08 mysticatea

Sorry for this question: based on all the comments, do we think the reward of this functionality is worth the effort/changes/complications? (Like I said, just a question.) Or is this functionality too early (like Google Glass) to implement without any actual estimates or a concrete ask?

gyandeeps avatar Aug 29 '15 16:08 gyandeeps

@gyandeeps My projects are not big enough, or computer slow enough, for me to really care either way.

In cases where there are sizable projects of many files, on computers with several cores and I/O that isn't the bottleneck, I imagine it could lead to significantly reduced wall-clock time, approaching Amdahl's law. I would be less optimistic about this gain with fewer, larger files, even with 'actual threading' or post-AST handling - but that is what performance profiles are for.

Of course, another option is to only lint 'changed' files and provide some sort of result cache, but that comes with additional data-management burdens.

pnstickne avatar Aug 29 '15 17:08 pnstickne

@gyandeeps To answer your question - we do not have definitive information on this. Right now my assumption is that we are CPU-bound, not IO-bound. In that case, utilizing more cores should have a significant impact on larger projects (as @pnstickne mentioned, the impact will be small, or there might even be a negative impact, on a few large files). I think the first step of the process would be to prove or disprove my assumption on CPU vs. IO. Granted, if it turns out I'm wrong and we are IO-bound, then changing ESLint's code to async would improve performance anyway.

@pnstickne Thanks for the insights. I'm really not familiar with NodeClusters, I just know that they exist and do not require everything to be async for them to work. We're definitely not going to try to make ESLint run AST analysis in parallel; that would be a huge change, and a lot of our rules expect nodes to be reported in the right order, so we would not only need to rewrite pretty much the whole core, we would also need to rewrite a lot of rules. I would be interested in learning more about how we could incorporate code that would only execute on Node 0.12 and not on earlier versions. I'm not sure that's the approach I would like to take, but it's an interesting option nevertheless.

ilyavolodin avatar Aug 29 '15 17:08 ilyavolodin

So I did some very unscientific performance analysis on both Windows and OSX. I specifically chose to lint a very large number of files (TodoMVC, 1597 files, 854 folders). On Windows, the results were very inconclusive. Basically, my CPU usage was never over 15% (on an 8-core machine, none of the cores were overtaxed at any point). But then my I/O never hit anything over 10MB/s on an SSD that's theoretically capable of 550MB/s, so I have no idea why it didn't run any faster. On OSX it never hit more than 15% CPU (I have no idea how many cores my MacBook has, probably 8 too), but I/O was pretty taxed. It looked like there were a few spikes that were reaching 100% disk throughput. So maybe my assumption was wrong and we are not CPU bound?

ilyavolodin avatar Aug 30 '15 01:08 ilyavolodin

@ilyavolodin Try this:

Start running 8 different eslint processes (over the same set of files should be fine, although it would be 'more fair' to break up the files into 8 different equally-sized sets).

Compare the wall-clock times it takes for 1 process to complete and 8 processes to complete.

The 8 processes will have done 8x the work (if using the same set of files for each process as for the single process) or the same amount of work (if the source files were split among them), but how many times longer did they take?

This very crudely should show an approximate gain - if any - for using multi-process concurrency.
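One crude way to script that comparison is sketched below; it assumes the eslint CLI is on the PATH and that lib/ is the directory being linted (on Windows the spawn call may need a shell to resolve eslint.cmd).

```js
// Crude timing harness: spawn N eslint processes over the same target
// and report how long it takes for all of them to finish.
var spawn = require("child_process").spawn;

function timeRuns(count, done) {
    var start = Date.now();
    var remaining = count;

    for (var i = 0; i < count; i++) {
        spawn("eslint", ["lib/"], { stdio: "ignore" }).on("exit", function() {
            if (--remaining === 0) {
                done(Date.now() - start);
            }
        });
    }
}

timeRuns(1, function(oneProcess) {
    timeRuns(8, function(eightProcesses) {
        console.log("1 process:   " + oneProcess + "ms");
        console.log("8 processes: " + eightProcesses + "ms");
    });
});
```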

pnstickne avatar Aug 30 '15 01:08 pnstickne

Late to the conversation, but... Is anyone opposed to someone starting up a pull request to implement the callback hell approach (i.e., make ESLint async across the board, including for all I/O operations)? Seems like it would make the eventual parallelization easier anyway.

platinumazure avatar Aug 30 '15 01:08 platinumazure

Of course another option is to only lint 'changed' files and provide some sort of result cache, but comes with additional data management burdens. -@pnstickne

This is essentially what ESLinter from @royriojas does.

IanVS avatar Aug 30 '15 01:08 IanVS

@IanVS That's pretty cool .. now if only it was built-in to my eslint grunt task :}

(Okay, I could get it shimmed in pretty easy - but it'd still be nice to see a 'done package'.)

pnstickne avatar Aug 30 '15 01:08 pnstickne

@pnstickne #2998 @platinumazure I think I would like to first prove that there's something to gain by going async/multithreaded, just so nobody wastes their time creating a very large PR that will then be closed because there is no gain.

ilyavolodin avatar Aug 30 '15 14:08 ilyavolodin

@ilyavolodin That's fair enough. I'm wondering, though, would it be worth creating a separate issue with the goal of making ESLint's I/O operations asynchronous? Synchronous I/O is a bit of an anti-pattern in Node and it might help us figure out just how much of a breaking change parallelization would likely be.

platinumazure avatar Aug 30 '15 15:08 platinumazure

@platinumazure While sync code is a Node anti-pattern, in our case we can't do anything with the code (as in parse it into an AST) until we've read the whole file. So if improving performance is not on the table, then changing the code to async would increase its complexity without gaining us anything. It's worth testing out, and I still think there's a performance gain we can get by doing parallel execution, but I would like to get some proof of that first.

ilyavolodin avatar Aug 30 '15 15:08 ilyavolodin

@ilyavolodin reading files asynchronously doesn’t mean you read them chunk-by-chunk. While you are waiting for one file to be read, you can lint a different file which has been read already.
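A small sketch of that overlap, assuming linter.verify (ESLint's in-memory API) and a placeholder config object:

```js
// Read every file asynchronously and lint each one as soon as its
// contents arrive; linting still happens on the main thread, but it
// overlaps with the reads that are still in flight.
var fs = require("fs");
var linter = require("eslint").linter;

function lintFiles(files, config, done) {
    var remaining = files.length;
    var results = {};

    files.forEach(function(filePath) {
        fs.readFile(filePath, "utf8", function(err, text) {
            results[filePath] = err ? err : linter.verify(text, config);
            if (--remaining === 0) {
                done(results);
            }
        });
    });
}
```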

lo1tuma avatar Aug 30 '15 15:08 lo1tuma

Wall clock time is the only metric that matters for this evaluation. We have had people report they were using ESLint on projects with thousands of files, so if there's a benefit, it will be seen.

The thing to keep in mind is that directory traversal must be done before linting can begin. Thus, we get zero benefit from switching to an async directory traversal. We need to build up the list of files to lint and then fork off to start linting, at which point each process is doing a file read, and there's no benefit to async there either.

Overall, I'd just like to squash the "rewrite to make things async" approach. We should be able to make a change in CLIEngine to do this without a major rewrite.

nzakas avatar Aug 31 '15 01:08 nzakas

Ok, I think I finally figured out a good way to check whether ESLint is CPU bound vs. IO bound. I ran eslint on the same set of files (544 files in 234 directories), first with just one rule on (`yoda: [2, "never"]`) and then with the .eslintrc file from the ESLint repository (so a lot of rules on). As far as I can tell, the amount of IO should be exactly the same in both cases, but the number of CPU cycles will be significantly different. This doesn't mean that IO is not a problem, but at least it's something. The numbers were as follows:

Test #1: 9.630s
Test #2: 1:27.880s

Based on the dramatic difference in performance, I want to say that we are not IO bound. The next step would be to create a threaded prototype of some sort and see what the difference in performance is going to be.

ilyavolodin avatar Sep 10 '15 00:09 ilyavolodin

I'm still for 'per process' concurrency.

pnstickne avatar Sep 10 '15 06:09 pnstickne

Ok, I created a very simple prototype of a "process pool" that spawns a number of processes based on the number of cores available and sends files to those processes for linting. I had to turn off reporting (formatters are not invoked), but here are the numbers: single-process linting of 544 files in 234 directories took 9.585 seconds; the same files took 4.168 seconds over 8 processes (I have 8 cores). I still need to do some major cleanup on the "process pool", and everything up to and including CLIEngine needs to be turned into async code, but it looks like it might be worth it.
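For context, the worker end of such a pool could be roughly this small; this sketch leans on the existing CLIEngine programmatic API and assumes the same hypothetical message shape as the fork-based master sketched earlier in the thread.

```js
// worker.js (hypothetical): lint whatever file the master sends and
// report the results back over the IPC channel.
var CLIEngine = require("eslint").CLIEngine;

var engine = new CLIEngine();

process.on("message", function(message) {
    var report = engine.executeOnFiles([message.file]);

    // Send only the per-file results back; the master aggregates them
    // and hands everything to a formatter at the end.
    process.send({ file: message.file, results: report.results });
});
```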

ilyavolodin avatar Sep 12 '15 01:09 ilyavolodin

@ilyavolodin, is your prototype public? I would love to see it. Currently the cache module uses statSync for simplicity's sake, but it would be nice to see if that could be done async as well.

royriojas avatar Sep 12 '15 01:09 royriojas

@royriojas Not yet. I only wrote a pooling library for child_process, and even that needs some major cleanup before I would be comfortable posting it anywhere :-) I didn't convert anything to async code yet, I just added process.exit(0) once all of the linting is done (so no reporting of any kind for now). I also commented out all of the cache code for this test, because it was hard to find a spot to plug my code in with caching on.

ilyavolodin avatar Sep 12 '15 01:09 ilyavolodin

Just wanted to chime in. We have a medium-size project, linting 1370 (larger) files with a good number of rules in a powerful VM, and it takes about 30 seconds. It's manageable, but painful.

akras14 avatar Sep 29 '15 21:09 akras14