
Heavy memory usage

Open Banou26 opened this issue 3 years ago • 51 comments

🐛 bug report/question

Trying to import ffmpeg.js results in a process taking up 5 GB+ of RAM within a few seconds. The imported file is 10 MB, so I can understand it using a lot of RAM, but that much?

This has been recurring across my projects: Parcel processes taking up a lot of RAM. Do you have any plans to reduce the memory footprint?

Banou26 avatar Aug 25 '20 21:08 Banou26

We're definitely working on addressing this; it's currently one of our top priorities. We're mainly focusing on stability and performance at this point, so we can release a stable version of Parcel 2.

I'm able to reproduce this and will figure out why this is happening exactly, will report back in this issue.

DeMoorJasper avatar Aug 26 '20 06:08 DeMoorJasper

I've been able to mitigate this problem by passing PARCEL_WORKERS=1 to it. It seems like the memory usage has something to do with how many worker threads it's trying to run. Maybe they're duplicating work?

download13 avatar Sep 06 '20 04:09 download13

I have (it seems) exactly the same issue with ffmpeg.js and Parcel 1.12.4: it runs out of memory when I import ffmpeg and start parcel serve. Is there any existing workaround, maybe some Parcel CLI arguments?

roman-petrov avatar Sep 12 '20 06:09 roman-petrov

@roman-petrov You should try Parcel 2 with the workaround from the comment directly above yours:

I've been able to mitigate this problem by passing PARCEL_WORKERS=1 to it.

Banou26 avatar Sep 13 '20 12:09 Banou26

@Banou26 , thank you. I will try to upgrade to Parcel 2 and use PARCEL_WORKERS=1

roman-petrov avatar Sep 13 '20 13:09 roman-petrov

I think Parcel should set the default number of workers to 1, or something reasonable like 2 or 4. Using too many workers is SLOWER than just using 1 worker.

Running Parcel on a laptop with a modern processor with 16 threads (AMD Ryzen 7 4800H) and 16GB of RAM:

First setting the number of workers to 1, the build takes 618 ms.

❯ $env:PARCEL_WORKERS=1
❯ ( Measure-Command -Expression { npm run build }  ).Milliseconds
618

Then, unsetting the number of workers, the build takes 710 ms and I hit issue #4628. So with the default config there is no benefit; it is even worse.

❯ $env:PARCEL_WORKERS=""
❯ ( Measure-Command -Expression { npm run build }  ).Milliseconds
console: (node:19400) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 17 error listeners added to [Socket]. Use emitter.setMaxListeners() to increase limit
(Use `node --trace-warnings ...` to show where the warning was created)
console: (node:19400) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 17 close listeners added to [Socket]. Use emitter.setMaxListeners() to increase limit
710

Setting the number of workers to something reasonable like 4 seems to actually improve the performance:

❯ $env:PARCEL_WORKERS="4"
❯ ( Measure-Command -Expression { npm run build }  ).Milliseconds
412

Using more than 4 makes everything worse. For example, using 8 workers:

❯ $env:PARCEL_WORKERS="8"
❯ ( Measure-Command -Expression { npm run build }  ).Milliseconds
624

I haven't looked into the implementation, but if the workers are implemented using "processes", I have no doubt that they would be slower, because each one starts a new Node process. If it uses WorkerThreads, the overhead would be much less, but maybe you are spawning too many threads for a simple task.

If you are considering rewriting the implementation, the threads package gives a nice abstraction over different types of threads with a single API.

aminya avatar Jan 09 '21 03:01 aminya

if the workers are implemented using "processes"

They aren't. Node's WorkerThreads are used when available (you could set PARCEL_WORKER_BACKEND=process to force using processes).

Could you check what https://github.com/parcel-bundler/parcel/blob/e183e32c811f066310fff0610aa856ca73a2fcb6/packages/core/workers/src/cpuCount.js#L49-L52 returns on your machine? That is used as the default PARCEL_WORKERS value on Windows.

I think we should cap that count to 4.

mischnic avatar Jan 09 '21 09:01 mischnic

Also, could you test this with a bigger project (one that takes something like 30s) to determine whether this is just the overhead of starting the workers itself?

mischnic avatar Jan 09 '21 13:01 mischnic

Could you check what

https://github.com/parcel-bundler/parcel/blob/e183e32c811f066310fff0610aa856ca73a2fcb6/packages/core/workers/src/cpuCount.js#L49-L52

returns on your machine? That is used as the default PARCEL_WORKERS value on Windows. I think we should cap that count to 4.

It returns 16, which is correct, but I don't think you should spawn all 16 threads due to the overhead.

I need to do more testing, but it seems that more than 4 is destructive.

aminya avatar Jan 09 '21 18:01 aminya

It returns 16 which is correct

(It should return 8 because that function is supposed to determine the number of real cores and not threads.)

mischnic avatar Jan 09 '21 18:01 mischnic

It returns 16 which is correct

(It should return 8 because that function is supposed to determine the number of real cores and not threads.)

I fixed it in #5617

aminya avatar Jan 09 '21 19:01 aminya

I don't think hard coding to 4 is a good idea. It doesn't make sense to me that there is a limit regardless of hardware. We currently base it on the number of available cores, which seems to make sense in regards to the amount of parallelism that is possible. If it slows down after a certain number, there must be a bottleneck somewhere that we can potentially solve. I had looked into this somewhat before but couldn't determine what the problem was. Maybe someone will have better luck.

devongovett avatar Jan 09 '21 19:01 devongovett

Wrong benchmark (corrected in a later comment)

Running the benchmark using Parcel 2.0.0-nightly.520 gives another result. Using workers is slower altogether! No matter how many.

// disclaimer: these fluctuate. Running the command the second time gives a better result (despite running `npm run clean` in between)
1 worker: 384ms
2 workers: 450ms
4 workers: 698ms
8 workers: 746ms

Ran on solid-simple-table. The command was npm run build with && npm run style deleted from the end.

I don't think hard coding to 4 is a good idea. It doesn't make sense to me that there is a limit regardless of hardware. We currently base it on the number of available cores, which seems to make sense in regards to the amount of parallelism that is possible. If it slows down after a certain number, there must be a bottleneck somewhere that we can potentially solve. I had looked into this somewhat before but couldn't determine what the problem was. Maybe someone will have better luck.

Parallelism is only helpful when the overhead is low; only then does it yield performance benefits. https://youtu.be/9hJkWwHDDxs?t=1016

aminya avatar Jan 09 '21 19:01 aminya

In your 300ms example, no workers might be faster (btw you can also do PARCEL_WORKERS=0 to actually use 0 workers).

But if your build takes more than a few seconds, it is beneficial.

mischnic avatar Jan 09 '21 19:01 mischnic

Sorry! I made a mistake in the PowerShell command for this benchmark. I thought .Milliseconds converts the whole time to milliseconds, but it just returns the milliseconds component of the TimeSpan (.TotalMilliseconds gives the whole duration)! 😥

I will run the benchmarks again

aminya avatar Jan 09 '21 19:01 aminya

OK, here is the correct benchmark. Still, increasing the number of workers has no effect.

| Workers | Time (s) |
| ------- | -------- |
| 0       | 9.72     |
| 1       | 10.63    |
| 2       | 10.18    |
| 4       | 10.42    |
| 6       | 10.37    |
| 8       | 10.36    |

Ran on solid-simple-table. The command was npm run build with && npm run style deleted from the end.

aminya avatar Jan 09 '21 21:01 aminya

Interesting. That suggests to me there is some kind of bug. More workers definitely shouldn't be slower. Can you run with the --profile flag and upload the results? You can open the profile in Chrome dev tools to view the benchmark results yourself.

devongovett avatar Jan 09 '21 22:01 devongovett

Here is the profile: profile-20210109-161111.zip

What I see is that a lot of synchronous filesystem methods are used instead of async ones, which would let Windows manage the disk reads and writes.

This load function for example is calling sync functions https://github.com/parcel-bundler/parcel/blob/e183e32c811f066310fff0610aa856ca73a2fcb6/packages/core/package-manager/src/NodePackageManager.js#L82

Sorted by total-time: [screenshot]

Sorted by self-time: [screenshot]

BTW, the profiler seems to have issues writing the files to disk. I have to run the profiler a couple of times (with clean in between) to get a working one; it exits with a crash code.

@parcel/core: Starting profiling...
/ Optimizing SimpleTable.js...
npm ERR! code 3221225477

aminya avatar Jan 09 '21 22:01 aminya

Based on the profile I see a couple things:

  1. Was this profile done without workers? I only see the "Master" thread.
  2. It looks like a majority of the time (58%) was spent in cssnano, specifically loading (requiring) a preset. This seems extremely excessive to me. It seems it spent a ton of time waiting for I/O, as you can see based on the time spent in open, lstat, and realpathSync. Not sure what would explain that.

That said, one thing that could explain workers not being faster for some cases is if a majority of the build time is spent in minification of one large bundle, for example. This is not parallelizable, so we'd expect the times to be similar in this case. In your profile, transformation only accounts for 1.4s of the total build time, whereas minification accounts for 11s. During that time, only a single thread will be active.
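That serial bottleneck can be quantified with Amdahl's law, plugging in the rough numbers from this profile (~11s of serial minification, ~1.4s of parallelizable transformation):

```javascript
// Amdahl's law: total speedup is limited by the serial fraction, so
// extra workers barely help when single-threaded minification dominates.
function speedup(serialSec, parallelSec, workers) {
  const total = serialSec + parallelSec;
  return total / (serialSec + parallelSec / workers);
}

console.log(speedup(11, 1.4, 1).toFixed(2)); // 1.00
console.log(speedup(11, 1.4, 8).toFixed(2)); // 1.11
```

So for this particular build, even 8 workers can only shave about 10% off the total time.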

devongovett avatar Jan 09 '21 22:01 devongovett

Results I get with your project:

macOS (4 core i5), last number is wall time

PARCEL_WORKERS=0: 6.74s user, 0.72s system, 150% cpu, 4.948 total
PARCEL_WORKERS=1: 6.88s user, 0.72s system, 155% cpu, 4.893 total
PARCEL_WORKERS=2: 8.42s user, 1.04s system, 188% cpu, 5.011 total
PARCEL_WORKERS=3: 7.58s user, 0.95s system, 190% cpu, 4.467 total
PARCEL_WORKERS=4: 7.75s user, 0.98s system, 191% cpu, 4.571 total

Windows 10 (much older 4 core i5):

PARCEL_WORKERS=0: 6.78s
PARCEL_WORKERS=1: 6.56s
PARCEL_WORKERS=2: 6.78s
PARCEL_WORKERS=3: 6.68s
PARCEL_WORKERS=4: 6.82s

Not sure why it's slower in your case. Tested with Yarn & pnpm on macOS and Yarn on Windows.

mischnic avatar Jan 09 '21 22:01 mischnic

Based on the profile I see a couple things:

  1. Was this profile done without workers? I only see the "Master" thread.

No. The worker number was set to 8.

  2. It looks like a majority of the time (58%) was spent in cssnano, specifically loading (requiring) a preset. This seems extremely excessive to me.

I have only one small Less file! I'm not sure what minification it is doing there! https://github.com/aminya/solid-simple-table/blob/master/src/SimpleTable.less

It seems it spent a ton of time waiting for I/O, as you can see based on the time spent in open, lstat, and realpathSync. Not sure what would explain that.

Yes, disk I/O seems to be the bottleneck here; it's unrelated to the CPU.

aminya avatar Jan 09 '21 22:01 aminya

not sure why it's slower in your case. Tested with Yarn & pnpm on macOS and Yarn on Windows.

I used PowerShell for timing; Parcel itself reports almost 4 seconds less!

√ Built in 6.07s

aminya avatar Jan 09 '21 22:01 aminya

Just to add some extra data, on my Ryzen 7 4800H with 16 GB of RAM, running [email protected] on node v15.6.0 takes:

  • 1 worker: 50.56s, 1994M resident memory
  • 2 workers: 33.80s, 1926M
  • 3 workers: 30.80s, 2723M
  • 4 workers: 31.88s, 2915M
  • 8 workers: 34.88s, 4426M
  • 16 workers: fills up my RAM, proceeds to fill up my swap and annihilate my computer

I did the 3 workers test after testing with 2, 4 and 8 just to check for any potential sweet spot.

I noticed that rebuilding with a non-empty dist folder (that is, serve, kill, then serve again without cleaning) is slower: 35.56s, with 2383M resident memory. However, parcel memory usage is as inconsistent as it gets, so I attribute it to sheer luck.

ranisalt avatar Jan 15 '21 14:01 ranisalt

One interesting thing we haven't explored yet is whether this is specific to worker threads or whether it also applies to processes. Is there a single memory limit across all threads or is it per thread? This could affect the frequency of garbage collection. Could people in this thread also run it with the PARCEL_WORKER_BACKEND=process environment variable and compare the results with different worker counts?

devongovett avatar Jan 15 '21 15:01 devongovett

@devongovett with 4 workers, using the process backend, it takes 35.73s (though I have more software running simultaneously), but resident memory usage dropped to 834M.

ranisalt avatar Jan 15 '21 16:01 ranisalt

Sure, it may be slower in the absolute sense than threads. I'm more interested in whether it still slows down as much after 4 process workers as it does with threads.

devongovett avatar Jan 15 '21 16:01 devongovett

As I showed before, the performance bottleneck is disk I/O and not CPU.

These benchmarks don't show any improvement because CPU is not an issue here.


Regarding CPU (assuming Parcel has fixed the slow I/O problem):

Even if you write lock-free parallel code, you should not always spawn all the threads. Parcel spawns all the threads even for small builds, which results in excess memory usage.

The number of threads should instead be proportional to the data size; Parcel should use a similar approach to this: https://github.com/atom-community/zadeh/blob/97548782ae3b7dac66a087ad5afdad9c8fb7e770/src/common.h#L74

aminya avatar Jan 15 '21 20:01 aminya

I'm curious if you've tried setting the UV_THREADPOOL_SIZE environment variable. This adjusts the libuv (node) thread pool size which is what handles disk IO. By default it's 4. Is it possible that this is the bottleneck? More than 4 parcel workers might be bottlenecked by IO thread pool. Raising this may allow more parallel IO to occur.
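Note that the variable has to be in the environment before Node starts (e.g. `UV_THREADPOOL_SIZE=16 npm run build`); a running process can only read it back:

```javascript
// libuv sizes its I/O thread pool once at startup; the default is 4.
// A process can inspect the setting but not change it after the fact.
const poolSize = Number(process.env.UV_THREADPOOL_SIZE) || 4;
console.log(`libuv thread pool size: ${poolSize}`);
```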

devongovett avatar Jan 15 '21 20:01 devongovett

Here is a new profile I ran on solid-simple-table with Parcel 2.0.0-nightly.535 and the #5642 patch applied (which replaces one of the realpathSync calls with an async version).

profile-20210115-142429.zip

Other than workers 0 and 1, the others do almost nothing: [screenshot]

This resolve preset is very suspicious. 3s to find a preset!

[screenshot]

Another thing that makes me nervous is patching require from JavaScript. The native implementation might be faster, and this patching might result in deoptimization. https://github.com/parcel-bundler/parcel/blob/49ac217507d5914a2e38ca402593d26527cdb673/packages/core/package-manager/src/NodePackageManager.js#L103

I'm curious if you've tried setting the UV_THREADPOOL_SIZE environment variable. This adjusts the libuv (node) thread pool size which is what handles disk IO. By default it's 4

Not much difference; I tried various values.

aminya avatar Jan 15 '21 20:01 aminya

This resolve preset is very suspicious. 3s to find a preset

That wasn't the case for me, neither macOS nor Windows.

mischnic avatar Jan 15 '21 21:01 mischnic

Another interesting result: using pnpm or yarn makes a difference in the build time. I think this is because pnpm uses symlinks instead of copying the files, and resolving the symlinks takes some time.

| pnpm   | yarn  |
| ------ | ----- |
| 10.5 s | 7.4 s |

Here is the profile for the yarn-bootstrapped package: profile-20210115-152126.zip

This is for pnpm: profile-20210115-142429.zip

aminya avatar Jan 15 '21 21:01 aminya

On Windows, we don't use Node's fs.realpath.native because it's apparently buggy there.

mischnic avatar Jan 15 '21 21:01 mischnic

This resolve preset is very suspicious. 3s to find a preset

That wasn't the case for me, neither macOS nor Windows.

Well, building CSSNano using Parcel generates a 3.7 MB JavaScript bundle. Imagine how long it takes for Node to find all these files and load them. Producing a single file reduces this I/O load.

See my PR: https://github.com/cssnano/cssnano/pull/985

aminya avatar Jan 15 '21 22:01 aminya

Have you tried doing require("cssnano") without Parcel in a JS script and timing the require call? To make sure this is due to Node on Windows and not caused by Parcel itself.

mischnic avatar Jan 15 '21 22:01 mischnic

Have you tried doing require("cssnano") without Parcel in a JS script and timing the require call? To make sure this is due to Node on Windows and not caused by Parcel itself.

Requiring cssnano itself is not an issue. This line is the slow part: https://github.com/cssnano/cssnano/blob/40b82dca7f53ac02cd4fe62846dec79b898ccb49/packages/cssnano/src/index.js#L34

Running node ./test_cssnano.js gives 716ms on my system.

//test_cssnano.js
let t = Date.now()

require('cssnano')
const p = require('cssnano-preset-default')({}).plugins

console.log(Date.now() - t)

aminya avatar Jan 15 '21 22:01 aminya

Using my cssnano PR here, the build time is reduced by 2 seconds!

The parcel PR: https://github.com/parcel-bundler/parcel/pull/5671

aminya avatar Jan 15 '21 23:01 aminya

@aminya have you looked into why disk access on your machine is so slow? I don't think anyone else has been able to reproduce the cssnano issue. Are you running off a network drive by chance?

devongovett avatar Jan 17 '21 14:01 devongovett

@aminya have you looked into why disk access on your machine is so slow? I don't think anyone else has been able to reproduce the cssnano issue. Are you running off a network drive by chance?

My drive is a fast SSD with high bandwidth. The issue is not my hardware.

[disk benchmark screenshot]

require is slow, and the more calls there are, the more it punishes the user. CSSNano requires many files, and Parcel's decision not to limit those require calls by bundling this huge library into a single file makes things worse. https://github.com/parcel-bundler/parcel/pull/5671#issuecomment-761820545

There is also the problem with not using realpath.native which makes some of the package managers like pnpm slower.

aminya avatar Jan 17 '21 21:01 aminya

Sure, I agree re require being slow. But there's still the question of why it takes 3s on your machine and milliseconds on other machines, so I'm trying to get to the bottom of it...

devongovett avatar Jan 17 '21 21:01 devongovett

@devongovett @mischnic

Is there any update on this? We observe high memory consumption too, in contrast to Parcel 1.

AndyOGo avatar Feb 14 '22 13:02 AndyOGo

@AndyOGo Could you share more details about your situation (and ideally a reproduction)? Do you also have many cores like some of the other commenters above? Does setting PARCEL_WORKERS=4 yarn parcel build help?

mischnic avatar Feb 14 '22 15:02 mischnic

Can confirm heavy memory usage. A basic 200-line TypeScript file (32K in file size according to du) that compiles in 2 seconds with tsc takes 15+ minutes and at least 4 GB RAM plus 8 GB swap. I've never once gotten Parcel to compile this simple file, as my computer all but crashes once all of its memory (swap plus hardware) is completely filled.

No special settings, config, or anything. Exactly the getting-started demo, but with different TypeScript (TypeScript that tsc handles just fine). (Running Fedora Linux, if that matters; I'm not using Windows.)

(It's also really hard to kill; it spawns like 20-30 Node processes. Note that my system is a meager 4-core Intel Core i5 mobile chip, not some sort of Threadripper. There's no reason to spawn that many workers for 200 lines of TypeScript on such low-grade hardware.)

Lazerbeak12345 avatar Feb 24 '22 18:02 Lazerbeak12345

@Lazerbeak12345 Can you share your project (or some version of that typescript file which still causes this problem)?

mischnic avatar Feb 28 '22 13:02 mischnic

Sure https://github.com/Lazerbeak12345/pixelmanipulator/tree/v5-alpha is the closest thing - but you're going to have to remove these files (and replace them with the appropriate matching files from the getting started guide):

  • package.json
  • yarn.lock
  • tsconfig.json
  • gulpfile.ts

Another note is that you might need to remove src/demo as recent changes to that branch have now included further typescript files that should not be included in the library build itself.

Sorry - I had actually planned on making a separate branch to recreate this easier, but I don't have much time on my hands these days.

Alternatively, as the typescript file is (currently) completely standalone, copying the content of src/lib/pixelmanipulator.ts into the demo project (with typescript adjustments) should recreate it just fine as well.

I suspect that this might not actually be easy to recreate without hardware as old as mine though (more-or-less factory Lenovo ThinkPad T420 but with a more-or-less factory Fedora Linux 35 WS).

Lazerbeak12345 avatar Mar 01 '22 00:03 Lazerbeak12345

Yeah, it works for me on macOS (also 4 core i5). But I'm also not sure if this is because I didn't modify your repo correctly or if it's actually caused by some hardware/OS difference.

mischnic avatar Mar 01 '22 08:03 mischnic

I'll make a branch with the specific changes made then and link it here. This might take some time.

Lazerbeak12345 avatar Mar 01 '22 18:03 Lazerbeak12345

I'm actively working on this right now, and here's an interesting finding: if I give yarn parcel build an entry point, this issue does not happen, but if I run it without one (with entry points provided in package.json), the issue still happens. I'll post further information later.

Lazerbeak12345 avatar Mar 17 '22 03:03 Lazerbeak12345

I'm guessing this is caused by a bad entry root/project root calculation then: https://github.com/parcel-bundler/parcel/blob/e294eafd9a49c056fb4b223c5ace5c0653428ede/packages/core/core/src/resolveOptions.js#L52-L53 (so the root is too high up in your FS, and Parcel then does some unnecessary work, causing the memory usage)

Which should get fixed by https://github.com/parcel-bundler/parcel/pull/7537

mischnic avatar Mar 17 '22 09:03 mischnic

Alright, I've figured it out. (I actually solved it half an hour after my last post, but was offline and couldn't post this.)

@mischnic Your hunch was correct.

Parcel seems to search for the "root" from / down to $(pwd), looking for a file that indicates what package manager to use when auto-installing things. The problem was that this file was present: ~/yarn.lock. For my purposes, I don't need a lockfile of any sort in my ~, so I removed it, and since then Parcel has worked for me.

I don't really like that that was the problem - I expected Parcel to search upward for the "root", as this is the behavior of Node when searching for a package in a node_modules folder (if node_modules isn't in the current folder, or the package is not present, try ../ until ../ is the same as ./).

Lazerbeak12345 avatar Mar 18 '22 01:03 Lazerbeak12345

It is also interesting to note, however, that while this is resolved for me - it still used a ton of threads and ran out of memory.

This implies that Parcel might fail on huge projects.

Running parcel build in such a large directory should still build (even if, as in my case, that wasn't the intended result). I suspect that each thread (or group of threads) is associated with one or more files. In the case of my light hardware, it'd be better to use a queue system and postpone files when the maximum threads have been reached.
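A queue like that could be as simple as this (a minimal sketch, not Parcel's scheduler):

```javascript
// Minimal concurrency limiter: run at most `limit` tasks at once and
// postpone the rest, instead of spawning work for every file up front.
async function runWithLimit(tasks, limit) {
  const results = new Array(tasks.length);
  let next = 0;
  async function drain() {
    while (next < tasks.length) {
      const i = next++;
      results[i] = await tasks[i]();
    }
  }
  const lanes = Math.max(1, Math.min(limit, tasks.length));
  await Promise.all(Array.from({ length: lanes }, drain));
  return results;
}
```

With limit = 4, a build with 1000 files keeps at most 4 in flight at any moment, which bounds peak memory on modest hardware.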

Also unexpected was that --log-level verbose didn't say what directory it believed the root to be. I learned that from an odd error message that I eventually triggered by accident. (This perhaps warrants its own issue, if one doesn't exist for it yet.)

Lazerbeak12345 avatar Mar 18 '22 01:03 Lazerbeak12345

The PARCEL_WORKERS=1 directive did wonders for me on my MacBook M1. I am currently running Parcel v1, and my machine would stutter to the point of being unusable whenever the serve process started bundling my changes.

jpcaparas avatar Jul 18 '22 22:07 jpcaparas