
Adding 10k rows tests in memory benchmarks

fabiospampinato opened this issue 2 years ago · 23 comments

This would probably increase the time it takes to run the full benchmark by a lot, but it'd be interesting to run some memory benchmarks with 10k rows too, in addition to the 1k-row ones. Maybe just one?

Having both data points should give an idea of how memory usage scales with the size of the app, which I think would be pretty interesting to see.

fabiospampinato avatar Jun 02 '22 22:06 fabiospampinato

Here's the first result: https://krausest.github.io/js-framework-benchmark/2022/table_chrome_103_mem10k.html

The spread is really big - pretty interesting. What do you think? Is there a memory benchmark that we could drop instead?

krausest avatar Jun 29 '22 05:06 krausest

That's super interesting, thanks!

Locally I've been benchmarking Voby against Solid, and there I see Voby using ~2MB less memory than Solid with 10k rows, so the result seems consistent with that, though the numbers I see locally are more like 20MB vs 22MB rather than 10MB vs 12MB. Maybe the size of the DOM nodes is not accounted for in the test?

Other than that it just looks pretty interesting to me; it tells me a very useful bit of information that I couldn't otherwise get from the rest of the tests.

Maybe if something had to be dropped I'd consider dropping the "replace 1k rows" memory test; it doesn't seem particularly useful, since at the end of the day there are still 1k rows on the page.

Personally, for my use case I don't care at all about the "ready memory" test: if the fixed overhead is anywhere between 0 and 5MB it just doesn't really matter, and it looks like the worst-performing framework adds only 1.6MB of overhead over the best-performing one, which I think is basically irrelevant in practice.

fabiospampinato avatar Jun 29 '22 09:06 fabiospampinato

Thanks - here's some information about how I calculate the memory values:

The benchmark performs a GC (HeapProfiler.collectGarbage) and then calls Performance.getMetrics, which reports, among other values: { name: 'JSHeapUsedSize', value: 9925796 } and { name: 'JSHeapTotalSize', value: 12550144 }.
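For illustration, here's a minimal sketch of that measurement path as raw CDP calls from puppeteer; the helper name and surrounding plumbing are assumptions, not the benchmark's actual code:

```ts
import type { Page } from "puppeteer";

// Hypothetical helper: GC, then read JSHeapUsedSize via CDP.
async function measureUsedHeap(page: Page): Promise<number> {
  const client = await page.target().createCDPSession();
  // Performance metrics must be enabled before getMetrics returns values.
  await client.send("Performance.enable");
  // Force a garbage collection so the reading reflects live objects only.
  await client.send("HeapProfiler.collectGarbage");
  const { metrics } = await client.send("Performance.getMetrics");
  const used = metrics.find((m) => m.name === "JSHeapUsedSize");
  return used ? used.value : NaN;
}
```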

When I create 10,000 rows by manual testing and look at the memory tab I see the following: [screenshot: Chrome memory tab]

The total heap size is pretty close to the value above. But I think we should use not the allocated memory (JSHeapTotalSize) but the used memory (JSHeapUsedSize) for the benchmark.

krausest avatar Jun 29 '22 20:06 krausest

When I create 10,000 rows by manual testing and look at the memory tab I see the following:

That's pretty interesting; I don't actually know what that number measures.

I see this:

[two screenshots: heap snapshot preview and heap snapshot size]

Like I see ~8MB in the preview, but then the heap weighs ~20MB, so definitely more than just ~8MB is being used. I think it would be better to switch to a more reliable measure.

fabiospampinato avatar Jun 29 '22 21:06 fabiospampinato

FWIW a bunch of people seemed to find this new test interesting: https://twitter.com/fabiospampinato/status/1542083521598169088

fabiospampinato avatar Jun 30 '22 08:06 fabiospampinato

it gives you a picture of how memory usage will scale in real applications beyond tiny things.

To understand how it will scale in real applications, you would need to add tests that increase the ratio of dynamic bindings per DOM element and decrease the ratio of DOM elements per composable (function or component, depending on the framework). To test reactive systems it is also important how they perform in one-to-many/many-to-one scenarios, with deep derived computation graphs, etc. Increasing the number of rows doesn't give any useful information, unless the only thing your "real applications" are doing is cloning large HTML chunks.

localvoid avatar Jun 30 '22 10:06 localvoid

Obviously this is a benchmark: the actual ratio of DOM nodes to reactive bits may be different in a real application, the number of those things may be different, other factors may dominate memory usage for an application, etc. And this is a fairly general benchmark, it wouldn't make sense for the non-reactive frameworks to introduce tests specifically measuring reactive things; the cellx benchmark is, I think, more appropriate for that.

Personally I find this test interesting because the numbers you get for 10k rows are not the 1k-row numbers multiplied by a fixed constant across frameworks; there's a pretty interesting spread there, and memory usage matters more when there are more things, so that's a more important measure.

fabiospampinato avatar Jun 30 '22 11:06 fabiospampinato

And this is a fairly general benchmark, it wouldn't make sense for the non-reactive frameworks to introduce tests specifically measuring reactive things

It does: one of the main reasons why React was "invented" is to avoid incremental algorithms when working with application state, so it can just diff the UI tree, whereas reactive libraries invalidate their computation graphs.

and memory usage matters more when there are more things, so that's a more important measure.

10k rows is ~80k DOM nodes; with such an insane number of DOM nodes it takes ~4ms (Chrome) on a powerful CPU just to perform hit tests when moving the mouse. It just inflates memory usage for libraries that don't treat memory usage per DOM element as their primary optimization goal.

I've worked on a reactive-only UI library and tried to optimize memory as much as possible, because in real applications memory usage will be completely different (such low numbers for reactive libraries in this benchmark are misleading):

[screenshot: memory comparison]

localvoid avatar Jun 30 '22 12:06 localvoid

It does: one of the main reasons why React was "invented" is to avoid incremental algorithms when working with application state, so it can just diff the UI tree, whereas reactive libraries invalidate their computation graphs.

What I'm saying is: suppose the cellx benchmark were integrated here, hypothetically, how would you even write an implementation for that with React? It just doesn't make sense; React does things with components, there are no standalone reactive primitives in there.

10k rows is ~80k DOM nodes; with such an insane number of DOM nodes it takes ~4ms (Chrome) on a powerful CPU just to perform hit tests when moving the mouse. It just inflates memory usage for libraries that don't treat memory usage per DOM element as their primary optimization goal.

Sure, it's a lot of DOM nodes, but here we only have tests with either 1k or 10k rows; maybe we should have 5k or 3k row tests too?

Regarding memory usage, I'm not sure the memory required for those nodes was actually taken into account by this test: according to heap snapshots it takes ~5MB just to keep the anchor nodes in memory, which is more than the number reported for the vanillajs implementation, and that doesn't make sense. It seems to be actually measuring the rest of the stuff, which is basically 10k effects and 20k event handlers; still potentially a lot, but possibly reasonable.

I've worked on a reactive-only UI library and tried to optimize memory as much as possible, because in real applications memory usage will be completely different (such low numbers for reactive libraries in this benchmark are misleading):

It'd be interesting to see what you get with 10k nodes. I think I've optimized mine (voby) approximately to the bone, and yours seems to be doing potentially a lot better than that, which I'm not sure is achievable with a very similar API.

At the end of the day I see React being slow and a memory hog in the benchmark compared to Solid, and in all likelihood that's what I'm going to see in my real application. Sure, depending on what I write and what I care about these numbers may not be all that relevant to me, but they do tell me something useful.

fabiospampinato avatar Jun 30 '22 12:06 fabiospampinato

What I'm saying is: suppose the cellx benchmark were integrated here, hypothetically, how would you even write an implementation for that with React?

Let's take for example a one-to-many scenario: it could be something like the results table in this benchmark, where we need to calculate a background color for each cell. When the min/max value changes, React is going to rerender the entire table, while a reactive library will invalidate each background binding.

It seems to be actually measuring the rest of the stuff, which is basically 10k effects and 20k event handlers; still potentially a lot, but possibly reasonable.

The problem is that libraries with template cloning can significantly reduce "the rest of the stuff" per DOM element when cloning large HTML chunks and attaching direct bindings (for the majority of libraries in this benchmark it is more like 80k effects). It is a great optimization for applications that don't use reusable components a lot, but if an application is built from small reusable components like Fluent UI, memory usage will be completely different.
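For context, a minimal sketch of the template-cloning technique being discussed; the markup, class names, and helper are invented for illustration:

```ts
// Parse the row markup once; every row is then a cheap native clone.
const rowTemplate = document.createElement("template");
rowTemplate.innerHTML =
  `<tr><td class="id"></td><td><a class="label"></a></td></tr>`;

function createRow(id: number, label: string): HTMLTableRowElement {
  // cloneNode(true) copies the whole subtree in one call instead of building
  // each element individually; direct bindings are attached afterwards.
  const row = rowTemplate.content.firstElementChild!
    .cloneNode(true) as HTMLTableRowElement;
  row.querySelector(".id")!.textContent = String(id);
  row.querySelector(".label")!.textContent = label;
  return row;
}
```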

which I'm not sure is achievable with a very similar API.

I am using completely different data structures and algorithms with stronger scheduling order.

localvoid avatar Jun 30 '22 13:06 localvoid

Let's take for example a one-to-many scenario: it could be something like the results table in this benchmark, where we need to calculate a background color for each cell. When the min/max value changes, React is going to rerender the entire table, while a reactive library will invalidate each background binding.

That seems somewhat similar to what this benchmark is measuring already (the selected row changes and all rows are diffed in React?), nothing like cellx's benchmark, which measures the reactivity bits specifically.

The problem is that libraries with template cloning can significantly reduce "the rest of the stuff" per DOM element when cloning large HTML chunks and attaching direct bindings (for the majority of libraries in this benchmark it is more like 80k effects). It is a great optimization for applications that don't use reusable components a lot, but if an application is built from small reusable components like Fluent UI, memory usage will be completely different.

What do you mean by reusable here? That's an optimization that can be applied, for example, completely at runtime to tagged template literals, or to anything else with a little helper function, without any particular restriction as far as I'm aware, if implemented properly. It seems to me that if frameworks don't implement this it's their loss, and they should have worse numbers in this benchmark. Solid's components are reusable and do this optimization, so I don't know what your point is.

I think this benchmark is a bit too heavy on measuring creation time though.

I am using completely different data structures and algorithms with stronger scheduling order.

It'd be interesting to take a look at that when you publish it 👍

fabiospampinato avatar Jun 30 '22 13:06 fabiospampinato

That seems somewhat similar to what this benchmark is measuring already (the selected row changes and all rows are diffed in React?)

A lot of reactive libraries use a workaround, createSelector(), for this specific use case to avoid one-to-many, but it is not a general-purpose solution. It just seems quite strange to me that a lot of reactive-library authors focus on the best-case scenario with this benchmark. When I started working on a reactive library, the first thing I did was implement the worst-case scenario, reimplementing top-down data diffing with a heavily nested computation graph, and I tried to beat the fastest top-down incremental libraries.
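For reference, a minimal sketch of the createSelector() workaround in Solid-flavored code (solid-js does ship this primitive; the row usage shown is an invented example):

```ts
import { createSignal, createSelector } from "solid-js";

const [selectedId, setSelectedId] = createSignal<number | null>(null);

// isSelected(id) subscribes each caller to the boolean "id === selectedId()"
// rather than to selectedId itself, so changing the selection notifies only
// the previously selected row and the newly selected one, not all 10k rows.
const isSelected = createSelector(selectedId);

// Invented row usage: <tr classList={{ danger: isSelected(row.id) }}>
```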

What do you mean by reusable here?

I mean that if your application is built from small reusable components (Fluent UI, MUI, etc.), you'll have ~1 DOM node per reusable function/component, and the template-cloning optimization will be completely useless.

localvoid avatar Jun 30 '22 14:06 localvoid

I mean that if your application is built from small reusable components (Fluent UI, MUI, etc.), you'll have ~1 DOM node per reusable function/component, and the template-cloning optimization will be completely useless.

"Completely useless" seems an exaggeration, at least if your framework can do that you have the option to take advantage of that, in case you need to. But in general it's true, the benefits of that will be less than what one can see in this benchmark, especially since this is fairly heavily biased toward measuring creation time, imo.

fabiospampinato avatar Jun 30 '22 14:06 fabiospampinato

"Completely useless" seems an exaggeration, at least if your framework can do that you have the option to take advantage of that, in case you need to.

Agreed. I meant that in the worst-case scenario of 1 node per component it is completely useless. There are a lot of applications that can be built without such component granularity, and that is why I also implemented this optimization. But when optimizing a reactive library, I would prefer it to work fast and consume as little memory as possible not just with 1 effect per 8 DOM nodes, but also with 1 effect per 1 DOM node.

localvoid avatar Jun 30 '22 14:06 localvoid

Back to @fabiospampinato and the problem with a "reliable measure". I'm really close to giving up on measuring memory. Here are the values I could use (I put a long sleep right after the benchmark and measured the other values in that browser instance):

  • puppeteer page.metrics() JSHeapUsedSize (memory result for voby and 26_run-10k-memory): 9.57651138305664 MB
  • CDP Performance.getMetrics JSHeapUsedSize (yeah, the same as page.metrics()): 10041700 bytes = 9.57651138305664 MB
  • window performance: 11.35 MB [screenshot]
  • Timeline trace: 9.3 MB [screenshot]
  • Memory tab: [screenshot]
  • Heap snapshot (takes quite long to complete): [screenshot]

I'm not in a position to call any of these values correct. Maybe it wouldn't matter much whether it's 9.3 or 9.5 MB as long as it's consistent, but 23.7 MB is just too different. Does someone have an opinion?

krausest avatar Jun 30 '22 20:06 krausest

They are like 3 totally different numbers 🤔 Maybe, as you say, it doesn't matter much as long as it's consistent. Of all these numbers I tend to trust the actual heap snapshot, because it's just super detailed; it seems more plausible to me that the other numbers aren't quite right, or aren't quite measuring the same thing. Maybe those other numbers measure something similar but without accounting for objects that live outside the JS world? (Like I guess DOM nodes are allocated elsewhere but exposed to JS?)

It might be useful if a garbage collection could be triggered manually. Other than that, I think I'd personally go with the number from the heap snapshot if it can be retrieved quickly, otherwise with the number you were using previously.

fabiospampinato avatar Jun 30 '22 20:06 fabiospampinato

Thanks. A heap snapshot is performed before all the measurements above (https://github.com/krausest/js-framework-benchmark/blob/master/webdriver-ts/src/forkedBenchmarkRunnerPuppeteer.ts#L59, a bit verbose for puppeteer, or https://github.com/krausest/js-framework-benchmark/blob/master/webdriver-ts/src/forkedBenchmarkRunnerPlaywright.ts#L207 for playwright). @paulirish it would be great if you could shed a little light on the difference between those metrics!

krausest avatar Jun 30 '22 20:06 krausest

yeahhhhhhhhh memory measurement is a weird world. i'm not an expert, but i know a little that can help.

as for APIs..

  1. cdp Performance.getMetrics - JSHeapUsedSize... (same as pptr page.metrics())
  2. cdp Runtime.getHeapUsage
    • looking at the chromium backend, both this and getMetrics use V8's HeapStatistics, so they should be similar.. however perhaps there's a difference in summing up 1 or all of the isolates. shrug.
  3. taking a heap snapshot and summing up the bytes?
    • seems terrible
  4. performance.memory - reports coarse quantized numbers, unless site isolation is on.
    • the quantizing is old and made the feature basically useless; the site-isolation upgrade is new. unfortunately the docs are not public yet, but the site isolation requirements are the same as for the next thing (and SAB, etc.).
  5. performance.measureUserAgentSpecificMemory()
    • this API was introduced to capture memory concerns beyond the JS heap (DOM, images, etc.)
    • the demo was broken so i fixed it:

[screenshot: the fixed measureUserAgentSpecificMemory demo]
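A minimal sketch of calling it from page code, assuming a Chromium browser and a cross-origin-isolated page (which the API requires):

```ts
// The API is not in TypeScript's DOM lib yet, so declare it ourselves.
declare global {
  interface Performance {
    measureUserAgentSpecificMemory?: () => Promise<{
      bytes: number;
      breakdown: unknown[];
    }>;
  }
}

async function measureTotalMemory(): Promise<number | null> {
  // Only available in Chromium, and only on cross-origin-isolated pages.
  if (!performance.measureUserAgentSpecificMemory) return null;
  const { bytes } = await performance.measureUserAgentSpecificMemory();
  return bytes; // JS heap plus DOM, images, etc.
}

export {};
```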

There's a decent chance this last method is better and more accurate than a CDP command. I wouldn't normally expect that, but Ulan (the eng behind it) is the v8 guy we'd always talk to when figuring out memory-reporting weirdness. (And the CDP memory stuff hasn't been touched for a while.)

The other thing i'd suggest..... try GC'ing like 7 times before you measure. I can't remember why... but years ago we had folks from the v8 team tell us you need to trigger GC like A LOT to actually have it GC. @brendankenny might remember some details. (okay, i'm seeing v8 tests that just call it twice.) Also.. there might be a difference between the window.gc() method and HeapProfiler.collectGarbage... looking briefly through v8, it appears so..

paulirish avatar Jul 01 '22 21:07 paulirish

The other thing i'd suggest..... try GC'ing like 7 times before you measure. I can't remember why... but years ago we had folks from the v8 team tell us you need to trigger GC like A LOT to actually have it GC.

Lol what? 😂 Good to know!

TL;DR: it's an absolute minefield.

fabiospampinato avatar Jul 01 '22 21:07 fabiospampinato

So, another run has finished (the whole run takes about 15 hours, almost too long): https://krausest.github.io/js-framework-benchmark/2022/table_chrome_103_mem10k_2.html

The following is now implemented:

  • Memory is now measured via (await performance.measureUserAgentSpecificMemory()).bytes
  • Before the memory measurement the GC runs 7x via window.gc(), as sketched below
  • All CPU benchmarks force a GC (as above, 7x window.gc()) before the final duration measurement starts
  • All CPU benchmarks except 'append rows' use a warmup phase. Both changes reduce the noise quite a lot. (If you use the compare link in the table you'll see that the statistical test yields better results due to the reduced variance.)
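A minimal sketch of that GC step, assuming Chrome is launched with --js-flags=--expose-gc so that window.gc() exists in the page (the helper name is invented):

```ts
import type { Page } from "puppeteer";

// Hypothetical helper: trigger GC several times, as suggested above.
async function forceGC(page: Page, rounds = 7): Promise<void> {
  for (let i = 0; i < rounds; i++) {
    // window.gc() only exists when V8 runs with --expose-gc.
    await page.evaluate(() => (window as unknown as { gc(): void }).gc());
  }
}
```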

krausest avatar Jul 03 '22 12:07 krausest

Almost great. performance.measureUserAgentSpecificMemory() doesn't work headless, neither for playwright nor for puppeteer: https://github.com/puppeteer/puppeteer/issues/8258

krausest avatar Jul 03 '22 15:07 krausest

@krausest there's a newish headless mode (that nobody really knows about)... but it generally includes all the things that normally aren't in the typical --headless invocation. It's --headless=chrome and AFAIK it works on linux and windows. I imagine this method might work in that case.

Also.. big disclaimer that I haven't spent any time with this feature but I know the guys who implemented it. They seem to think it's good. :) gl!

paulirish avatar Jul 06 '22 23:07 paulirish

@paulirish Great - seems like it even works on OSX! I could remove the special handling again! Thanks a lot.

krausest avatar Jul 07 '22 18:07 krausest

From Chrome 109 on, the option is called --headless=new 🤷
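A minimal sketch of opting into this mode from puppeteer, under the assumption that the flag can simply be passed through args (headless: false keeps puppeteer from adding the classic --headless flag itself):

```ts
import puppeteer from "puppeteer";

const browser = await puppeteer.launch({
  // Keep puppeteer from injecting the classic --headless flag...
  headless: false,
  // ...and request the new headless mode ourselves ("--headless=chrome"
  // before Chrome 109, "--headless=new" from 109 on).
  args: ["--headless=new"],
});
```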

krausest avatar Mar 05 '23 08:03 krausest