A full run takes too long

Open · krausest opened this issue 2 months ago • 17 comments

The full run for Chrome 141 took 14 hours just for the keyed results. Non-keyed is faster, but I'm already skipping non-keyed for odd Chrome versions. Due to the long duration I'm already delaying the run. The goal should be for the keyed run to complete in 8-10 hours so that it fits in a night.

The question is how to reduce the duration:

  • Retire old frameworks
  • Remove some benchmarks that add little value

krausest · Oct 07 '25

Have you thought of running the benchmark with Bun?

n-ce · Oct 11 '25

Here's the list of frameworks that are still being maintained:

  • React
  • Vue
  • Svelte
  • Angular
  • Solid
  • Preact
  • Inferno
  • Mithril
  • Leptos
  • Dioxus
  • uhtml
  • ripple
  • lit
  • Vanjs
  • Alpine
  • ivi
  • omi
  • ember
  • karyon
  • imba
  • qwik
  • blazor
  • stenciljs
  • michijs
  • targetjs

With regard to your point about removing low-value frameworks, I think a middle ground is to run them only once every two runs.

n-ce · Oct 11 '25

Have you thought of running the benchmark with Bun?

I'm pretty sure that most of the time is spent by puppeteer executing the tests in the browser, so I don't see how Bun could improve the run time.

krausest · Oct 12 '25

Here's a pair plot for the memory benchmarks. It seems like having both 22_run-memory and 23_update5-memory adds little value (they have a correlation of 0.9981931902570598):

[Image: pair plot of the memory benchmarks]

krausest · Oct 12 '25

Here's the pair plot for the CPU benchmarks:

[Image: pair plot of the CPU benchmarks]

The highest correlation here is between 01_run1k and 08_create1k-after1k_x2: 0.9589454583199726. Maybe that's also a candidate for removal.
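For reference, a minimal sketch of how such a pairwise correlation can be computed from the per-framework means; the data below is made up for illustration, not the runner's actual data model:

```ts
// Pearson correlation between two equally long series (e.g. the mean
// duration per framework for two benchmarks).
function pearson(xs: number[], ys: number[]): number {
  const n = xs.length;
  const mx = xs.reduce((a, b) => a + b, 0) / n;
  const my = ys.reduce((a, b) => a + b, 0) / n;
  let cov = 0, vx = 0, vy = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - mx, dy = ys[i] - my;
    cov += dx * dy;
    vx += dx * dx;
    vy += dy * dy;
  }
  return cov / Math.sqrt(vx * vy);
}

// Hypothetical usage: one mean duration per framework per benchmark.
const run1k = [42.1, 55.3, 61.0, 48.7];
const create1kAfter1k = [44.9, 58.2, 65.5, 50.1];
console.log(pearson(run1k, create1kAfter1k)); // close to 1 => redundant pair
```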

krausest · Oct 12 '25

Maybe just re-use the prior results from the deep-in-the-red frameworks unless they update or someone reaches out to indicate significant optimizations? I bet they burn 20%+ of the total runtime without having moved the needle in months or years.
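A rough sketch of one way such an "unchanged since last run" check could work; the hash store and directory layout are hypothetical:

```ts
import { createHash } from "node:crypto";
import { readFileSync } from "node:fs";

// Sketch: decide whether a framework needs a fresh run by hashing its
// package.json, a cheap proxy for "the implementation changed".
// The previous-hash store is assumed to be persisted between runs.
function needsRerun(
  frameworkDir: string,
  previousHashes: Map<string, string>,
): boolean {
  const manifest = readFileSync(`${frameworkDir}/package.json`);
  const hash = createHash("sha256").update(manifest).digest("hex");
  return previousHashes.get(frameworkDir) !== hash;
}
```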

leeoniya · Oct 12 '25

The correlation observation is insightful; with some degree of tolerance you could shave off all scenarios except 01_run1k, and maybe that is the intention, considering the weights recently assigned to each scenario. It's a kind of "optimization" like comparing how many plums go into the factory with how much jam comes out: if they correlate, why bother measuring the jam.

And what about the "mechanics" behind 01_run1k and 08_create1k-after1k_x2? While it creates rows like the others, 08_create1k-after1k_x2 is a completely different thing, as it implies running the reconcile algorithm (in proper frameworks). Comparing it to 01_run1k and 02_replace1k is not very useful, since the latter two just add rows to an empty container.

It would be more logical to get rid of redundant/useless scenarios. The first of these is 07_create10k; it was mentioned several times in different threads that it is an absurd thing, basically a magnification of 01_run1k multiplied by browser side effects. The second is 02_replace1k, which is basically 09_clear1k_x8 + 01_run1k. I guess that removing these two would cut the running time in half.

And regarding the controversial 'Retire old frameworks' option: it also raises doubts about the intent of this benchmark, because some full-fledged real frameworks like Knockout may end up in the trash to make space for some hot-rod/gelded/whatever-you-name-it ones conceived especially for this "competition". Maybe it's time to define a compliance level for participation, something like a proper "swap N random rows" instead of a lazy swap(a, b) or worse.
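A hedged sketch of what that compliance rule could look like; the helper below is hypothetical, not the benchmark's current swap action. A framework can special-case a fixed swap(a, b), but it can't special-case N random pairs without doing real reconciliation:

```ts
// Sketch: swap n randomly chosen row pairs instead of two fixed indices.
function swapRandomRows<T>(rows: T[], n: number): void {
  for (let k = 0; k < n; k++) {
    const i = Math.floor(Math.random() * rows.length);
    const j = Math.floor(Math.random() * rows.length);
    [rows[i], rows[j]] = [rows[j], rows[i]];
  }
}
```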

syduki · Oct 13 '25

  1. lui is also maintained.
  2. Why don't you split the benchmark up? Just add a split param to the bench command, so you run the first half in the first night and the second half in the second night. 😏

L3P3 · Oct 16 '25

Hello, I think the problem is that the current benchmark is mostly single-threaded and sequential; it barely loads the CPU, and the majority of the time is spent idling. So how about running multiple instances of Chromium at the same time and giving each instance something like two dedicated cores/threads (the number is an arbitrary example)? That would cut the total runtime substantially. There are possibly other, maybe better, ways of assigning CPU shares worth investigating, but you get the idea. We'd just need to verify that it doesn't skew the results much.

kanashimia · Oct 16 '25

So how about running multiple instances of Chromium at the same time

since this runs on a laptop, we might get into trouble with cpu throttling

leeoniya · Oct 16 '25

multiple instances of Chromium at the same time [...] just need to verify that it doesn't skew the results much

That's not really on topic; it will definitely skew the results. You don't even need to test it, just read the threads on how the benchmark run is prepared. "So it fits in a night" isn't without reason: @krausest has to disconnect almost all electrical appliances to avoid messing up the measurements. 😅 And even if you ignore the CPU throttling (laptop in the freezer? 😜), there's no true isolation between CPUs and processes; they will end up competing with each other. That would be more of a "best effort" test than a "best result" one.

syduki · Oct 17 '25

since this runs on a laptop, we might get into trouble with cpu throttling

That is true, though I'm not sure CPU utilisation would be high enough to cause throttling; a lot of time would still be spent idling, and throttling is preventable too.

has to disconnect almost all electrical appliances to avoid messing up the measurements. 😅

An audiophile kind of thing.

there's no true isolation between CPUs and processes; they will end up competing with each other. That would be more of a "best effort" test than a "best result" one.

People have been doing benchmarks like that using isolcpus / taskset for years on Linux without problems.

To be fair, I don't know if macOS has such facilities; I haven't worked with it. It probably doesn't, so my suggestion isn't applicable to the current setup, I guess.
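For the Linux case, a minimal sketch of what per-instance core pinning could look like; the binary name, flags, and port are illustrative, and this hasn't been tested against the actual runner:

```ts
import { spawn } from "node:child_process";

// Sketch: pin one Chromium instance to two dedicated cores with
// taskset (Linux only). A second instance would get e.g. "-c 2,3".
const chrome = spawn("taskset", [
  "-c", "0,1",                    // CPUs this instance may use
  "chromium", "--headless=new",
  "--remote-debugging-port=9222", // for the driver to connect to
]);
chrome.stderr.on("data", (d) => process.stderr.write(d));
```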

kanashimia · Oct 17 '25

People have been doing benchmarks like that using isolcpus / taskset for years on Linux without problems.

pretty dependent on what you're benchmarking, i'm sure. benchmarking ripgrep will have a very different noise/stddev profile than launching an extremely complex browser and js runtime with a JIT, a GC, and a million other dynamic things.

that being said, i don't necessarily have the same opinion about how stable the benchmarks need to stay across runs. +/- 2% is perfectly fine for anyone looking to choose a fast framework or avoid a slow one. the amount of effort being invested here into keeping the leaderboard stable is heroic but imo not necessary.

leeoniya · Oct 17 '25

As I said, just add a --partial 1/2 arg to the bench call and a --partial 2/2 the next evening. It's more reliable than other approaches, and the more benchmarks there are, the better.

L3P3 · Oct 17 '25

@L3P3 Most of what's needed for your idea is already in place. The outer benchmark loop goes over the benchmarks and the inner loop over the frameworks; this helps with fairness when system conditions vary between runs, since each benchmark sees similar conditions for all frameworks. Using a parameter like "--benchmark 01_" allows splitting. I didn't do that so far since I want to let the run go as long as possible without interrupting it: the runner prints the summary (which shows whether any errors occurred) only when it exits regularly. But I could take your idea and create a batch script which achieves the splitting and still produces the summary. I think I'll try that.
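A rough sketch of what such a batch script could look like (the benchmark IDs and the exact CLI shape are assumptions based on the "--benchmark 01_" example above):

```ts
import { execSync } from "node:child_process";

// Sketch: split the CPU benchmarks across two nights. Invoking the
// runner once per benchmark keeps the regular exit path, so the
// summary (including errors) is still printed for every batch.
const cpuBenchmarks = ["01_", "02_", "03_", "04_", "05_", "06_", "07_", "08_", "09_"];
const night = Number(process.argv[2]); // 1 or 2
const tonight = cpuBenchmarks.filter((_, i) => i % 2 === night - 1);

for (const b of tonight) {
  execSync(`npm run bench -- --benchmark ${b}`, { stdio: "inherit" });
}
```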

Furthermore, benchmark 23 will not be run in the future. Maybe I'll move the really slow frameworks into a new folder (slow_frameworks); "slow" would be a factor > 3.

krausest · Oct 19 '25

Chrome 142 results are published. Total duration was 12:54 (hh:mm) for the keyed implementations (non-keyed is already performed in a separate run). I skipped the memory benchmarks 23 (update rows) and 26 (create 10k rows). Maybe I'll run benchmark 26 later just to see how long it takes.

Duration for the Chrome 142 keyed results:

[Image: duration per benchmark for the Chrome 142 keyed run]

I was surprised to see that select row takes as long as create 10k rows. Maybe I can cut the warmup a bit... Update: Just to remind myself: we're performing 25 runs for select 1k and 15 for the other benchmarks. vanillajs still takes 96% of the time with just a single warmup select click. Not worth optimizing.

TODOs:

  • [ ] measure duration for the slowest implementations
  • [ ] maybe skip 07 (create 10k) and 02 (replace)

krausest · Nov 01 '25

One thing you could probably do is just not run benchmarks for frameworks that have note 772, other than the vanilla benchmark runs. I still don't really know why they're included in the benchmark, as they're not really benchmarking anything useful (excluding some WASM libraries, but they have no real alternative, I guess?).

I also feel like the multiple variants of a single framework with different state management libraries are overkill; reducing those to a few variants would also dramatically reduce overhead. The same goes for Svelte classic: just get rid of it, it doesn't add any real value.

trueadm · Nov 07 '25

I'm in the midst of running Chrome 143 and found the following pretty interesting:

[Image: gross duration of all CPU benchmarks per framework relative to vanillajs]

The chart shows how long all CPU benchmarks for a framework take relative to vanillajs, including all browser starts (the "gross" duration, in contrast to the "net" duration measured in the results). The "gross" ratios are much lower than expected: the "net" duration for blazor is >5x that of vanillajs, but its "gross" duration is only ~1.4x as long, and qwik takes even longer. This is due to starting the browser, opening the page, waiting for the page to be ready, and performing the warmup runs.
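To make the fixed-overhead effect concrete with illustrative numbers (not taken from the actual run): if browser start, page load, and warmup cost a roughly constant $T_o$ per framework and the measured net time is $T_f$, then

$$\text{gross ratio} = \frac{T_o + T_f}{T_o + T_{\text{vanilla}}}.$$

With, say, $T_o = 60\,\text{s}$, $T_{\text{vanilla}} = 15\,\text{s}$, and a framework at 5x net ($T_f = 75\,\text{s}$), the gross ratio is $135/75 = 1.8$, so a 5x net slowdown shows up as well under 2x gross.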

krausest · Dec 08 '25

Measurement context totally changes the meaning of the numbers, and your observation actually lines up with real life, even down at the subatomic level: "How fast is the boson moving once the collider is already running at full power?" ("net") vs "How long does it take to build the collider, power it on, stabilize the beams, AND then measure the boson speed?" ("gross") 🙂

syduki · Dec 08 '25

Yes, maybe I could have put it more simply: skipping the slowest frameworks helps less than expected because of the fixed per-framework overhead of running the benchmark.

krausest · Dec 09 '25