
Nightly benchmark regression on 2025.05.01

Open gf2121 opened this issue 6 months ago • 26 comments

Description

I'm seeing a big performance change (mostly a regression) in the 2025.05.01 benchmark run, without an annotation. There are many commits in the diff for this run; I have not managed to identify the culprit, but maybe someone has the context.

https://benchmarks.mikemccandless.com/2025.05.01.10.59.59.html

gf2121 avatar May 08 '25 15:05 gf2121

The JDK 24 upgrade is included in that diff -- that may be the most likely culprit? The other big changes are the MMapDirectory change and the first IOContext PR, but those are both interface-only changes that don't change behaviour.

thecoop avatar May 09 '25 12:05 thecoop

Yikes, I will revert back to JDK 23 temporarily and see if the regression goes away. That is a big drop for some queries, e.g. VectorSearch, though that query is quite fast to begin with (~500 -> ~400 QPS).

mikemccand avatar May 09 '25 20:05 mikemccand

Yikes, I will revert back to JDK 23 temporarily

Oh, hmmm, maybe not -- JDK 23 EOL'd.

mikemccand avatar May 09 '25 20:05 mikemccand

Oh, hmmm, maybe not -- JDK 23 EOL'd.

you can still download it the old fashioned way for a test: https://www.oracle.com/java/technologies/javase/jdk23-archive-downloads.html

rmuir avatar May 11 '25 20:05 rmuir

I tried to reproduce the slowdown locally by

  • comparing on the commit just before we started requiring Java 24, with the baseline on Java 23 and the contender on Java 24 (by passing javaCommand=... on the competitor in my localrun.py),
  • comparing the commit that started requiring Java 24 with current main, both on Java 24.

However, neither of these benchmarks reported a significant performance difference, despite my CPU (AMD Ryzen 9 3900X) being similar to the one in Mike's beast3 box.
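
For anyone who wants to reproduce this kind of A/B run, below is a rough sketch of what such a localrun.py might look like. Only the javaCommand=... knob comes from the comment above; the Competition API calls, constants, and JDK paths are illustrative assumptions, not verified against luceneutil.

# Sketch of a luceneutil-style localrun.py comparing JDK 23 vs JDK 24 on
# the same Lucene checkout. NOTE: only javaCommand=... is confirmed by this
# thread; the surrounding API names and paths are illustrative assumptions.
import competition

comp = competition.Competition()

# Hypothetical shared index so the only variable is the JVM.
index = comp.newIndex('lucene_baseline', competition.sourceData('wikimedium10m'))

# Baseline: JDK 23 (hypothetical install path).
comp.competitor('jdk23', 'lucene_baseline', index=index,
                javaCommand='/usr/lib/jvm/jdk-23/bin/java')

# Contender: same checkout, JDK 24.
comp.competitor('jdk24', 'lucene_baseline', index=index,
                javaCommand='/usr/lib/jvm/jdk-24/bin/java')

comp.benchmark('jdk23_vs_jdk24')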

jpountz avatar May 12 '25 11:05 jpountz

Thanks @jpountz -- I had done a full system update (pacman -Syyu) on 2025-05-01, and lots of packages were updated, including the kernel:

[2025-05-01T10:18:13-0400] [ALPM] upgraded linux (6.12.4.arch1-1 -> 6.14.4.arch1-2)

So maybe one of these other packages somehow caused the slowdown ... but first I'll try putting just Java 23 back and then we can see/iterate.
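
In the meantime, a quick way to enumerate everything that changed that day is to filter the pacman log. A minimal sketch, based on the /var/log/pacman.log line format shown above:

# List the packages upgraded on a given day from pacman's log. ALPM writes
# one "upgraded <pkg> (<old> -> <new>)" line per package, e.g.:
#   [2025-05-01T10:18:13-0400] [ALPM] upgraded linux (6.12.4.arch1-1 -> 6.14.4.arch1-2)
DAY = "2025-05-01"

with open("/var/log/pacman.log") as log:
    for line in log:
        if line.startswith(f"[{DAY}") and "] [ALPM] upgraded " in line:
            print(line.rstrip())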

I'm also not liking these unrelated hardware errors!! But they have been showing up for a long time now...

[864752.786334] [Hardware Error]: Unified Memory Controller Ext. Error Code: 0
[864752.786365] EDAC MC0: 1 CE on mc#0csrow#3channel#4 (csrow:3 channel:4 page:0x16f7d3 offset:0x340 grain:64 syndrome:0x1)
[864752.882910] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
[887362.459304] mce: [Hardware Error]: Machine check events logged
[887362.493034] [Hardware Error]: Corrected error, no action required.
[887362.527169] [Hardware Error]: CPU:2 (17:31:0) MC17_STATUS[-|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|Scrub]: 0x9c2041000000011b
[887362.593777] [Hardware Error]: Error Addr: 0x000000004bdf4940
[887362.627330] [Hardware Error]: IPID: 0x0000009600450f00, Syndrome: 0xa2e800010a800e03

Yay ECC RAM!

mikemccand avatar May 12 '25 23:05 mikemccand

That's a big kernel jump. But it is hard to reason about, or to declare a problem, because I'm still not sure what we are benchmarking here. See the vmstat graphs: https://benchmarks.mikemccandless.com/2025.05.01.10.59.59/index.html

From my perspective, the system is idling. Shouldn't we max out the hardware?

rmuir avatar May 13 '25 00:05 rmuir

Hmm, downgrading to Java 23 is not so simple ... I got it installed and cut benchy over to Java 23 in its N places, but then Lucene's main insists in at least two places that I'm running Java 24. Could I temporarily turn off this check? Or is there a real/subtle reason why Java 23 will not work anymore (besides that it is officially EOL'd)?

mikemccand avatar May 15 '25 19:05 mikemccand

I believe that this change from @uschindler, for instance, requires Java 24: https://github.com/apache/lucene/commit/10c31217a91563ce42a0ab26a677738c181a7b63

jpountz avatar May 15 '25 19:05 jpountz

Maybe you could compare performance with Java 23 on the commit before Lucene started requiring Java 24 (612f0da4a4ce3a133b40402a87ec5cf7eeb290cc). If performance is still bad, this would indicate that the regression is due to the kernel/packages upgrade rather than the Java 24 upgrade?

jpountz avatar May 15 '25 19:05 jpountz

Maybe you could compare performance with Java 23 on the commit before Lucene started requiring Java 24 (612f0da). If performance is still bad, this would indicate that the regression is due to the kernel/packages upgrade rather than the Java 24 upgrade?

+1 -- I will attempt this, on the nightly benchy box. Thanks @jpountz.

mikemccand avatar May 15 '25 22:05 mikemccand

ugh this one is difficult to debug. thank you guys for digging in.

rmuir avatar May 15 '25 22:05 rmuir

Hmm, downgrading to Java 23 is not so simple ... I got it installed and cut benchy over to Java 23 in its N places, but then Lucene's main insists in at least two places that I'm running Java 24. Could I temporarily turn off this check? Or is there a real/subtle reason why Java 23 will not work anymore (besides that it is officially EOL'd)?

You can't compile the expressions module with Java 23 anymore. The change @jpountz mentions would not prevent you from building.

The biggest problem is the vector code, which was already updated to Java 24.

uschindler avatar May 16 '25 09:05 uschindler

I'm also not liking these unrelated hardware errors!! But they have been showing up for a long time now...

This happens more often on modern hardware. You often get PCIe checksum errors, which can be corrected, when the NVMe is heavily used. According to the hardware builder this is not a reason to complain, because the kernel is working as expected (repeating the PCIe request or correcting the CRC error). They say that only if you get more than a few per minute is it a reason to change the board or other hardware.

The problem is more the kernel's verbose error logging. It should be reduced, because it alarms users about things which aren't important. As a physicist I can just say: the hardware has gotten so complex and the packing density so high that "crosstalk" on PCIe lanes is typical, and the kernel needs to be prepared to fix it if the controller can't. Nowadays you already get CRC errors on the bus when you take your trash out of the house and some metal parts fall down....

uschindler avatar May 16 '25 09:05 uschindler

Thanks @uschindler! I was able to get @jpountz's idea to work -- it ran in last night's run (2025-05-15) and it looks to me like Java 23 -> 24 was not responsible for the slowdown! VectorSearch is still slow on the last data point, and the same goes for CountAndHighHigh. This is on Lucene's main https://github.com/apache/lucene/commit/10c31217a91563ce42a0ab26a677738c181a7b63.

I think this is sort of good news -- Java upgrade isn't to blame.

But then it leaves the sinister question of what was to blame ... lemme get the full list of changed packages on that day.

mikemccand avatar May 16 '25 16:05 mikemccand

Prime suspect: https://gitlab.archlinux.org/archlinux/packaging/packages/linux/-/commit/027f29fb4104bac71151c47ce637fb18579a4a36

You may look at other changes to flags and such in the Arch package between 6.12 and 6.14, but I would bet a beer on that one.
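
One way to follow that suggestion is to diff the CONFIG_* flags between the two kernels' configs. A minimal sketch, assuming you have both config files on disk (the file names below are placeholders); kernel configs are flat files of CONFIG_FOO=y lines plus "# CONFIG_FOO is not set" comments:

# Diff CONFIG_* flags between two kernel config files (placeholder paths).
def load_flags(path):
    flags = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line.startswith("CONFIG_"):
                name, _, value = line.partition("=")
                flags[name] = value
            elif line.startswith("# CONFIG_") and line.endswith(" is not set"):
                flags[line[2:-len(" is not set")]] = "n"
    return flags

old = load_flags("config-6.12")  # placeholder: config of the old kernel
new = load_flags("config-6.14")  # placeholder: config of the new kernel
for name in sorted(old.keys() | new.keys()):
    if old.get(name) != new.get(name):
        print(f"{name}: {old.get(name, '-')} -> {new.get(name, '-')}")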

rmuir avatar May 16 '25 19:05 rmuir

OK thank you @jpountz and @rmuir! Downgrading the kernel to 6.12.4 indeed recovered our lost performance, so it looks like that change in HZ (how many time slices the Linux scheduler divides each second into) from 300 (Arch Linux's previous setting) -> 1000 was the cause of our slowdown.

It's surprising that it has such a big impact ... seems like we all might want to look into tuning HZ in our production clusters!

I added annotations (see IG, IH, and II), e.g. on the NumericRangeQuery (int) task.

I will put everything back (upgraded kernel, Lucene back on main, Java back on 24), so we accept and just annotate this performance impact.

mikemccand avatar May 18 '25 21:05 mikemccand

OK I've re-upgraded everything (Lucene is on tip of main, Java is back to 24, the Linux kernel is now even further upgraded, to 6.14.6) and kicked off a benchy run.

@rmuir explained a simple way to check the HZ on your production boxes: cat /proc/config.gz | gunzip | grep CONFIG_HZ. E.g. on beast3 (the nightly Lucene benchy box) it produces this:

beast3:util.nightly[main]$ cat /proc/config.gz | gunzip | grep CONFIG_HZ
# CONFIG_HZ_PERIODIC is not set
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000

Which means the Linux scheduler wakes up every millisecond to re-time-slice running processes... a good default for interactive systems, but maybe too much added overhead/cost for longer-running, compute-intensive stuff.

mikemccand avatar May 19 '25 13:05 mikemccand

Hi,

beast3:util.nightly[main]$ cat /proc/config.gz | gunzip | grep CONFIG_HZ
# CONFIG_HZ_PERIODIC is not set
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000

Which means the Linux scheduler wakes up every millisecond to re-time-slice running processes... a good default for interactive systems, but maybe too much added overhead/cost for longer-running, compute-intensive stuff.

I never understood why this cannot be configured at runtime but needs compile-time changes. Something to put into /etc/sysctl.d/ would be ideal. Maybe it can be done. Will check!

uschindler avatar May 19 '25 15:05 uschindler

Wah: https://stackoverflow.com/questions/13427608/how-to-change-kernel-timer-frequency

uschindler avatar May 19 '25 15:05 uschindler

I'm pretty sure it boils down to a compile-time macro.

For Arch Linux on Mike's machine with many cores, the easiest solution is to switch to the linux-lts kernel, "with configuration options targeting usage in servers": https://wiki.archlinux.org/title/Kernel

This one has HZ=300 set in its configuration, among other things.

rmuir avatar May 19 '25 15:05 rmuir

Hmm @yugushihuang (on my (Amazon product search) team) found another way to query the kernel (our Amazon Linux 2 boxes seem not to have /proc/config.gz):

getconf CLK_TCK

The problem is, on beast3 it gives 100:

beast3:util.nightly[main]$ getconf CLK_TCK
100

When the kernel's config says 1000:

beast3:util.nightly[main]$ zcat /proc/config.gz | grep HZ
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
# CONFIG_NO_HZ_IDLE is not set
CONFIG_NO_HZ_FULL=y
CONFIG_NO_HZ=y
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000
CONFIG_MACHZ_WDT=m

And I'm definitely on the new kernel:

beast3:util.nightly[main]$ uname -a
Linux beast3.mikemccandless.com 6.14.6-arch1-1 #1 SMP PREEMPT_DYNAMIC Fri, 09 May 2025 17:36:18 +0000 x86_64 GNU/Linux

Why is CLK_TCK different from CONFIG_HZ...? There is some confusing discussion here...

Hmm, it looks like CLK_TCK is deprecated and apps should instead use sysconf(_SC_CLK_TCK) -- maybe that gives the right answer (matching CONFIG_HZ at kernel compile time)?
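
For what it's worth, here is a quick check of both values from Python (a sketch; the /boot/config-<release> fallback is a common distro convention, not guaranteed everywhere). Note that on Linux sysconf(_SC_CLK_TCK) returns USER_HZ -- the fixed unit (typically 100) used for tick counts in /proc -- so it reports the same 100 as getconf and is expected to differ from the kernel's compile-time CONFIG_HZ:

# Compare the userspace tick unit (USER_HZ) with the kernel's CONFIG_HZ.
import gzip
import os

# What `getconf CLK_TCK` reports: USER_HZ, typically 100 on Linux.
print("USER_HZ:", os.sysconf("SC_CLK_TCK"))

# CONFIG_HZ: read /proc/config.gz if present, else try the /boot/config-<release>
# file many distros install (a convention, not guaranteed on every box).
for path in ["/proc/config.gz", f"/boot/config-{os.uname().release}"]:
    if os.path.exists(path):
        opener = gzip.open if path.endswith(".gz") else open
        with opener(path, "rt") as f:
            for line in f:
                if line.startswith("CONFIG_HZ="):
                    print("CONFIG_HZ:", line.strip().split("=", 1)[1])
        break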

mikemccand avatar May 22 '25 11:05 mikemccand

It looks like this CONFIG_HZ change also impacted KNN indexing throughput: https://benchmarks.mikemccandless.com/knnResults.html

mikemccand avatar May 26 '25 13:05 mikemccand

It looks like the nightly benchmarks have only run every 2 days since May 13th, vs. every day before that. Is this because it now takes longer to run the benchmark?

jpountz avatar May 27 '25 19:05 jpountz

It looks like the nightly benchmarks have only run every 2 days since May 13th, vs. every day before that. Is this because it now takes longer to run the benchmark?

Oh sorry, I think it's fixed now -- it was a silly reason -- I had run an ad-hoc benchmark, which ran past the normal (6 PM EST) benchy kickoff, which was therefore skipped (I think?), and I kept noticing this and kicking off another ad-hoc run (apparently once every two days!). But I finally got it to kick off on its normal schedule, so it's been running daily for a while (though, hmm, now it's broken (differently!) ... I'll fix).

mikemccand avatar Jun 05 '25 17:06 mikemccand

Thank you @mikemccand !

jpountz avatar Jun 05 '25 20:06 jpountz