Align vectors on disk for optimal performance on ARM CPUs
Description
Today, float vectors are aligned to 4 bytes in a Lucene index, but with Panama we can operate on up to 512 bits (i.e. 64 bytes, or 16 floats) at a time.
I wonder if we should change this alignment to 64 bytes, in order to get optimal vector search performance with Panama?
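For reference, here is a minimal sketch of the padding this would imply when writing the vector data file -- the helper below is a hypothetical illustration, not the actual Lucene codec code:

```java
public class AlignmentPadding {
  // Hypothetical helper: number of filler bytes needed so the next vector
  // starts on a byteAlignment boundary (e.g. 64 bytes).
  static long alignmentPadding(long filePointer, int byteAlignment) {
    long remainder = filePointer % byteAlignment;
    return remainder == 0 ? 0 : byteAlignment - remainder;
  }

  public static void main(String[] args) {
    // e.g. at file offset 4100, write 60 filler bytes so the next float
    // vector begins at offset 4160 (a multiple of 64)
    System.out.println(alignmentPadding(4100, 64)); // prints 60
  }
}
```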
I wrote a small JMH benchmark to "pad" float vectors on disk with some padBytes:
Benchmark (padBytes) (size) Mode Cnt Score Error Units
VectorScorerBenchmark.floatDotProductMemSeg 0 256 thrpt 15 32.848 ± 0.098 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 1 256 thrpt 15 24.158 ± 0.410 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 2 256 thrpt 15 24.055 ± 0.291 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 4 256 thrpt 15 26.381 ± 0.078 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 6 256 thrpt 15 23.772 ± 0.785 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 8 256 thrpt 15 26.666 ± 0.039 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 16 256 thrpt 15 26.753 ± 0.112 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 20 256 thrpt 15 26.489 ± 0.229 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 32 256 thrpt 15 32.805 ± 0.106 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 50 256 thrpt 15 24.651 ± 0.556 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 64 256 thrpt 15 32.762 ± 0.376 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 100 256 thrpt 15 25.888 ± 0.069 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 128 256 thrpt 15 32.874 ± 0.065 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 255 256 thrpt 15 24.906 ± 0.120 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 256 256 thrpt 15 32.780 ± 0.091 ops/us
My machine uses the 256-bit variant of Panama to score vectors, so I saw optimal performance when floats are aligned to 32 bytes -- but I'm keeping the alignment at 64 bytes here as the max case.
cc @mikemccand who found this^ byte-misalignment possibility offline!
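(To check which width Panama picks on a given machine, a quick standalone sketch -- assuming the incubating vector module is enabled with --add-modules jdk.incubator.vector:)

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorSpecies;

public class SpeciesCheck {
  public static void main(String[] args) {
    // e.g. 256 bits / 8 float lanes on the Graviton3 box above
    VectorSpecies<Float> species = FloatVector.SPECIES_PREFERRED;
    System.out.println("preferred float species: " + species
        + " (" + species.vectorBitSize() + " bits, " + species.length() + " lanes)");
  }
}
```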
Also noting that for byte vectors, I saw no impact of padding:
Benchmark (padBytes) (size) Mode Cnt Score Error Units
VectorScorerBenchmark.binaryDotProductMemSeg 0 256 thrpt 15 20.453 ± 0.171 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 1 256 thrpt 15 20.651 ± 0.177 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 2 256 thrpt 15 20.601 ± 0.150 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 4 256 thrpt 15 20.602 ± 0.163 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 6 256 thrpt 15 20.677 ± 0.252 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 8 256 thrpt 15 20.395 ± 0.154 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 16 256 thrpt 15 20.368 ± 0.122 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 20 256 thrpt 15 20.364 ± 0.089 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 32 256 thrpt 15 20.337 ± 0.041 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 50 256 thrpt 15 20.617 ± 0.139 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 64 256 thrpt 15 20.557 ± 0.272 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 100 256 thrpt 15 20.770 ± 0.233 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 128 256 thrpt 15 20.487 ± 0.151 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 255 256 thrpt 15 20.419 ± 0.118 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 256 256 thrpt 15 20.644 ± 0.390 ops/us
...so I'm not changing the byte-vector alignment in this PR.
Wow, alignment still matters, and it matters a lot (24 -> 33 ops/us)! Thank you @kaivalnp for testing. Was this an aarch64 CPU (Graviton 3 or 4?).
It's frustrating how the CPU just silently runs slower ... but what else could it do.
I wonder whether modern x86-64 (Intel, AMD) CPUs also show this effect. I'll test this PR on the nightly Lucene benchy box (beast3).
Was this an aarch64 CPU (Graviton 3 or 4?)
Yes, it was a Graviton3 (m7g) CPU. lscpu says:
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Thread(s) per core: 1
Core(s) per socket: 64
Socket(s): 1
NUMA node(s): 1
Vendor ID: ARM
Model: 1
Stepping: r1p1
BogoMIPS: 2100.00
L1d cache: 64K
L1i cache: 64K
L2 cache: 1024K
L3 cache: 32768K
NUMA node0 CPU(s): 0-63
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm paca pacg dcpodp svei8mm svebf16 i8mm bf16 dgh rng
I tested on beast3 (nightly benchmarking box) -- a Ryzen Threadripper 3990X:
processor : 127
vendor_id : AuthenticAMD
cpu family : 23
model : 49
model name : AMD Ryzen Threadripper 3990X 64-Core Processor
stepping : 0
microcode : 0x830107c
cpu MHz : 2900.000
cache size : 512 KB
physical id : 0
siblings : 128
core id : 63
cpu cores : 64
apicid : 127
initial apicid : 127
fpu : yes
fpu_exception : yes
cpuid level : 16
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sev sev_es
bugs : sysret_ss_attrs spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret spectre_v2_user
bogomips : 5788.93
TLB size : 3072 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 43 bits physical, 48 bits virtual
power management: ts ttp tm hwpstate cpb eff_freq_ro [13] [14]
I applied this PR, built (./gradlew :lucene:benchmark-jmh:assemble), and ran java --module-path lucene/benchmark-jmh/build/benchmarks --module org.apache.lucene.benchmark.jmh VectorScorerBenchmark -p size=256, and got:
Benchmark (padBytes) (size) Mode Cnt Score Error Units
VectorScorerBenchmark.binaryDotProductDefault 0 256 thrpt 15 8.601 ± 0.047 ops/us
VectorScorerBenchmark.binaryDotProductDefault 1 256 thrpt 15 8.573 ± 0.058 ops/us
VectorScorerBenchmark.binaryDotProductDefault 2 256 thrpt 15 8.593 ± 0.023 ops/us
VectorScorerBenchmark.binaryDotProductDefault 4 256 thrpt 15 8.579 ± 0.026 ops/us
VectorScorerBenchmark.binaryDotProductDefault 6 256 thrpt 15 8.584 ± 0.026 ops/us
VectorScorerBenchmark.binaryDotProductDefault 8 256 thrpt 15 8.605 ± 0.019 ops/us
VectorScorerBenchmark.binaryDotProductDefault 16 256 thrpt 15 8.603 ± 0.034 ops/us
VectorScorerBenchmark.binaryDotProductDefault 20 256 thrpt 15 8.583 ± 0.031 ops/us
VectorScorerBenchmark.binaryDotProductDefault 32 256 thrpt 15 8.581 ± 0.030 ops/us
VectorScorerBenchmark.binaryDotProductDefault 50 256 thrpt 15 8.591 ± 0.052 ops/us
VectorScorerBenchmark.binaryDotProductDefault 64 256 thrpt 15 8.611 ± 0.033 ops/us
VectorScorerBenchmark.binaryDotProductDefault 100 256 thrpt 15 8.594 ± 0.051 ops/us
VectorScorerBenchmark.binaryDotProductDefault 128 256 thrpt 15 8.620 ± 0.032 ops/us
VectorScorerBenchmark.binaryDotProductDefault 255 256 thrpt 15 8.597 ± 0.026 ops/us
VectorScorerBenchmark.binaryDotProductDefault 256 256 thrpt 15 8.605 ± 0.056 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 0 256 thrpt 15 25.203 ± 1.850 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 1 256 thrpt 15 25.961 ± 0.047 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 2 256 thrpt 15 25.314 ± 1.959 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 4 256 thrpt 15 25.958 ± 0.067 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 6 256 thrpt 15 25.295 ± 1.977 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 8 256 thrpt 15 26.122 ± 0.073 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 16 256 thrpt 15 26.056 ± 0.184 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 20 256 thrpt 15 25.848 ± 1.589 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 32 256 thrpt 15 25.817 ± 0.417 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 50 256 thrpt 15 26.065 ± 0.585 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 64 256 thrpt 15 26.045 ± 0.162 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 100 256 thrpt 15 26.093 ± 0.061 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 128 256 thrpt 15 26.101 ± 0.090 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 255 256 thrpt 15 26.028 ± 0.088 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 256 256 thrpt 15 26.027 ± 0.301 ops/us
VectorScorerBenchmark.floatDotProductDefault 0 256 thrpt 15 15.241 ± 0.010 ops/us
VectorScorerBenchmark.floatDotProductDefault 1 256 thrpt 15 15.169 ± 0.232 ops/us
VectorScorerBenchmark.floatDotProductDefault 2 256 thrpt 15 15.230 ± 0.082 ops/us
VectorScorerBenchmark.floatDotProductDefault 4 256 thrpt 15 15.231 ± 0.034 ops/us
VectorScorerBenchmark.floatDotProductDefault 6 256 thrpt 15 15.229 ± 0.048 ops/us
VectorScorerBenchmark.floatDotProductDefault 8 256 thrpt 15 15.216 ± 0.091 ops/us
VectorScorerBenchmark.floatDotProductDefault 16 256 thrpt 15 15.278 ± 0.048 ops/us
VectorScorerBenchmark.floatDotProductDefault 20 256 thrpt 15 15.058 ± 0.711 ops/us
VectorScorerBenchmark.floatDotProductDefault 32 256 thrpt 15 15.192 ± 0.100 ops/us
VectorScorerBenchmark.floatDotProductDefault 50 256 thrpt 15 15.300 ± 0.047 ops/us
VectorScorerBenchmark.floatDotProductDefault 64 256 thrpt 15 15.257 ± 0.083 ops/us
VectorScorerBenchmark.floatDotProductDefault 100 256 thrpt 15 15.272 ± 0.038 ops/us
VectorScorerBenchmark.floatDotProductDefault 128 256 thrpt 15 15.144 ± 0.529 ops/us
VectorScorerBenchmark.floatDotProductDefault 255 256 thrpt 15 15.248 ± 0.024 ops/us
VectorScorerBenchmark.floatDotProductDefault 256 256 thrpt 15 15.276 ± 0.039 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 0 256 thrpt 15 20.360 ± 0.077 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 1 256 thrpt 15 20.252 ± 0.177 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 2 256 thrpt 15 20.281 ± 0.060 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 4 256 thrpt 15 20.261 ± 0.048 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 6 256 thrpt 15 20.285 ± 0.063 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 8 256 thrpt 15 20.359 ± 0.072 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 16 256 thrpt 15 20.344 ± 0.078 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 20 256 thrpt 15 20.272 ± 0.090 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 32 256 thrpt 15 20.413 ± 0.010 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 50 256 thrpt 15 20.066 ± 0.051 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 64 256 thrpt 15 20.386 ± 0.051 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 100 256 thrpt 15 20.029 ± 0.095 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 128 256 thrpt 15 20.348 ± 0.049 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 255 256 thrpt 15 20.047 ± 0.101 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 256 256 thrpt 15 20.335 ± 0.037 ops/us
Net/net it seems like alignment of the memory-mapped vectors (in RAM / virtual address space) doesn't matter on this CPU?
I also tested a newer CPU (Raptor Lake) -- I'll post that shortly.
The Raptor Lake box is an i9-13900K:
processor : 31
vendor_id : GenuineIntel
cpu family : 6
model : 183
model name : 13th Gen Intel(R) Core(TM) i9-13900K
stepping : 1
microcode : 0x12f
cpu MHz : 800.000
cache size : 36864 KB
physical id : 0
siblings : 32
core id : 47
cpu cores : 24
apicid : 94
initial apicid : 94
fpu : yes
fpu_exception : yes
cpuid level : 32
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves split_lock_detect user_shstk avx_vnni dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req hfi vnmi umip pku ospke waitpkg gfni vaes vpclmulqdq rdpid movdiri movdir64b fsrm md_clear serialize pconfig arch_lbr ibt flush_l1d arch_capabilities
vmx flags : vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid ple shadow_vmcs ept_violation_ve ept_mode_based_exec tsc_scaling usr_wait_pause
bugs : spectre_v1 spectre_v2 spec_store_bypass swapgs eibrs_pbrsb rfds bhi spectre_v2_user
bogomips : 5990.40
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
Results:
Benchmark (padBytes) (size) Mode Cnt Score Error Units
VectorScorerBenchmark.binaryDotProductDefault 0 256 thrpt 15 14.037 ± 0.061 ops/us
VectorScorerBenchmark.binaryDotProductDefault 1 256 thrpt 15 14.046 ± 0.071 ops/us
VectorScorerBenchmark.binaryDotProductDefault 2 256 thrpt 15 14.139 ± 0.089 ops/us
VectorScorerBenchmark.binaryDotProductDefault 4 256 thrpt 15 14.069 ± 0.040 ops/us
VectorScorerBenchmark.binaryDotProductDefault 6 256 thrpt 15 14.038 ± 0.072 ops/us
VectorScorerBenchmark.binaryDotProductDefault 8 256 thrpt 15 14.094 ± 0.070 ops/us
VectorScorerBenchmark.binaryDotProductDefault 16 256 thrpt 15 14.073 ± 0.059 ops/us
VectorScorerBenchmark.binaryDotProductDefault 20 256 thrpt 15 14.134 ± 0.075 ops/us
VectorScorerBenchmark.binaryDotProductDefault 32 256 thrpt 15 14.016 ± 0.044 ops/us
VectorScorerBenchmark.binaryDotProductDefault 50 256 thrpt 15 14.031 ± 0.046 ops/us
VectorScorerBenchmark.binaryDotProductDefault 64 256 thrpt 15 14.082 ± 0.068 ops/us
VectorScorerBenchmark.binaryDotProductDefault 100 256 thrpt 15 14.013 ± 0.059 ops/us
VectorScorerBenchmark.binaryDotProductDefault 128 256 thrpt 15 14.079 ± 0.069 ops/us
VectorScorerBenchmark.binaryDotProductDefault 255 256 thrpt 15 14.143 ± 0.074 ops/us
VectorScorerBenchmark.binaryDotProductDefault 256 256 thrpt 15 14.026 ± 0.028 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 0 256 thrpt 15 49.305 ± 0.244 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 1 256 thrpt 15 48.572 ± 0.030 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 2 256 thrpt 15 48.508 ± 0.198 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 4 256 thrpt 15 48.636 ± 0.094 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 6 256 thrpt 15 48.536 ± 0.185 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 8 256 thrpt 15 49.346 ± 0.166 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 16 256 thrpt 15 49.419 ± 0.102 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 20 256 thrpt 15 49.224 ± 0.396 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 32 256 thrpt 15 49.423 ± 0.134 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 50 256 thrpt 15 48.676 ± 0.167 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 64 256 thrpt 15 49.060 ± 0.866 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 100 256 thrpt 15 49.181 ± 0.210 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 128 256 thrpt 15 49.444 ± 0.082 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 255 256 thrpt 15 48.362 ± 0.163 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 256 256 thrpt 15 48.169 ± 5.291 ops/us
VectorScorerBenchmark.floatDotProductDefault 0 256 thrpt 15 23.215 ± 0.023 ops/us
VectorScorerBenchmark.floatDotProductDefault 1 256 thrpt 15 23.207 ± 0.067 ops/us
VectorScorerBenchmark.floatDotProductDefault 2 256 thrpt 15 23.181 ± 0.086 ops/us
VectorScorerBenchmark.floatDotProductDefault 4 256 thrpt 15 23.156 ± 0.290 ops/us
VectorScorerBenchmark.floatDotProductDefault 6 256 thrpt 15 23.232 ± 0.012 ops/us
VectorScorerBenchmark.floatDotProductDefault 8 256 thrpt 15 23.215 ± 0.091 ops/us
VectorScorerBenchmark.floatDotProductDefault 16 256 thrpt 15 23.194 ± 0.071 ops/us
VectorScorerBenchmark.floatDotProductDefault 20 256 thrpt 15 23.202 ± 0.083 ops/us
VectorScorerBenchmark.floatDotProductDefault 32 256 thrpt 15 23.207 ± 0.048 ops/us
VectorScorerBenchmark.floatDotProductDefault 50 256 thrpt 15 23.227 ± 0.031 ops/us
VectorScorerBenchmark.floatDotProductDefault 64 256 thrpt 15 23.187 ± 0.095 ops/us
VectorScorerBenchmark.floatDotProductDefault 100 256 thrpt 15 23.246 ± 0.114 ops/us
VectorScorerBenchmark.floatDotProductDefault 128 256 thrpt 15 23.214 ± 0.077 ops/us
VectorScorerBenchmark.floatDotProductDefault 255 256 thrpt 15 23.212 ± 0.035 ops/us
VectorScorerBenchmark.floatDotProductDefault 256 256 thrpt 15 23.239 ± 0.117 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 0 256 thrpt 15 53.514 ± 5.159 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 1 256 thrpt 15 49.594 ± 3.885 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 2 256 thrpt 15 50.504 ± 0.122 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 4 256 thrpt 15 51.385 ± 4.406 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 6 256 thrpt 15 50.497 ± 0.146 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 8 256 thrpt 15 52.327 ± 0.292 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 16 256 thrpt 15 51.401 ± 4.426 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 20 256 thrpt 15 52.373 ± 0.307 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 32 256 thrpt 15 54.779 ± 0.078 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 50 256 thrpt 15 49.447 ± 1.502 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 64 256 thrpt 15 54.788 ± 0.060 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 100 256 thrpt 15 51.600 ± 0.352 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 128 256 thrpt 15 54.650 ± 0.377 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 255 256 thrpt 15 50.042 ± 0.167 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 256 256 thrpt 15 54.583 ± 0.399 ops/us
There might be some small misalignment penalty for float SIMD?
Thanks @mikemccand, there doesn't seem to be any misalignment penalty on beast3 (the nightly benchmarking box, a Ryzen Threadripper 3990X). There's definitely some impact of alignment on the Raptor Lake box (i9-13900K), but it is smaller than on my machine (<10%) -- so this alignment issue mostly affects Graviton, or ARM CPUs in general, as @rmuir shared?
I tried running knnPerfTest.py on Cohere vectors (768d) with DOT_PRODUCT similarity:
main (4-byte-alignment)
recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized visited index(s) index_docs/s force_merge(s) num_segments index_size(MB) vec_disk(MB) vec_RAM(MB) indexType
0.890 1.917 1.908 0.995 1000000 100 50 32 250 no 5388 67.66 14780.22 130.41 1 3014.60 2929.688 2929.688 HNSW
This PR (64-byte-alignment)
recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized visited index(s) index_docs/s force_merge(s) num_segments index_size(MB) vec_disk(MB) vec_RAM(MB) indexType
0.891 1.845 1.836 0.995 1000000 100 50 32 250 no 5403 62.48 16004.61 130.03 1 3014.90 2929.688 2929.688 HNSW
Indexing sped up by ~7.6% (index(s) 67.66 → 62.48), and search sped up by ~3.8% (latency 1.917 ms → 1.845 ms).
I see another action item from this benchmark: I wasn't aligning the output inside this merge function, which is used by HNSW-based vector formats for merging (see that index(s) improved in my benchmark, but not force_merge(s) -- which should speed up after this additional change?)
Just as a general comment around this performance: the 256-bit SVE vectors available on these processors have unfortunately not had a lot of love from our side.
Lots of digging into the 128-bit NEON on the Mac and the various AVX on Intel, but not much yet on SVE.
I also think they are in early stages on the Java side: I see a lot of mailing-list traffic about improvements to these vectors in OpenJDK, especially changes that might impact the integer side (e.g. compress). It might even be worth trying to build a snapshot of the JDK.
Thanks @rmuir. How does the Panama Vector API handle alignment? Does it have methods to allocate aligned on-heap or off-heap vectors? Hmm, it looks like SegmentAllocator has an allocate method that takes a byteAlignment, so it is possible in pure Java.
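A minimal sketch of that, assuming Java 21+ (java.lang.foreign); Arena implements SegmentAllocator, so the two-argument allocate with a byteAlignment is available there too:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class AlignedAlloc {
  public static void main(String[] args) {
    // Allocate an off-heap float vector whose start address is 64-byte aligned.
    try (Arena arena = Arena.ofConfined()) {
      int dims = 768;
      MemorySegment vec = arena.allocate((long) dims * Float.BYTES, 64);
      System.out.println("aligned? " + (vec.address() % 64 == 0)); // prints true
      vec.setAtIndex(ValueLayout.JAVA_FLOAT, 0, 1.0f); // write the first float
    }
  }
}
```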
I see another action item from this benchmark: I wasn't aligning the output inside this merge function, which is used by HNSW-based vector formats for merging (see that index(s) improved in my benchmark, but not force_merge(s) -- which should speed up after this additional change?)
Oh good catch! I wonder what other places might write the flat vectors?
Is the alignment also (or maybe less) important for the quantized cases? (Your results above are for float32 vectors?)
Maybe at least luceneutil could somehow warn if vectors are unaligned during its perf testing?
I wasn't aligning the output inside this merge function
Hmm, this did not help for some reason (merge time increased)...
main (4-byte-alignment)
recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized visited index(s) index_docs/s force_merge(s) num_segments index_size(MB) vec_disk(MB) vec_RAM(MB) indexType
0.891 1.895 1.887 0.996 1000000 100 50 32 250 no 5408 72.35 13822.46 101.30 1 3014.54 2929.688 2929.688 HNSW
This PR (64-byte-alignment)
recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized visited index(s) index_docs/s force_merge(s) num_segments index_size(MB) vec_disk(MB) vec_RAM(MB) indexType
0.893 1.868 1.860 0.996 1000000 100 50 32 250 no 5426 69.55 14378.97 127.48 1 3015.40 2929.688 2929.688 HNSW
I'll still add a commit + revert, so people can see what I tried, and comment if I'm missing something!
Is the alignment also (or maybe less) important for the quantized cases?
I think alignment is less important for quantized vectors (which are stored as byte vectors on disk), because none of the JMH benchmarks show non-trivial variation with padding (see VectorScorerBenchmark.binaryDotProductMemSeg)?
Your results above are for float32 vectors?
Yeah, those benchmarks^ are for float vectors
Maybe at least luceneutil could somehow warn if vectors are unaligned during its perf testing?
I added some print statements to complain when addresses were not 64-byte aligned (i.e. MemorySegment#address % 64 != 0) -- and it complained only in the baseline benchmarks...
Not committing because it may not be needed after this PR?
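(For reference, the kind of check described above -- a hypothetical sketch, not the actual patch to luceneutil or Lucene:)

```java
import java.lang.foreign.MemorySegment;

final class AlignmentCheck {
  // Warn whenever a vector's backing memory segment is not 64-byte aligned.
  static void warnIfUnaligned(MemorySegment seg, String context) {
    if (seg.address() % 64 != 0) {
      System.err.println("WARNING: " + context
          + " not 64-byte aligned (address=" + seg.address() + ")");
    }
  }
}
```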
Sorry for the delay here. I ran benchmarks a few more times offline, and the differences in index(s) and force_merge(s) seem to be noise (they take about the same time on main vs. this PR, on average).
This is because:
- During indexing, the HNSW graph is built on-heap -- so there's no impact of alignment
- During merging, we create a new temp file to write vectors merged from all segments -- which is then used to score vectors during graph creation in the new segment -- and it starts at offset 0 (i.e. already aligned)
The only consistent improvement is the speedup in search time (3-4%).
Another recent run with 1M Cohere vectors, 768d, DOT_PRODUCT, force merged into a single segment:
main:
recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized visited index(s) index_docs/s force_merge(s) num_segments index_size(MB) vec_disk(MB) vec_RAM(MB) indexType
0.892 1.880 1.872 0.996 1000000 100 50 32 250 no 5419 161.46 6193.68 197.68 1 3012.18 2929.688 2929.688 HNSW
This PR:
recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized visited index(s) index_docs/s force_merge(s) num_segments index_size(MB) vec_disk(MB) vec_RAM(MB) indexType
0.891 1.824 1.815 0.995 1000000 100 50 32 250 no 5408 162.23 6163.94 197.08 1 3012.26 2929.688 2929.688 HNSW