Align vectors on disk for optimal performance on ARM CPUs
Description
Today, float vectors are aligned to 4 bytes in a Lucene index, but with Panama we can operate on up to 512 bits (i.e. 64 bytes, or 16 floats) at a time.
I wonder if we should change this alignment to 64 bytes, in order to get optimal vector search performance with Panama?
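For reference, here is a minimal sketch of the padding this would imply when writing the vector data file -- the helper below is a hypothetical illustration, not the actual Lucene codec code:

```java
public class AlignmentPadding {
  // Hypothetical helper: number of filler bytes needed so the next vector
  // starts on a byteAlignment boundary (e.g. 64 bytes).
  static long alignmentPadding(long filePointer, int byteAlignment) {
    long remainder = filePointer % byteAlignment;
    return remainder == 0 ? 0 : byteAlignment - remainder;
  }

  public static void main(String[] args) {
    // e.g. at file offset 4100, write 60 filler bytes so the next float
    // vector begins at offset 4160 (a multiple of 64)
    System.out.println(alignmentPadding(4100, 64)); // prints 60
  }
}
```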
I wrote a small JMH benchmark to "pad" float vectors on disk with some padBytes:
Benchmark (padBytes) (size) Mode Cnt Score Error Units
VectorScorerBenchmark.floatDotProductMemSeg 0 256 thrpt 15 32.848 ± 0.098 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 1 256 thrpt 15 24.158 ± 0.410 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 2 256 thrpt 15 24.055 ± 0.291 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 4 256 thrpt 15 26.381 ± 0.078 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 6 256 thrpt 15 23.772 ± 0.785 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 8 256 thrpt 15 26.666 ± 0.039 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 16 256 thrpt 15 26.753 ± 0.112 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 20 256 thrpt 15 26.489 ± 0.229 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 32 256 thrpt 15 32.805 ± 0.106 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 50 256 thrpt 15 24.651 ± 0.556 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 64 256 thrpt 15 32.762 ± 0.376 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 100 256 thrpt 15 25.888 ± 0.069 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 128 256 thrpt 15 32.874 ± 0.065 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 255 256 thrpt 15 24.906 ± 0.120 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 256 256 thrpt 15 32.780 ± 0.091 ops/us
My machine uses the 256-bit variant of Panama to score vectors, so I saw optimal performance when floats are aligned to 32 bytes -- but I'm keeping the alignment at 64 bytes here as the max case.
cc @mikemccand who found this^ byte-misalignment possibility offline!
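(To check which width Panama picks on a given machine, a quick standalone sketch -- assuming the incubating vector module is enabled with --add-modules jdk.incubator.vector:)

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorSpecies;

public class SpeciesCheck {
  public static void main(String[] args) {
    // e.g. 256 bits / 8 float lanes on the Graviton3 box above
    VectorSpecies<Float> species = FloatVector.SPECIES_PREFERRED;
    System.out.println("preferred float species: " + species
        + " (" + species.vectorBitSize() + " bits, " + species.length() + " lanes)");
  }
}
```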
Also noting that for byte vectors, I saw no impact of padding:
Benchmark (padBytes) (size) Mode Cnt Score Error Units
VectorScorerBenchmark.binaryDotProductMemSeg 0 256 thrpt 15 20.453 ± 0.171 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 1 256 thrpt 15 20.651 ± 0.177 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 2 256 thrpt 15 20.601 ± 0.150 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 4 256 thrpt 15 20.602 ± 0.163 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 6 256 thrpt 15 20.677 ± 0.252 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 8 256 thrpt 15 20.395 ± 0.154 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 16 256 thrpt 15 20.368 ± 0.122 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 20 256 thrpt 15 20.364 ± 0.089 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 32 256 thrpt 15 20.337 ± 0.041 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 50 256 thrpt 15 20.617 ± 0.139 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 64 256 thrpt 15 20.557 ± 0.272 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 100 256 thrpt 15 20.770 ± 0.233 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 128 256 thrpt 15 20.487 ± 0.151 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 255 256 thrpt 15 20.419 ± 0.118 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 256 256 thrpt 15 20.644 ± 0.390 ops/us
...so I'm not changing the byte-vector alignment in this PR.
Wow, alignment still matters, and it matters a lot (24 -> 33 ops/us)! Thank you @kaivalnp for testing. Was this an aarch64 CPU (Graviton 3 or 4?).
It's frustrating how the CPU just silently runs slower ... but what else could it do.
I wonder whether modern x86-64 (Intel, AMD) CPUs also show this effect. I'll test this PR on the nightly Lucene benchy box (beast3).
Was this an aarch64 CPU (Graviton 3 or 4?)
Yes, it was a Graviton3 (m7g) CPU. lscpu says:
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Thread(s) per core: 1
Core(s) per socket: 64
Socket(s): 1
NUMA node(s): 1
Vendor ID: ARM
Model: 1
Stepping: r1p1
BogoMIPS: 2100.00
L1d cache: 64K
L1i cache: 64K
L2 cache: 1024K
L3 cache: 32768K
NUMA node0 CPU(s): 0-63
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm paca pacg dcpodp svei8mm svebf16 i8mm bf16 dgh rng
I tested on beast3 (nightly benchmarking box) -- a Ryzen Threadripper 3990X:
processor : 127
vendor_id : AuthenticAMD
cpu family : 23
model : 49
model name : AMD Ryzen Threadripper 3990X 64-Core Processor
stepping : 0
microcode : 0x830107c
cpu MHz : 2900.000
cache size : 512 KB
physical id : 0
siblings : 128
core id : 63
cpu cores : 64
apicid : 127
initial apicid : 127
fpu : yes
fpu_exception : yes
cpuid level : 16
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sev sev_es
bugs : sysret_ss_attrs spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret spectre_v2_user
bogomips : 5788.93
TLB size : 3072 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 43 bits physical, 48 bits virtual
power management: ts ttp tm hwpstate cpb eff_freq_ro [13] [14]
I applied this PR, built (./gradlew :lucene:benchmark-jmh:assemble), and ran java --module-path lucene/benchmark-jmh/build/benchmarks --module org.apache.lucene.benchmark.jmh VectorScorerBenchmark -p size=256, and got:
Benchmark (padBytes) (size) Mode Cnt Score Error Units
VectorScorerBenchmark.binaryDotProductDefault 0 256 thrpt 15 8.601 ± 0.047 ops/us
VectorScorerBenchmark.binaryDotProductDefault 1 256 thrpt 15 8.573 ± 0.058 ops/us
VectorScorerBenchmark.binaryDotProductDefault 2 256 thrpt 15 8.593 ± 0.023 ops/us
VectorScorerBenchmark.binaryDotProductDefault 4 256 thrpt 15 8.579 ± 0.026 ops/us
VectorScorerBenchmark.binaryDotProductDefault 6 256 thrpt 15 8.584 ± 0.026 ops/us
VectorScorerBenchmark.binaryDotProductDefault 8 256 thrpt 15 8.605 ± 0.019 ops/us
VectorScorerBenchmark.binaryDotProductDefault 16 256 thrpt 15 8.603 ± 0.034 ops/us
VectorScorerBenchmark.binaryDotProductDefault 20 256 thrpt 15 8.583 ± 0.031 ops/us
VectorScorerBenchmark.binaryDotProductDefault 32 256 thrpt 15 8.581 ± 0.030 ops/us
VectorScorerBenchmark.binaryDotProductDefault 50 256 thrpt 15 8.591 ± 0.052 ops/us
VectorScorerBenchmark.binaryDotProductDefault 64 256 thrpt 15 8.611 ± 0.033 ops/us
VectorScorerBenchmark.binaryDotProductDefault 100 256 thrpt 15 8.594 ± 0.051 ops/us
VectorScorerBenchmark.binaryDotProductDefault 128 256 thrpt 15 8.620 ± 0.032 ops/us
VectorScorerBenchmark.binaryDotProductDefault 255 256 thrpt 15 8.597 ± 0.026 ops/us
VectorScorerBenchmark.binaryDotProductDefault 256 256 thrpt 15 8.605 ± 0.056 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 0 256 thrpt 15 25.203 ± 1.850 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 1 256 thrpt 15 25.961 ± 0.047 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 2 256 thrpt 15 25.314 ± 1.959 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 4 256 thrpt 15 25.958 ± 0.067 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 6 256 thrpt 15 25.295 ± 1.977 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 8 256 thrpt 15 26.122 ± 0.073 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 16 256 thrpt 15 26.056 ± 0.184 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 20 256 thrpt 15 25.848 ± 1.589 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 32 256 thrpt 15 25.817 ± 0.417 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 50 256 thrpt 15 26.065 ± 0.585 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 64 256 thrpt 15 26.045 ± 0.162 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 100 256 thrpt 15 26.093 ± 0.061 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 128 256 thrpt 15 26.101 ± 0.090 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 255 256 thrpt 15 26.028 ± 0.088 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 256 256 thrpt 15 26.027 ± 0.301 ops/us
VectorScorerBenchmark.floatDotProductDefault 0 256 thrpt 15 15.241 ± 0.010 ops/us
VectorScorerBenchmark.floatDotProductDefault 1 256 thrpt 15 15.169 ± 0.232 ops/us
VectorScorerBenchmark.floatDotProductDefault 2 256 thrpt 15 15.230 ± 0.082 ops/us
VectorScorerBenchmark.floatDotProductDefault 4 256 thrpt 15 15.231 ± 0.034 ops/us
VectorScorerBenchmark.floatDotProductDefault 6 256 thrpt 15 15.229 ± 0.048 ops/us
VectorScorerBenchmark.floatDotProductDefault 8 256 thrpt 15 15.216 ± 0.091 ops/us
VectorScorerBenchmark.floatDotProductDefault 16 256 thrpt 15 15.278 ± 0.048 ops/us
VectorScorerBenchmark.floatDotProductDefault 20 256 thrpt 15 15.058 ± 0.711 ops/us
VectorScorerBenchmark.floatDotProductDefault 32 256 thrpt 15 15.192 ± 0.100 ops/us
VectorScorerBenchmark.floatDotProductDefault 50 256 thrpt 15 15.300 ± 0.047 ops/us
VectorScorerBenchmark.floatDotProductDefault 64 256 thrpt 15 15.257 ± 0.083 ops/us
VectorScorerBenchmark.floatDotProductDefault 100 256 thrpt 15 15.272 ± 0.038 ops/us
VectorScorerBenchmark.floatDotProductDefault 128 256 thrpt 15 15.144 ± 0.529 ops/us
VectorScorerBenchmark.floatDotProductDefault 255 256 thrpt 15 15.248 ± 0.024 ops/us
VectorScorerBenchmark.floatDotProductDefault 256 256 thrpt 15 15.276 ± 0.039 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 0 256 thrpt 15 20.360 ± 0.077 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 1 256 thrpt 15 20.252 ± 0.177 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 2 256 thrpt 15 20.281 ± 0.060 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 4 256 thrpt 15 20.261 ± 0.048 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 6 256 thrpt 15 20.285 ± 0.063 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 8 256 thrpt 15 20.359 ± 0.072 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 16 256 thrpt 15 20.344 ± 0.078 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 20 256 thrpt 15 20.272 ± 0.090 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 32 256 thrpt 15 20.413 ± 0.010 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 50 256 thrpt 15 20.066 ± 0.051 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 64 256 thrpt 15 20.386 ± 0.051 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 100 256 thrpt 15 20.029 ± 0.095 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 128 256 thrpt 15 20.348 ± 0.049 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 255 256 thrpt 15 20.047 ± 0.101 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 256 256 thrpt 15 20.335 ± 0.037 ops/us
Net/net it seems like alignment of the memory-mapped vectors (in RAM / virtual address space) doesn't matter on this CPU?
I also tested a newer CPU (Raptor Lake) -- I'll post that shortly.
The Raptor Lake box is an i9-13900K:
processor : 31
vendor_id : GenuineIntel
cpu family : 6
model : 183
model name : 13th Gen Intel(R) Core(TM) i9-13900K
stepping : 1
microcode : 0x12f
cpu MHz : 800.000
cache size : 36864 KB
physical id : 0
siblings : 32
core id : 47
cpu cores : 24
apicid : 94
initial apicid : 94
fpu : yes
fpu_exception : yes
cpuid level : 32
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves split_lock_detect user_shstk avx_vnni dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req hfi vnmi umip pku ospke waitpkg gfni vaes vpclmulqdq rdpid movdiri movdir64b fsrm md_clear serialize pconfig arch_lbr ibt flush_l1d arch_capabilities
vmx flags : vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid ple shadow_vmcs ept_violation_ve ept_mode_based_exec tsc_scaling usr_wait_pause
bugs : spectre_v1 spectre_v2 spec_store_bypass swapgs eibrs_pbrsb rfds bhi spectre_v2_user
bogomips : 5990.40
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
Results:
Benchmark (padBytes) (size) Mode Cnt Score Error Units
VectorScorerBenchmark.binaryDotProductDefault 0 256 thrpt 15 14.037 ± 0.061 ops/us
VectorScorerBenchmark.binaryDotProductDefault 1 256 thrpt 15 14.046 ± 0.071 ops/us
VectorScorerBenchmark.binaryDotProductDefault 2 256 thrpt 15 14.139 ± 0.089 ops/us
VectorScorerBenchmark.binaryDotProductDefault 4 256 thrpt 15 14.069 ± 0.040 ops/us
VectorScorerBenchmark.binaryDotProductDefault 6 256 thrpt 15 14.038 ± 0.072 ops/us
VectorScorerBenchmark.binaryDotProductDefault 8 256 thrpt 15 14.094 ± 0.070 ops/us
VectorScorerBenchmark.binaryDotProductDefault 16 256 thrpt 15 14.073 ± 0.059 ops/us
VectorScorerBenchmark.binaryDotProductDefault 20 256 thrpt 15 14.134 ± 0.075 ops/us
VectorScorerBenchmark.binaryDotProductDefault 32 256 thrpt 15 14.016 ± 0.044 ops/us
VectorScorerBenchmark.binaryDotProductDefault 50 256 thrpt 15 14.031 ± 0.046 ops/us
VectorScorerBenchmark.binaryDotProductDefault 64 256 thrpt 15 14.082 ± 0.068 ops/us
VectorScorerBenchmark.binaryDotProductDefault 100 256 thrpt 15 14.013 ± 0.059 ops/us
VectorScorerBenchmark.binaryDotProductDefault 128 256 thrpt 15 14.079 ± 0.069 ops/us
VectorScorerBenchmark.binaryDotProductDefault 255 256 thrpt 15 14.143 ± 0.074 ops/us
VectorScorerBenchmark.binaryDotProductDefault 256 256 thrpt 15 14.026 ± 0.028 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 0 256 thrpt 15 49.305 ± 0.244 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 1 256 thrpt 15 48.572 ± 0.030 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 2 256 thrpt 15 48.508 ± 0.198 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 4 256 thrpt 15 48.636 ± 0.094 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 6 256 thrpt 15 48.536 ± 0.185 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 8 256 thrpt 15 49.346 ± 0.166 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 16 256 thrpt 15 49.419 ± 0.102 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 20 256 thrpt 15 49.224 ± 0.396 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 32 256 thrpt 15 49.423 ± 0.134 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 50 256 thrpt 15 48.676 ± 0.167 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 64 256 thrpt 15 49.060 ± 0.866 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 100 256 thrpt 15 49.181 ± 0.210 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 128 256 thrpt 15 49.444 ± 0.082 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 255 256 thrpt 15 48.362 ± 0.163 ops/us
VectorScorerBenchmark.binaryDotProductMemSeg 256 256 thrpt 15 48.169 ± 5.291 ops/us
VectorScorerBenchmark.floatDotProductDefault 0 256 thrpt 15 23.215 ± 0.023 ops/us
VectorScorerBenchmark.floatDotProductDefault 1 256 thrpt 15 23.207 ± 0.067 ops/us
VectorScorerBenchmark.floatDotProductDefault 2 256 thrpt 15 23.181 ± 0.086 ops/us
VectorScorerBenchmark.floatDotProductDefault 4 256 thrpt 15 23.156 ± 0.290 ops/us
VectorScorerBenchmark.floatDotProductDefault 6 256 thrpt 15 23.232 ± 0.012 ops/us
VectorScorerBenchmark.floatDotProductDefault 8 256 thrpt 15 23.215 ± 0.091 ops/us
VectorScorerBenchmark.floatDotProductDefault 16 256 thrpt 15 23.194 ± 0.071 ops/us
VectorScorerBenchmark.floatDotProductDefault 20 256 thrpt 15 23.202 ± 0.083 ops/us
VectorScorerBenchmark.floatDotProductDefault 32 256 thrpt 15 23.207 ± 0.048 ops/us
VectorScorerBenchmark.floatDotProductDefault 50 256 thrpt 15 23.227 ± 0.031 ops/us
VectorScorerBenchmark.floatDotProductDefault 64 256 thrpt 15 23.187 ± 0.095 ops/us
VectorScorerBenchmark.floatDotProductDefault 100 256 thrpt 15 23.246 ± 0.114 ops/us
VectorScorerBenchmark.floatDotProductDefault 128 256 thrpt 15 23.214 ± 0.077 ops/us
VectorScorerBenchmark.floatDotProductDefault 255 256 thrpt 15 23.212 ± 0.035 ops/us
VectorScorerBenchmark.floatDotProductDefault 256 256 thrpt 15 23.239 ± 0.117 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 0 256 thrpt 15 53.514 ± 5.159 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 1 256 thrpt 15 49.594 ± 3.885 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 2 256 thrpt 15 50.504 ± 0.122 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 4 256 thrpt 15 51.385 ± 4.406 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 6 256 thrpt 15 50.497 ± 0.146 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 8 256 thrpt 15 52.327 ± 0.292 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 16 256 thrpt 15 51.401 ± 4.426 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 20 256 thrpt 15 52.373 ± 0.307 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 32 256 thrpt 15 54.779 ± 0.078 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 50 256 thrpt 15 49.447 ± 1.502 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 64 256 thrpt 15 54.788 ± 0.060 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 100 256 thrpt 15 51.600 ± 0.352 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 128 256 thrpt 15 54.650 ± 0.377 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 255 256 thrpt 15 50.042 ± 0.167 ops/us
VectorScorerBenchmark.floatDotProductMemSeg 256 256 thrpt 15 54.583 ± 0.399 ops/us
There might be some small misalignment penalty for float SIMD?
Thanks @mikemccand, there doesn't seem to be any misalignment penalty on beast3 (the nightly benchmarking box, a Ryzen Threadripper 3990X). There's definitely some impact of alignment on the Raptor Lake box (i9-13900K), but it is smaller than on my machine (<10%) -- so this alignment issue mostly affects Graviton, or ARM CPUs in general, as @rmuir shared?
I tried running knnPerfTest.py on Cohere vectors (768d) with DOT_PRODUCT similarity:
main (4-byte-alignment)
recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized visited index(s) index_docs/s force_merge(s) num_segments index_size(MB) vec_disk(MB) vec_RAM(MB) indexType
0.890 1.917 1.908 0.995 1000000 100 50 32 250 no 5388 67.66 14780.22 130.41 1 3014.60 2929.688 2929.688 HNSW
This PR (64-byte-alignment)
recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized visited index(s) index_docs/s force_merge(s) num_segments index_size(MB) vec_disk(MB) vec_RAM(MB) indexType
0.891 1.845 1.836 0.995 1000000 100 50 32 250 no 5403 62.48 16004.61 130.03 1 3014.90 2929.688 2929.688 HNSW
Indexing sped up by ~7.6% (index(s) 67.66 → 62.48), and search sped up by ~3.8% (latency 1.917 ms → 1.845 ms).
I see another action item from this benchmark: I wasn't aligning the output inside this merge function, which is used by HNSW-based vector formats for merging (see that index(s) improved in my benchmark, but not force_merge(s) -- which should speed up after this additional change?)
Just as a general comment around this performance: the 256-bit SVE vectors available on these processors have unfortunately not had a lot of love from our side.
Lots of digging into the 128-bit NEON on the Mac and the various AVX on Intel, but not much yet on SVE.
I also think they are in early stages on the Java side: I see a lot of mailing-list traffic about improvements to these vectors in OpenJDK, especially changes that might impact the integer side (e.g. compress). It might even be worth trying to build a snapshot of the JDK.
Thanks @rmuir. How does the Panama Vector API handle alignment? Does it have methods to allocate aligned on-heap or off-heap vectors? Hmm, it looks like SegmentAllocator has an allocate method that takes a byteAlignment, so it is possible in pure Java.
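A minimal sketch of that, assuming Java 21+ (java.lang.foreign); Arena implements SegmentAllocator, so the two-argument allocate with a byteAlignment is available there too:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class AlignedAlloc {
  public static void main(String[] args) {
    // Allocate an off-heap float vector whose start address is 64-byte aligned.
    try (Arena arena = Arena.ofConfined()) {
      int dims = 768;
      MemorySegment vec = arena.allocate((long) dims * Float.BYTES, 64);
      System.out.println("aligned? " + (vec.address() % 64 == 0)); // prints true
      vec.setAtIndex(ValueLayout.JAVA_FLOAT, 0, 1.0f); // write the first float
    }
  }
}
```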
I see another action item from this benchmark: I wasn't aligning the output inside this merge function, which is used by HNSW-based vector formats for merging (see that index(s) improved in my benchmark, but not force_merge(s) -- which should speed up after this additional change?)
Oh good catch! I wonder what other places might write the flat vectors?
Is the alignment also (or maybe less) important for the quantized cases? (Your results above are for float32 vectors?)
Maybe at least luceneutil could somehow warn if vectors are unaligned during its perf testing?
I wasn't aligning the output inside this merge function
Hmm, this did not help for some reason (merge time increased)...
main (4-byte-alignment)
recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized visited index(s) index_docs/s force_merge(s) num_segments index_size(MB) vec_disk(MB) vec_RAM(MB) indexType
0.891 1.895 1.887 0.996 1000000 100 50 32 250 no 5408 72.35 13822.46 101.30 1 3014.54 2929.688 2929.688 HNSW
This PR (64-byte-alignment)
recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized visited index(s) index_docs/s force_merge(s) num_segments index_size(MB) vec_disk(MB) vec_RAM(MB) indexType
0.893 1.868 1.860 0.996 1000000 100 50 32 250 no 5426 69.55 14378.97 127.48 1 3015.40 2929.688 2929.688 HNSW
I'll still add a commit + revert, so people can see what I tried, and comment if I'm missing something!
Is the alignment also (or maybe less) important for the quantized cases?
I think alignment is less important for quantized vectors (which are stored as byte vectors on disk), because none of the JMH benchmarks show non-trivial variation with padding (see VectorScorerBenchmark.binaryDotProductMemSeg)?
Your results above are for float32 vectors?
Yeah, those benchmarks^ are for float vectors
Maybe at least luceneutil could somehow warn if vectors are unaligned during its perf testing?
I added some print statements to complain when addresses were not 64-byte aligned (i.e. MemorySegment#address % 64 != 0) -- and it complained only in the baseline benchmarks...
Not committing because it may not be needed after this PR?
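(For reference, the kind of check described above -- a hypothetical sketch, not the actual patch to luceneutil or Lucene:)

```java
import java.lang.foreign.MemorySegment;

final class AlignmentCheck {
  // Warn whenever a vector's backing memory segment is not 64-byte aligned.
  static void warnIfUnaligned(MemorySegment seg, String context) {
    if (seg.address() % 64 != 0) {
      System.err.println("WARNING: " + context
          + " not 64-byte aligned (address=" + seg.address() + ")");
    }
  }
}
```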
Sorry for the delay here. I ran benchmarks a few more times offline, and the differences in index(s) and force_merge(s) seem to be noise (they take about the same time on main vs. this PR, on average).
This is because:
- During indexing, the HNSW graph is built on-heap -- so there's no impact of alignment
- During merging, we create a new temp file to write vectors merged from all segments -- which is then used to score vectors during graph creation in the new segment -- and it starts at offset 0 (i.e. already aligned)
The only consistent improvement is the speedup in search time (3-4%).
Another recent run with 1M Cohere vectors, 768d, DOT_PRODUCT, force merged into a single segment:
main:
recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized visited index(s) index_docs/s force_merge(s) num_segments index_size(MB) vec_disk(MB) vec_RAM(MB) indexType
0.892 1.880 1.872 0.996 1000000 100 50 32 250 no 5419 161.46 6193.68 197.68 1 3012.18 2929.688 2929.688 HNSW
This PR:
recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized visited index(s) index_docs/s force_merge(s) num_segments index_size(MB) vec_disk(MB) vec_RAM(MB) indexType
0.891 1.824 1.815 0.995 1000000 100 50 32 250 no 5408 162.23 6163.94 197.08 1 3012.26 2929.688 2929.688 HNSW