Kaival Parikh

Results: 37 comments by Kaival Parikh

Thanks @benwtrent! I opened #14863

LGTM overall, do you have an estimate of performance impact? (JMH or luceneutil)

I wrote a small JMH benchmark to "pad" float vectors on disk with some `padBytes`:

```
Benchmark                                    (padBytes)  (size)   Mode  Cnt   Score  Error  Units
VectorScorerBenchmark.floatDotProductMemSeg           0     256  thrpt   15  32.848...
```

cc @mikemccand who found this^ byte-misalignment possibility offline!
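To make the padding idea concrete, here is a minimal sketch of how leading pad bytes shift where each float vector starts. It is not the JMH benchmark referred to above: the class `PaddedVectorSketch` and its methods are hypothetical, it uses a heap `ByteBuffer` rather than a memory-mapped segment, and it only illustrates the layout, not the performance effect. Offsets here are relative to the buffer start; for a memory-mapped file the mapping itself normally begins on a page boundary, so file-offset alignment is what matters.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical sketch: lays out float vectors after padBytes of leading padding. */
class PaddedVectorSketch {

  /** Writes padBytes zero bytes, then every vector back to back as little-endian floats. */
  static ByteBuffer writePadded(int padBytes, float[]... vectors) {
    int total = padBytes;
    for (float[] v : vectors) {
      total += v.length * Float.BYTES;
    }
    ByteBuffer buf = ByteBuffer.allocate(total).order(ByteOrder.LITTLE_ENDIAN);
    buf.position(padBytes); // a freshly allocated buffer is zero-filled, so skipping pads it
    for (float[] v : vectors) {
      for (float f : v) {
        buf.putFloat(f);
      }
    }
    return buf;
  }

  /** True if byteOffset is a multiple of the float size (4 bytes). */
  static boolean isFloatAligned(long byteOffset) {
    return byteOffset % Float.BYTES == 0;
  }

  /** Scalar dot product of two dims-long vectors stored at the given byte offsets. */
  static float dotProduct(ByteBuffer buf, int offsetA, int offsetB, int dims) {
    float sum = 0f;
    for (int i = 0; i < dims; i++) {
      sum += buf.getFloat(offsetA + i * Float.BYTES) * buf.getFloat(offsetB + i * Float.BYTES);
    }
    return sum;
  }

  public static void main(String[] args) {
    float[] a = {1f, 2f, 3f, 4f};
    float[] b = {4f, 3f, 2f, 1f};
    for (int padBytes = 0; padBytes < Float.BYTES; padBytes++) {
      ByteBuffer buf = writePadded(padBytes, a, b);
      int offsetA = padBytes;
      int offsetB = padBytes + a.length * Float.BYTES;
      System.out.println(
          "padBytes=" + padBytes
              + " aligned=" + isFloatAligned(offsetA)
              + " dot=" + dotProduct(buf, offsetA, offsetB, a.length));
    }
  }
}
```

The computed dot product is the same for every `padBytes`; only the alignment of the vector start changes, which is the variable a benchmark like the one described would isolate.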

Also noting that for byte vectors, I saw no impact of padding:

```
Benchmark                                     (padBytes)  (size)   Mode  Cnt   Score   Error   Units
VectorScorerBenchmark.binaryDotProductMemSeg           0     256  thrpt   15  20.453 ± 0.171  ops/us...
```

> Was this an `aarch64` CPU (Graviton 3 or 4?)

Yes, it was a Graviton3 (`m7g`) CPU. `lscpu` says:

```
Architecture:    aarch64
CPU op-mode(s):  32-bit, 64-bit
Byte Order:      Little Endian...
```

Thanks @mikemccand, there doesn't seem to be any performance penalty on "[beast3 (nightly benchmarking box)](https://blog.mikemccandless.com/2021/01/apache-lucene-performance-on-128-core.html) -- a Ryzen Threadripper 3990X". There's definitely some impact of alignment on "Raptor Lake box...

> I wasn't aligning the output inside [this merge function](https://github.com/apache/lucene/blob/eb27b14eaa09c53496a50c5944160b4989910882/lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99FlatVectorsWriter.java#L252-L253)

Hmm, this did not help for some reason (merge time increased).

`main` (4-byte alignment)

```
recall  latency(ms)  netCPU  avgCpuCount  nDoc  topK...
```
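For readers unfamiliar with what "aligning the output" means here, the sketch below shows the arithmetic: round the current write position up to the next multiple of the alignment and fill the gap with zero bytes before writing vector data. The class `AlignedWriteSketch` and its methods are hypothetical and use only the JDK; Lucene's `IndexOutput` has its own `alignFilePointer` for this purpose, and this is not a copy of that code.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical sketch of padding an output to an alignment boundary. */
class AlignedWriteSketch {

  /** Rounds offset up to the next multiple of alignmentBytes (must be a power of two). */
  static long alignOffset(long offset, int alignmentBytes) {
    if (offset < 0) {
      throw new IllegalArgumentException("offset must be non-negative: " + offset);
    }
    if (alignmentBytes <= 0 || Integer.bitCount(alignmentBytes) != 1) {
      throw new IllegalArgumentException("alignment must be a power of two: " + alignmentBytes);
    }
    long mask = alignmentBytes - 1L;
    return (offset + mask) & ~mask;
  }

  /** Appends zero bytes until the stream size is aligned; returns the aligned position. */
  static long padTo(ByteArrayOutputStream out, int alignmentBytes) {
    long aligned = alignOffset(out.size(), alignmentBytes);
    while (out.size() < aligned) {
      out.write(0);
    }
    return aligned;
  }

  public static void main(String[] args) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    out.write(new byte[13], 0, 13); // stand-in for a 13-byte header
    long start4 = padTo(out, Float.BYTES); // 4-byte alignment
    System.out.println("4-byte aligned start: " + start4);
    long start64 = padTo(out, 64); // a wider, cache-line-sized alignment
    System.out.println("64-byte aligned start: " + start64);
  }
}
```

A wider alignment costs at most `alignmentBytes - 1` padding bytes per aligned region, so the space overhead is negligible next to the vector data itself.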

Sorry for the delay here. I ran the benchmarks a few more times offline, and the differences in `index(s)` and `force_merge(s)` seem to be noise (they take about the same time...

[`VectorUtilBenchmark`](https://github.com/apache/lucene/blob/da8c674bf85855d4b56dc70ad44d207b437f3aca/lucene/benchmark-jmh/src/java/org/apache/lucene/benchmark/jmh/VectorUtilBenchmark.java#L46) results:

`main`

```
Benchmark                                   (size)   Mode  Cnt  Score   Error   Units
VectorUtilBenchmark.binaryCosineScalar        1024  thrpt   15  0.841 ± 0.001  ops/us
VectorUtilBenchmark.binaryCosineVector        1024  thrpt   15  4.778 ± 0.012  ops/us
VectorUtilBenchmark.binaryDotProductScalar    1024  thrpt...
```