Switch nightly benchy to more realistic `Cohere/wikipedia-22-12-en-embeddings` vectors
#255 added realistic Cohere/wikipedia-22-12-en-embeddings 768-dim vectors to luceneutil -- let's switch the nightlies over to use these vectors instead.
I attempted to follow the README instructions to generate nightly benchy vectors, using this command:
python3 -u src/python/infer_token_vectors_cohere.py ../data/cohere-wikipedia-768.vec 27625038 ../data/cohere-wikipedia-queries-768.vec 10000
(Note that the nightly benchy only does indexing, so I really only need the first file)
But this apparently consumes gobs of RAM, and the Linux OOM killer killed it!
Is this expected? I can run this on a beefier machine if need be (current machine has "only" 256 GB and no swap) for this one-time generation of vectors ...
Maybe datasets.load_dataset can load just the N vectors I need, not everything in the train split?
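E.g., maybe one of these would work -- both are documented `datasets` features, though I haven't verified their memory behavior here:

```python
import datasets

# Option 1: ask for only the first N rows of the split up front:
ds = datasets.load_dataset("Cohere/wikipedia-22-12-en-embeddings",
                           split="train[:27625038]")

# Option 2: stream rows lazily instead of materializing the Arrow shards:
ds_stream = datasets.load_dataset("Cohere/wikipedia-22-12-en-embeddings",
                                  split="train", streaming=True)
```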
Oooh this load_dataset method takes a parameter keep_in_memory! I'll poke around.
OK well that keep_in_memory=False parameter seemed to do nothing -- still the OOM killer at 256 GB RAM.
With this change to chunk into 1M-vector blocks when writing the index-time inferred vectors, I was able to run the tool!
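The core of the change is a windowed loop that slices 1M rows at a time out of the `Dataset` and appends them to the output file; condensed (the same loop shows up as context in the diffs below), it looks like:

```python
import numpy as np

# ds, num_docs and filename are as in infer_token_vectors_cohere.py
doc_upto = 0
window_num_docs = 1000000
while doc_upto < num_docs:
    next_doc_upto = min(doc_upto + window_num_docs, num_docs)
    # pull only this window's embeddings out of the Dataset:
    ds_embs = ds[doc_upto:next_doc_upto]['emb']
    embs = np.array(ds_embs)
    print(f"saving docs[{doc_upto}:{next_doc_upto} of shape: {embs.shape} to file")
    # append the raw vectors to the growing .vec file:
    with open(filename, "ab") as out_f:
        embs.tofile(out_f)
    doc_upto = next_doc_upto
```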
Full output below:
beast3:util.nightly[master]$ python3 -u src/python/infer_token_vectors_cohere.py ../data/cohere-wikipedia-768.vec 27625038 ../data/cohere-wikipedia-queries-768.vec 10000
Resolving data files: 100%|████████████████████████████████████████████| 253/253 [00:01<00:00, 250.70it/s]
Loading dataset shards: 100%|█████████████████████████████████████████| 252/252 [00:00<00:00, 1121.66it/s]
total number of rows: 35167920
embeddings dims: 768
saving docs[0:1000000 of shape: (1000000, 768) to file
saving docs[1000000:2000000 of shape: (1000000, 768) to file
saving docs[2000000:3000000 of shape: (1000000, 768) to file
saving docs[3000000:4000000 of shape: (1000000, 768) to file
saving docs[4000000:5000000 of shape: (1000000, 768) to file
saving docs[5000000:6000000 of shape: (1000000, 768) to file
saving docs[6000000:7000000 of shape: (1000000, 768) to file
saving docs[7000000:8000000 of shape: (1000000, 768) to file
saving docs[8000000:9000000 of shape: (1000000, 768) to file
saving docs[9000000:10000000 of shape: (1000000, 768) to file
saving docs[10000000:11000000 of shape: (1000000, 768) to file
saving docs[11000000:12000000 of shape: (1000000, 768) to file
saving docs[12000000:13000000 of shape: (1000000, 768) to file
saving docs[13000000:14000000 of shape: (1000000, 768) to file
saving docs[14000000:15000000 of shape: (1000000, 768) to file
saving docs[15000000:16000000 of shape: (1000000, 768) to file
saving docs[16000000:17000000 of shape: (1000000, 768) to file
saving docs[17000000:18000000 of shape: (1000000, 768) to file
saving docs[18000000:19000000 of shape: (1000000, 768) to file
saving docs[19000000:20000000 of shape: (1000000, 768) to file
saving docs[20000000:21000000 of shape: (1000000, 768) to file
saving docs[21000000:22000000 of shape: (1000000, 768) to file
saving docs[22000000:23000000 of shape: (1000000, 768) to file
saving docs[23000000:24000000 of shape: (1000000, 768) to file
saving docs[24000000:25000000 of shape: (1000000, 768) to file
saving docs[25000000:26000000 of shape: (1000000, 768) to file
saving docs[26000000:27000000 of shape: (1000000, 768) to file
saving docs[27000000:27625038 of shape: (625038, 768) to file
saving queries of shape: (10000, 768) to file
reading docs of shape: (27625038, 768)
reading queries shape: (10000, 768)
It produced a large .vec file:
beast3:util.nightly[master]$ ls -lh ../data/cohere-wikipedia-768.vec
-rw-r--r-- 1 mike mike 159G Mar 10 22:13 ../data/cohere-wikipedia-768.vec
Next I'll try switching to this source for nightly benchy. I'll also publish this on home.apache.org.
Hmm, except, that file is too large?
beast3:util.nightly[master]$ python3
Python 3.11.7 (main, Jan 29 2024, 16:03:57) [GCC 13.2.1 20230801] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 27000000 * 768 * 4 / 1024 / 1024 / 1024
77.24761962890625
It's 159 GB but should be ~77 GB?
Maybe my "chunking" is buggy :)
OK I think these are float64 typed vectors (27625038 * 768 * 8 bytes ≈ 158 GB, which matches the 159G file), in which case the file size makes sense. But I think nightly benchy wants float32?
And I think knnPerfTest.py/KnnGraphTester.java also wants float32? I'm confused how they are working now on the generated file ...
Oooh this Dataset.cast method looks promising! I'll explore...
OK I made this change and kicked off infer_token_vectors_cohere.py again and it looks to at least be running...:
```diff
diff --git a/src/python/infer_token_vectors_cohere.py b/src/python/infer_token_vectors_cohere.py
index 5c350df..5027eb2 100644
--- a/src/python/infer_token_vectors_cohere.py
+++ b/src/python/infer_token_vectors_cohere.py
@@ -28,11 +28,19 @@ for name in (filename, filename_queries):
 ds = datasets.load_dataset("Cohere/wikipedia-22-12-en-embeddings",
                            split="train")
+print(f'features: {ds.features}')
 print(f"total number of rows: {len(ds)}")
 print(f"embeddings dims: {len(ds[0]['emb'])}")
 # ds = ds[:num_docs]
+# we just want the vector embeddings:
+for feature_name in ds.features.keys():
+  if feature_name != 'emb':
+    ds = ds.remove_columns(feature_name)
+
+ds = ds.cast(datasets.Features({'emb': datasets.Sequence(feature=datasets.Value("float32"))}))
+
 # do this in windows, else the RAM usage is crazy (OOME even with 256
 # GB RAM since I think this step makes 2X copy of the dataset?)
 doc_upto = 0
```
OK hmm scratch that, I see from the already loaded features that Dataset thinks these emb vectors are already float32:
features: {'id': Value(dtype='int32', id=None), 'title': Value(dtype='string', id=None), 'text': Value(dtype='string', id=None), 'url': Value(dtype='string', id=None), 'wiki_id': Value(dtype='int32', id=None), 'views': Value(dtype='float32', id=None), 'paragraph_id': Value(dtype='int32', id=None), 'langs': Value(dtype='int32', id=None), 'emb': Sequence(feature=Value(dtype='float32', id=None), length=-1, id=None)}
OK! Now I think the issue is in np.array -- I think we have to give it a preferred data type, else, it seems to be casting the Dataset's float32 up to float64, maybe.
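Quick standalone check of that theory: np.array over nested Python lists of floats does default to float64 unless you hand it a dtype:

```python
import numpy as np

embs = np.array([[0.1, 0.2, 0.3]])
print(embs.dtype)   # float64 -- 8 bytes per value, hence the ~2X file size

embs = np.array([[0.1, 0.2, 0.3]], dtype=np.single)
print(embs.dtype)   # float32
```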
So, now I'm testing this:
```diff
diff --git a/src/python/infer_token_vectors_cohere.py b/src/python/infer_token_vectors_cohere.py
index 5c350df..4cc305e 100644
--- a/src/python/infer_token_vectors_cohere.py
+++ b/src/python/infer_token_vectors_cohere.py
@@ -28,11 +28,20 @@ for name in (filename, filename_queries):
 ds = datasets.load_dataset("Cohere/wikipedia-22-12-en-embeddings",
                            split="train")
+print(f'features: {ds.features}')
 print(f"total number of rows: {len(ds)}")
 print(f"embeddings dims: {len(ds[0]['emb'])}")
 # ds = ds[:num_docs]
+if False:
+  # we just want the vector embeddings:
+  for feature_name in ds.features.keys():
+    if feature_name != 'emb':
+      ds = ds.remove_columns(feature_name)
+
+  ds = ds.cast(datasets.Features({'emb': datasets.Sequence(feature=datasets.Value("float32"))}))
+
 # do this in windows, else the RAM usage is crazy (OOME even with 256
 # GB RAM since I think this step makes 2X copy of the dataset?)
 doc_upto = 0
@@ -40,7 +49,7 @@ window_num_docs = 1000000
 while doc_upto < num_docs:
   next_doc_upto = min(doc_upto + window_num_docs, num_docs)
   ds_embs = ds[doc_upto:next_doc_upto]['emb']
-  embs = np.array(ds_embs)
+  embs = np.array(ds_embs, dtype=np.single)
   print(f"saving docs[{doc_upto}:{next_doc_upto} of shape: {embs.shape} to file")
   with open(filename, "ab") as out_f:
     embs.tofile(out_f)
```
OK the above change seemed to have worked (I just pushed it)! I now see these vector files:
-rw-r--r-- 1 mike mike 80G Mar 28 12:57 cohere-wikipedia-768.vec
-rw-r--r-- 1 mike mike 586M Mar 28 12:57 cohere-wikipedia-queries-768.vec
Now I will try to confirm their recall seems sane, and then switch nightly to them.
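For the sanity check I'm thinking a brute-force baseline over a sample of docs is enough: exact top-k by dot product, compared against what HNSW returns. A minimal sketch, assuming the .vec files hold raw little-endian float32 rows:

```python
import numpy as np

dim, k, sample = 768, 10, 1_000_000
docs = np.memmap('../data/cohere-wikipedia-768.vec', dtype=np.float32,
                 mode='r').reshape(-1, dim)[:sample]
queries = np.memmap('../data/cohere-wikipedia-queries-768.vec',
                    dtype=np.float32, mode='r').reshape(-1, dim)

def exact_top_k(q):
    # brute-force dot-product scores; argpartition avoids a full sort
    scores = docs @ q
    return set(np.argpartition(scores, -k)[-k:].tolist())

# recall@10 for each query = |hnsw top-10 ∩ exact_top_k(q)| / 10
```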
OK I think the next wrinkle here is ... to fix SearchPerfTest to use the pre-computed Cohere query vectors from cohere-wikipedia-queries-768.vec, instead of attempting to do inference based on the lexical tokens of each incoming query. I guess we could just incrementally pull the vectors from the query vectors file and assign them sequentially to each vector query we see? @msokolov does that sound reasonable?
I think we can modify VectorDictionary to accept a --no-tokenize option and then look up the vector using the full query text? We would need to generate a text file with the queries, one per line, to correspond with the binary vector file.
Otherwise you could simply select some random vector every time you see a vector-type query task?? But I would expect some vectors to behave differently from others? Not sure.
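To make the sequential-assignment idea concrete, here's a hypothetical sketch (names invented for illustration; the real change would live in SearchPerfTest / VectorDictionary on the Java side):

```python
import numpy as np

class SequentialQueryVectors:
    """Hand out the pre-computed query vectors in order, wrapping around."""

    def __init__(self, path, dim=768):
        self.vectors = np.memmap(path, dtype=np.float32,
                                 mode='r').reshape(-1, dim)
        self.upto = 0

    def next_vector(self):
        v = self.vectors[self.upto % len(self.vectors)]
        self.upto += 1
        return v
```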
I was finally able to index/search using these Cohere vectors, and the profiler output is sort of strange:
This is CPU:
PROFILE SUMMARY from 44698 events (total: 44698)
tests.profile.mode=cpu
tests.profile.count=30
tests.profile.stacksize=1
tests.profile.linenumbers=false
PERCENT CPU SAMPLES STACK
10.98% 4907 jdk.internal.misc.ScopedMemoryAccess#getByteInternal()
6.05% 2702 jdk.incubator.vector.FloatVector#reduceLanesTemplate()
4.06% 1813 org.apache.lucene.store.MemorySegmentIndexInput#readByte()
3.63% 1622 perf.PKLookupTask#go()
2.89% 1292 org.apache.lucene.store.DataInput#readVInt()
2.84% 1269 org.apache.lucene.util.hnsw.HnswGraphSearcher#searchLevel()
2.58% 1153 org.apache.lucene.util.fst.FST#findTargetArc()
2.45% 1093 jdk.incubator.vector.FloatVector#fromArray0Template()
2.28% 1018 org.apache.lucene.util.LongHeap#downHeap()
2.25% 1007 org.apache.lucene.util.SparseFixedBitSet#insertLong()
2.06% 921 org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum#seekExact()
2.05% 916 jdk.internal.foreign.AbstractMemorySegmentImpl#checkBounds()
1.96% 875 jdk.incubator.vector.FloatVector#lanewiseTemplate()
1.91% 852 jdk.internal.util.ArraysSupport#mismatch()
1.40% 627 org.apache.lucene.util.compress.LZ4#decompress()
1.31% 586 jdk.internal.foreign.MemorySessionImpl#checkValidStateRaw()
1.18% 526 org.apache.lucene.util.SparseFixedBitSet#getAndSet()
1.15% 516 org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsReader$OffHeapHnswGraph#seek()
0.99% 444 org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader$BlockState#doReset()
0.99% 441 org.apache.lucene.util.BytesRef#compareTo()
0.94% 418 org.apache.lucene.util.fst.FST#readArcByDirectAddressing()
0.92% 413 org.apache.lucene.search.TopKnnCollector#topDocs()
0.91% 408 org.apache.lucene.index.VectorSimilarityFunction$2#compare()
0.90% 401 org.apache.lucene.codecs.hnsw.DefaultFlatVectorScorer$FloatVectorScorer#score()
0.80% 359 org.apache.lucene.index.SegmentInfo#maxDoc()
0.79% 353 java.util.Arrays#fill()
0.75% 334 org.apache.lucene.codecs.lucene99.Lucene99PostingsReader#decodeTerm()
0.73% 326 java.util.Arrays#compareUnsigned()
0.72% 324 org.apache.lucene.search.ReferenceManager#acquire()
0.71% 317 org.apache.lucene.store.DataInput#readVLong()
and this is HEAP:
PROFILE SUMMARY from 748 events (total: 38182M)
tests.profile.mode=heap
tests.profile.count=30
tests.profile.stacksize=1
tests.profile.linenumbers=false
PERCENT HEAP SAMPLES STACK
21.88% 8355M java.util.concurrent.locks.AbstractQueuedSynchronizer#acquire()
13.50% 5154M org.apache.lucene.util.ArrayUtil#growNoCopy()
9.31% 3556M org.apache.lucene.util.SparseFixedBitSet#insertLong()
9.00% 3436M perf.StatisticsHelper#startStatistics()
9.00% 3436M java.util.ArrayList#iterator()
5.76% 2199M org.apache.lucene.util.fst.ByteSequenceOutputs#read()
3.60% 1374M org.apache.lucene.util.BytesRef#<init>()
3.56% 1357M org.apache.lucene.codecs.lucene95.OffHeapFloatVectorValues#<init>()
3.52% 1345M org.apache.lucene.util.ArrayUtil#growExact()
2.65% 1013M org.apache.lucene.search.TopKnnCollector#topDocs()
2.50% 956M java.util.concurrent.locks.AbstractQueuedSynchronizer#tryInitializeHead()
2.34% 893M org.apache.lucene.util.SparseFixedBitSet#insertBlock()
1.98% 755M org.apache.lucene.util.LongHeap#<init>()
1.51% 578M java.util.logging.LogManager#reset()
1.51% 578M java.util.concurrent.FutureTask#runAndReset()
1.51% 578M jdk.jfr.internal.ShutdownHook#run()
1.21% 463M jdk.internal.foreign.MappedMemorySegmentImpl#dup()
0.90% 343M java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject#newConditionNode()
0.83% 315M org.apache.lucene.util.SparseFixedBitSet#<init>()
0.60% 229M org.apache.lucene.util.hnsw.FloatHeap#<init>()
0.52% 200M org.apache.lucene.util.hnsw.FloatHeap#getHeap()
0.45% 171M jdk.internal.misc.Unsafe#allocateUninitializedArray()
0.45% 171M org.apache.lucene.util.packed.DirectMonotonicReader#getInstance()
0.45% 171M org.apache.lucene.store.DataInput#readString()
0.22% 85M org.apache.lucene.search.knn.TopKnnCollectorManager#newCollector()
0.22% 85M org.apache.lucene.search.knn.MultiLeafKnnCollector#<init>()
0.15% 57M org.apache.lucene.store.MemorySegmentIndexInput#buildSlice()
0.08% 28M perf.TaskParser$TaskBuilder#parseVectorQuery()
0.07% 28M java.util.regex.Pattern#matcher()
0.07% 28M org.apache.lucene.search.TaskExecutor$TaskGroup#createTask()
Why are we reading individual bytes so intensively? And why is lock acquisition the top HEAP object creator!?
Here's the perf.py I ran (just A/A):
```python
import sys
sys.path.insert(0, '/l/util/src/python')

import competition

if __name__ == '__main__':
    sourceData = competition.sourceData('wikimediumall')
    sourceData.tasksFile = '/l/util/just-vector-search.tasks'
    comp = competition.Competition(taskRepeatCount=200)
    #comp.addTaskPattern('HighTerm$')
    checkout = 'trunk'
    index = comp.newIndex(checkout, sourceData, numThreads=36, addDVFields=True,
                          grouping=False, useCMS=True,
                          #javaCommand='/opt/jdk-18-ea-28/bin/java --add-modules jdk.incubator.foreign -Xmx32g -Xms32g -server -XX:+UseParallelGC -Djava.io.tmpdir=/l/tmp',
                          ramBufferMB=256,
                          analyzer = 'StandardAnalyzerNoStopWords',
                          vectorFile = '/lucenedata/enwiki/cohere-wikipedia-768.vec',
                          vectorDimension = 768,
                          hnswThreadsPerMerge = 4,
                          hnswThreadPoolCount = 16,
                          vectorEncoding = 'FLOAT32',
                          verbose = True,
                          name = 'mikes-vector-test',
                          facets = (('taxonomy:Date', 'Date'),
                                    ('taxonomy:Month', 'Month'),
                                    ('taxonomy:DayOfYear', 'DayOfYear'),
                                    ('taxonomy:RandomLabel.taxonomy', 'RandomLabel'),
                                    ('sortedset:Date', 'Date'),
                                    ('sortedset:Month', 'Month'),
                                    ('sortedset:DayOfYear', 'DayOfYear'),
                                    ('sortedset:RandomLabel.sortedset', 'RandomLabel')))
    comp.competitor('base', checkout, index=index, vectorFileName='/lucenedata/enwiki/cohere-wikipedia-queries-768.vec', vectorDimension=768,
                    #javacCommand='/opt/jdk-18-ea-28/bin/javac',
                    #javaCommand='/opt/jdk-18-ea-28/bin/java --add-modules jdk.incubator.foreign -Xmx32g -Xms32g -server -XX:+UseParallelGC -Djava.io.tmpdir=/l/tmp')
                    )
    comp.competitor('comp', checkout, index=index, vectorFileName='/lucenedata/enwiki/cohere-wikipedia-queries-768.vec', vectorDimension=768,
                    #javacCommand='/opt/jdk-18-ea-28/bin/javac',
                    #javaCommand='/opt/jdk-18-ea-28/bin/java --add-modules jdk.incubator.foreign -Xmx32g -Xms32g -server -XX:+UseParallelGC -Djava.io.tmpdir=/l/tmp')
                    )
    comp.benchmark('atoa')
```
More thread context for the CPU profiling:
PROFILE SUMMARY from 10264 events (total: 10264)
tests.profile.mode=cpu
tests.profile.count=50
tests.profile.stacksize=8
tests.profile.linenumbers=false
PERCENT CPU SAMPLES STACK
12.59% 1292 jdk.incubator.vector.FloatVector#reduceLanesTemplate()
at jdk.incubator.vector.Float256Vector#reduceLanes()
at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#dotProductBody()
at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#dotProduct()
at org.apache.lucene.util.VectorUtil#dotProduct()
at org.apache.lucene.index.VectorSimilarityFunction$2#compare()
at org.apache.lucene.codecs.hnsw.DefaultFlatVectorScorer$FloatVectorScorer#score()
at org.apache.lucene.util.hnsw.HnswGraphSearcher#searchLevel()
7.16% 735 org.apache.lucene.store.DataInput#readVInt()
at org.apache.lucene.store.MemorySegmentIndexInput#readVInt()
at org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsReader$OffHeapHnswGraph#seek()
at org.apache.lucene.util.hnsw.HnswGraphSearcher#graphSeek()
at org.apache.lucene.util.hnsw.HnswGraphSearcher#searchLevel()
at org.apache.lucene.util.hnsw.HnswGraphSearcher#search()
at org.apache.lucene.util.hnsw.HnswGraphSearcher#search()
at org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsReader#search()
4.25% 436 jdk.incubator.vector.FloatVector#lanewiseTemplate()
at jdk.incubator.vector.Float256Vector#lanewise()
at jdk.incubator.vector.Float256Vector#lanewise()
at jdk.incubator.vector.FloatVector#fma()
at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#fma()
at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#dotProductBody()
at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#dotProduct()
at org.apache.lucene.util.VectorUtil#dotProduct()
4.21% 432 jdk.internal.misc.ScopedMemoryAccess#getByteInternal()
at jdk.internal.misc.ScopedMemoryAccess#getByte()
at java.lang.invoke.VarHandleSegmentAsBytes#get()
at java.lang.invoke.VarHandleGuards#guard_LJ_I()
at java.lang.foreign.MemorySegment#get()
at org.apache.lucene.store.MemorySegmentIndexInput#readByte()
at org.apache.lucene.store.DataInput#readVInt()
at org.apache.lucene.store.MemorySegmentIndexInput#readVInt()
4.08% 419 org.apache.lucene.util.SparseFixedBitSet#insertLong()
at org.apache.lucene.util.SparseFixedBitSet#getAndSet()
at org.apache.lucene.util.hnsw.HnswGraphSearcher#searchLevel()
at org.apache.lucene.util.hnsw.HnswGraphSearcher#search()
at org.apache.lucene.util.hnsw.HnswGraphSearcher#search()
at org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsReader#search()
at org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat$FieldsReader#search()
at org.apache.lucene.index.CodecReader#searchNearestVectors()
2.84% 292 org.apache.lucene.util.LongHeap#downHeap()
at org.apache.lucene.util.LongHeap#pop()
at org.apache.lucene.util.hnsw.NeighborQueue#pop()
at org.apache.lucene.search.TopKnnCollector#topDocs()
at org.apache.lucene.search.knn.MultiLeafKnnCollector#topDocs()
at org.apache.lucene.search.KnnFloatVectorQuery#approximateSearch()
at org.apache.lucene.search.AbstractKnnVectorQuery#getLeafResults()
at org.apache.lucene.search.AbstractKnnVectorQuery#searchLeaf()
2.74% 281 org.apache.lucene.index.VectorSimilarityFunction$2#compare()
at org.apache.lucene.codecs.hnsw.DefaultFlatVectorScorer$FloatVectorScorer#score()
at org.apache.lucene.util.hnsw.HnswGraphSearcher#searchLevel()
at org.apache.lucene.util.hnsw.HnswGraphSearcher#search()
at org.apache.lucene.util.hnsw.HnswGraphSearcher#search()
at org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsReader#search()
at org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat$FieldsReader#search()
at org.apache.lucene.index.CodecReader#searchNearestVectors()
2.58% 265 org.apache.lucene.util.compress.LZ4#decompress()
at org.apache.lucene.codecs.lucene90.LZ4WithPresetDictCompressionMode$LZ4WithPresetDictDecompressor#decompress()
at org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader$BlockState#document()
at org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader#serializedDocument()
at org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader#document()
at org.apache.lucene.index.CodecReader$1#document()
at org.apache.lucene.index.BaseCompositeReader$2#document()
at org.apache.lucene.index.StoredFields#document()
2.48% 255 org.apache.lucene.util.SparseFixedBitSet#getAndSet()
at org.apache.lucene.util.hnsw.HnswGraphSearcher#searchLevel()
at org.apache.lucene.util.hnsw.HnswGraphSearcher#search()
at org.apache.lucene.util.hnsw.HnswGraphSearcher#search()
at org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsReader#search()
at org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat$FieldsReader#search()
at org.apache.lucene.index.CodecReader#searchNearestVectors()
at org.apache.lucene.search.KnnFloatVectorQuery#approximateSearch()
Curious that readVInt, when seeking to load a vector (?), is the 2nd hotspot?
VInts are used to encode the HNSW graph, so it looks like decoding the graph is where that is happening (viz. the org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsReader$OffHeapHnswGraph#seek() frames above).
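That would also explain ScopedMemoryAccess#getByteInternal being the top CPU hotspot: a VInt packs 7 payload bits per byte, with the high bit marking continuation, so decoding each one is a short loop of single-byte reads. Roughly:

```python
def read_vint(read_byte):
    # read_byte() returns the next byte from the input as an int
    b = read_byte()
    value = b & 0x7F
    shift = 7
    while b & 0x80:      # high bit set -> another byte follows
        b = read_byte()
        value |= (b & 0x7F) << shift
        shift += 7
    return value
```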
I might be missing it, but where is the similarity defined for using the Cohere vectors? They are designed for max inner product; if we use euclidean, I would expect graph building and indexing to be poor, as we might get stuck in local minima.
The benchmark tools are hard-coded to use DOT_PRODUCT; see https://github.com/mikemccand/luceneutil/blob/main/src/main/perf/LineFileDocs.java#L454
Maybe this is why we get such poor results w/Cohere?
@msokolov using dot_product likely doesn't work with the 768-dim Cohere vectors unless they are manually normalized. If these things aren't normalized, we will be getting some wacky scores, and we likely lose a bunch of information by snapping to be greater than 0.
I could maybe see cosine working.
But I would suggest we switch to max-inner-product for Cohere 768 for a true test with those vectors as they were designed to be used.
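(For reference, if we wanted dot_product to stay meaningful here, the doc and query vectors would need explicit L2 normalization up front, after which dot product and cosine agree; e.g. something like:)

```python
import numpy as np

def l2_normalize(embs):
    # scale each row to unit length so dot product == cosine similarity
    norms = np.linalg.norm(embs, axis=1, keepdims=True)
    return embs / np.maximum(norms, 1e-12)  # guard against zero vectors
```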
I ran a test comparing mip and angular over Cohere Wikipedia vectors (what KnnGraphTester calls MAXIMUM_INNER_PRODUCT and DOT_PRODUCT) and the results were surprising:
mainline, Cohere, angular

| recall | latency (ms) | nDoc | topK | fanout | maxConn | beamWidth | quantized | index s | force merge s | num segments | index size (MB) |
|--------|--------------|------|------|--------|---------|-----------|-----------|---------|---------------|--------------|-----------------|
| 0.631 | 0.496 | 1500000 | 10 | 6 | 32 | 50 | no | 330.94 | 213.55 | 1 | 4436.82 |
| 0.617 | 0.439 | 1500000 | 10 | 6 | 32 | 50 | 7 bits | 352.52 | 217.64 | 1 | 5543.35 |
| 0.408 | 0.422 | 1500000 | 10 | 6 | 32 | 50 | 4 bits | 340.32 | 151.22 | 1 | 5544.56 |

mainline, Cohere, mip

| recall | latency (ms) | nDoc | topK | fanout | maxConn | beamWidth | quantized | index s | force merge s | num segments | index size (MB) |
|--------|--------------|------|------|--------|---------|-----------|-----------|---------|---------------|--------------|-----------------|
| 0.593 | 0.475 | 1500000 | 10 | 6 | 32 | 50 | no | 325.19 | 210.78 | 1 | 4436.81 |
| 0.601 | 0.454 | 1500000 | 10 | 6 | 32 | 50 | 7 bits | 346.48 | 218.88 | 1 | 5543.35 |
| 0.405 | 0.307 | 1500000 | 10 | 6 | 32 | 50 | 4 bits | 345.31 | 144.83 | 1 | 5544.56 |
I think we pretty much use Cohere embeddings for most knn benchmarks, including nightlies? Should we change the default similarity to mip in KnnGraphTester?