
Switch nightly benchy to more realistic `Cohere/wikipedia-22-12-en-embeddings` vectors

Open mikemccand opened this issue 1 year ago • 23 comments

#255 added realistic Cohere/wikipedia-22-12-en-embeddings 768-dim vectors to luceneutil -- let's switch the nightlies over to use these vectors instead.

mikemccand avatar Mar 05 '24 17:03 mikemccand

I attempted to follow the README instructions to generate nightly benchy vectors, using this command:

python3 -u src/python/infer_token_vectors_cohere.py ../data/cohere-wikipedia-768.vec 27625038 ../data/cohere-wikipedia-queries-768.vec 10000

(Note that the nightly benchy only does indexing, so I really only need the first file)

But this apparently consumes gobs of RAM, and the Linux OOM killer killed it!

Is this expected? I can run this on a beefier machine if need be (current machine has "only" 256 GB and no swap) for this one-time generation of vectors ...

Maybe datasets.load_dataset can load just the N vectors I need, not everything in the train split?
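Something like one of these might work (an untested sketch -- both split slicing and streaming=True are datasets features, but I haven't tried them against this dataset):

import datasets

# Option 1: split slicing -- ask the library for only the first N rows:
ds = datasets.load_dataset("Cohere/wikipedia-22-12-en-embeddings",
                           split="train[:27625038]")

# Option 2: streaming -- an IterableDataset that yields rows lazily
# instead of caching the whole split locally first:
ds_stream = datasets.load_dataset("Cohere/wikipedia-22-12-en-embeddings",
                                  split="train", streaming=True)
for i, row in enumerate(ds_stream):
    if i >= 3:
        break
    print(len(row["emb"]))  # 768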

mikemccand avatar Mar 10 '24 20:03 mikemccand

Oooh, this load_dataset method takes a keep_in_memory parameter! I'll poke around.

mikemccand avatar Mar 10 '24 21:03 mikemccand

OK, well, that keep_in_memory=False parameter seemed to do nothing -- still OOM-killed at 256 GB RAM.

With this change, chunking into 1M-vector blocks when writing the index-time inferred vectors, I was able to run the tool!
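In sketch form, the change is essentially this loop (it also appears as context in the diffs further down):

import datasets
import numpy as np

ds = datasets.load_dataset("Cohere/wikipedia-22-12-en-embeddings",
                           split="train")

filename = "../data/cohere-wikipedia-768.vec"
num_docs = 27625038
window_num_docs = 1000000  # 1M-vector windows bound peak RAM

doc_upto = 0
while doc_upto < num_docs:
    next_doc_upto = min(doc_upto + window_num_docs, num_docs)
    # materialize only this window as a numpy array, append it, drop it
    embs = np.array(ds[doc_upto:next_doc_upto]["emb"])
    print(f"saving docs[{doc_upto}:{next_doc_upto} of shape: {embs.shape} to file")
    with open(filename, "ab") as out_f:
        embs.tofile(out_f)
    doc_upto = next_doc_upto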

Full output below:

beast3:util.nightly[master]$ python3 -u src/python/infer_token_vectors_cohere.py ../data/cohere-wikipedia-768.vec 27625038 ../data/cohere-wikipedia-queries-768.vec 10000
Resolving data files: 100%|████████████████████████████████████████████| 253/253 [00:01<00:00, 250.70it/s]
Loading dataset shards: 100%|█████████████████████████████████████████| 252/252 [00:00<00:00, 1121.66it/s]
total number of rows: 35167920
embeddings dims: 768
saving docs[0:1000000 of shape: (1000000, 768) to file
saving docs[1000000:2000000 of shape: (1000000, 768) to file
saving docs[2000000:3000000 of shape: (1000000, 768) to file
saving docs[3000000:4000000 of shape: (1000000, 768) to file
saving docs[4000000:5000000 of shape: (1000000, 768) to file
saving docs[5000000:6000000 of shape: (1000000, 768) to file
saving docs[6000000:7000000 of shape: (1000000, 768) to file
saving docs[7000000:8000000 of shape: (1000000, 768) to file
saving docs[8000000:9000000 of shape: (1000000, 768) to file
saving docs[9000000:10000000 of shape: (1000000, 768) to file
saving docs[10000000:11000000 of shape: (1000000, 768) to file
saving docs[11000000:12000000 of shape: (1000000, 768) to file
saving docs[12000000:13000000 of shape: (1000000, 768) to file
saving docs[13000000:14000000 of shape: (1000000, 768) to file
saving docs[14000000:15000000 of shape: (1000000, 768) to file
saving docs[15000000:16000000 of shape: (1000000, 768) to file
saving docs[16000000:17000000 of shape: (1000000, 768) to file
saving docs[17000000:18000000 of shape: (1000000, 768) to file
saving docs[18000000:19000000 of shape: (1000000, 768) to file
saving docs[19000000:20000000 of shape: (1000000, 768) to file
saving docs[20000000:21000000 of shape: (1000000, 768) to file
saving docs[21000000:22000000 of shape: (1000000, 768) to file
saving docs[22000000:23000000 of shape: (1000000, 768) to file
saving docs[23000000:24000000 of shape: (1000000, 768) to file
saving docs[24000000:25000000 of shape: (1000000, 768) to file
saving docs[25000000:26000000 of shape: (1000000, 768) to file
saving docs[26000000:27000000 of shape: (1000000, 768) to file
saving docs[27000000:27625038 of shape: (625038, 768) to file
saving queries of shape: (10000, 768) to file
reading docs of shape: (27625038, 768)
reading queries shape: (10000, 768)

It produced a large .vec file:

beast3:util.nightly[master]$ ls -lh ../data/cohere-wikipedia-768.vec
-rw-r--r-- 1 mike mike 159G Mar 10 22:13 ../data/cohere-wikipedia-768.vec

Next I'll try switching to this source for nightly benchy. I'll also publish this on home.apache.org.

mikemccand avatar Mar 11 '24 13:03 mikemccand

Hmm, except, that file is too large?

beast3:util.nightly[master]$ python3
Python 3.11.7 (main, Jan 29 2024, 16:03:57) [GCC 13.2.1 20230801] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 27000000 * 768 * 4 / 1024 / 1024 / 1024
77.24761962890625

It's 159 GB but should be ~77 GB?

Maybe my "chunking" is buggy :)

mikemccand avatar Mar 11 '24 13:03 mikemccand

OK, I think these are float64-typed vectors, in which case the file size makes sense. But I think nightly benchy wants float32?
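Redoing the arithmetic at 8 bytes per dimension, over the full 27,625,038 vectors, matches the observed size (ls -lh rounds up to 159G):

>>> 27625038 * 768 * 8 / 1024 / 1024 / 1024
158.07173538208008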

mikemccand avatar Mar 11 '24 19:03 mikemccand

And I think knnPerfTest.py/KnnGraphTester.java also want float32? I'm confused about how they are working on the generated file now ...

mikemccand avatar Mar 11 '24 23:03 mikemccand

Oooh this Dataset.cast method looks promising! I'll explore...

mikemccand avatar Mar 28 '24 12:03 mikemccand

OK, I made this change and kicked off infer_token_vectors_cohere.py again, and it at least looks to be running ...:

diff --git a/src/python/infer_token_vectors_cohere.py b/src/python/infer_token_vectors_cohere.py
index 5c350df..5027eb2 100644
--- a/src/python/infer_token_vectors_cohere.py
+++ b/src/python/infer_token_vectors_cohere.py
@@ -28,11 +28,19 @@ for name in (filename, filename_queries):

 ds = datasets.load_dataset("Cohere/wikipedia-22-12-en-embeddings",
                            split="train")
+print(f'features: {ds.features}')
 print(f"total number of rows: {len(ds)}")
 print(f"embeddings dims: {len(ds[0]['emb'])}")

 # ds = ds[:num_docs]

+# we just want the vector embeddings:
+for feature_name in ds.features.keys():
+  if feature_name != 'emb':
+    ds = ds.remove_columns(feature_name)
+
+ds = ds.cast(datasets.Features({'emb': datasets.Sequence(feature=datasets.Value("float32"))}))
+
 # do this in windows, else the RAM usage is crazy (OOME even with 256
 # GB RAM since I think this step makes 2X copy of the dataset?)
 doc_upto = 0

mikemccand avatar Mar 28 '24 14:03 mikemccand

OK hmm scratch that, I see from the already loaded features that Dataset thinks these emb vectors are already float32:

features: {'id': Value(dtype='int32', id=None), 'title': Value(dtype='string', id=None), 'text': Value(dtype='string', id=None), 'url': Value(dtype='string', id=None), 'wiki_id': Value(dtype='int32', id=None), 'views': Value(dtype='float32', id=None), 'paragraph_id': Value(dtype='int32', id=None), 'langs': Value(dtype='int32', id=None), 'emb': Sequence(feature=Value(dtype='float32', id=None), length=-1, id=None)}

mikemccand avatar Mar 28 '24 14:03 mikemccand

OK! Now I think the issue is in np.array -- I think we have to give it an explicit dtype, else it seems to cast the Dataset's float32 up to float64.
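A quick interpreter check confirms numpy's default: converting Python lists of floats yields float64 unless a dtype is passed:

>>> import numpy as np
>>> np.array([[0.1, 0.2], [0.3, 0.4]]).dtype
dtype('float64')
>>> np.array([[0.1, 0.2], [0.3, 0.4]], dtype=np.single).dtype
dtype('float32')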

So, now I'm testing this:

diff --git a/src/python/infer_token_vectors_cohere.py b/src/python/infer_token_vectors_cohere.py
index 5c350df..4cc305e 100644
--- a/src/python/infer_token_vectors_cohere.py
+++ b/src/python/infer_token_vectors_cohere.py
@@ -28,11 +28,20 @@ for name in (filename, filename_queries):

 ds = datasets.load_dataset("Cohere/wikipedia-22-12-en-embeddings",
                            split="train")
+print(f'features: {ds.features}')
 print(f"total number of rows: {len(ds)}")
 print(f"embeddings dims: {len(ds[0]['emb'])}")

 # ds = ds[:num_docs]

+if False:
+  # we just want the vector embeddings:
+  for feature_name in ds.features.keys():
+    if feature_name != 'emb':
+      ds = ds.remove_columns(feature_name)
+
+  ds = ds.cast(datasets.Features({'emb': datasets.Sequence(feature=datasets.Value("float32"))}))
+
 # do this in windows, else the RAM usage is crazy (OOME even with 256
 # GB RAM since I think this step makes 2X copy of the dataset?)
 doc_upto = 0
@@ -40,7 +49,7 @@ window_num_docs = 1000000
 while doc_upto < num_docs:
   next_doc_upto = min(doc_upto + window_num_docs, num_docs)
   ds_embs = ds[doc_upto:next_doc_upto]['emb']
-  embs = np.array(ds_embs)
+  embs = np.array(ds_embs, dtype=np.single)
   print(f"saving docs[{doc_upto}:{next_doc_upto} of shape: {embs.shape} to file")
   with open(filename, "ab") as out_f:
       embs.tofile(out_f)

mikemccand avatar Mar 28 '24 14:03 mikemccand

OK, the above change seems to have worked (I just pushed it)! I now see these vector files:

-rw-r--r-- 1 mike mike  80G Mar 28 12:57 cohere-wikipedia-768.vec
-rw-r--r-- 1 mike mike 586M Mar 28 12:57 cohere-wikipedia-queries-768.vec

Now I will try to confirm their recall looks sane, and then switch the nightlies over to them.
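(For the sanity check, something along these lines -- a sketch assuming the flat, native little-endian float32 layout these files use:)

import numpy as np

dims = 768
docs = np.memmap("../data/cohere-wikipedia-768.vec", dtype=np.float32,
                 mode="r").reshape(-1, dims)
queries = np.fromfile("../data/cohere-wikipedia-queries-768.vec",
                      dtype=np.float32).reshape(-1, dims)
print(docs.shape, queries.shape)  # expect (27625038, 768) and (10000, 768)

# exact (brute-force) top-10 by dot product for one query, over a 1M-doc
# slice, to compare against HNSW results for a rough recall estimate
scores = docs[:1000000] @ queries[0]
print(np.argsort(-scores)[:10])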

mikemccand avatar Apr 29 '24 11:04 mikemccand

OK I think the next wrinkle here is ... to fix SearchPerfTest to use the pre-computed Cohere query vectors from cohere-wikipedia-queries-768.vec, instead of attempting to do inference based on the lexical tokens of each incoming query. I guess we could just incrementally pull the vectors from the query vectors file and assign them sequentially to each vector query we see? @msokolov does that sound reasonable?
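Roughly what I have in mind, as a hypothetical Python sketch (the real change would be in SearchPerfTest's Java):

import numpy as np

class SequentialVectorSource:
    """Hand out pre-computed query vectors in order, wrapping around,
    instead of inferring a vector from each query's tokens."""

    def __init__(self, path, dims=768):
        self.vectors = np.fromfile(path, dtype=np.float32).reshape(-1, dims)
        self.next_index = 0

    def next_vector(self):
        v = self.vectors[self.next_index % len(self.vectors)]
        self.next_index += 1
        return v

source = SequentialVectorSource("../data/cohere-wikipedia-queries-768.vec")
query_vector = source.next_vector()  # assign to the next vector query task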

mikemccand avatar Apr 29 '24 12:04 mikemccand

I think we can modify VectorDictionary to accept a --no-tokenize option and then look up the vector using the full query text? We would need to generate a text file with the queries, one per line, to correspond with the binary vector file.

msokolov avatar Apr 29 '24 22:04 msokolov

Otherwise you could simply select some random vector every time you see a vector-type query task? But I would expect some vectors to behave differently from others? Not sure

msokolov avatar Apr 30 '24 18:04 msokolov

I was finally able to index/search using these Cohere vectors, and the profiler output is sort of strange:

This is CPU:

PROFILE SUMMARY from 44698 events (total: 44698)
  tests.profile.mode=cpu
  tests.profile.count=30
  tests.profile.stacksize=1
  tests.profile.linenumbers=false
PERCENT       CPU SAMPLES   STACK
10.98%        4907          jdk.internal.misc.ScopedMemoryAccess#getByteInternal()
6.05%         2702          jdk.incubator.vector.FloatVector#reduceLanesTemplate()
4.06%         1813          org.apache.lucene.store.MemorySegmentIndexInput#readByte()
3.63%         1622          perf.PKLookupTask#go()
2.89%         1292          org.apache.lucene.store.DataInput#readVInt()
2.84%         1269          org.apache.lucene.util.hnsw.HnswGraphSearcher#searchLevel()
2.58%         1153          org.apache.lucene.util.fst.FST#findTargetArc()
2.45%         1093          jdk.incubator.vector.FloatVector#fromArray0Template()
2.28%         1018          org.apache.lucene.util.LongHeap#downHeap()
2.25%         1007          org.apache.lucene.util.SparseFixedBitSet#insertLong()
2.06%         921           org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum#seekExact()
2.05%         916           jdk.internal.foreign.AbstractMemorySegmentImpl#checkBounds()
1.96%         875           jdk.incubator.vector.FloatVector#lanewiseTemplate()
1.91%         852           jdk.internal.util.ArraysSupport#mismatch()
1.40%         627           org.apache.lucene.util.compress.LZ4#decompress()
1.31%         586           jdk.internal.foreign.MemorySessionImpl#checkValidStateRaw()
1.18%         526           org.apache.lucene.util.SparseFixedBitSet#getAndSet()
1.15%         516           org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsReader$OffHeapHnswGraph#seek()
0.99%         444           org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader$BlockState#doReset()
0.99%         441           org.apache.lucene.util.BytesRef#compareTo()
0.94%         418           org.apache.lucene.util.fst.FST#readArcByDirectAddressing()
0.92%         413           org.apache.lucene.search.TopKnnCollector#topDocs()
0.91%         408           org.apache.lucene.index.VectorSimilarityFunction$2#compare()
0.90%         401           org.apache.lucene.codecs.hnsw.DefaultFlatVectorScorer$FloatVectorScorer#score()
0.80%         359           org.apache.lucene.index.SegmentInfo#maxDoc()
0.79%         353           java.util.Arrays#fill()
0.75%         334           org.apache.lucene.codecs.lucene99.Lucene99PostingsReader#decodeTerm()
0.73%         326           java.util.Arrays#compareUnsigned()
0.72%         324           org.apache.lucene.search.ReferenceManager#acquire()
0.71%         317           org.apache.lucene.store.DataInput#readVLong()

and this is HEAP:

PROFILE SUMMARY from 748 events (total: 38182M)
  tests.profile.mode=heap
  tests.profile.count=30
  tests.profile.stacksize=1
  tests.profile.linenumbers=false
PERCENT       HEAP SAMPLES  STACK
21.88%        8355M         java.util.concurrent.locks.AbstractQueuedSynchronizer#acquire()
13.50%        5154M         org.apache.lucene.util.ArrayUtil#growNoCopy()
9.31%         3556M         org.apache.lucene.util.SparseFixedBitSet#insertLong()
9.00%         3436M         perf.StatisticsHelper#startStatistics()
9.00%         3436M         java.util.ArrayList#iterator()
5.76%         2199M         org.apache.lucene.util.fst.ByteSequenceOutputs#read()
3.60%         1374M         org.apache.lucene.util.BytesRef#<init>()
3.56%         1357M         org.apache.lucene.codecs.lucene95.OffHeapFloatVectorValues#<init>()
3.52%         1345M         org.apache.lucene.util.ArrayUtil#growExact()
2.65%         1013M         org.apache.lucene.search.TopKnnCollector#topDocs()
2.50%         956M          java.util.concurrent.locks.AbstractQueuedSynchronizer#tryInitializeHead()
2.34%         893M          org.apache.lucene.util.SparseFixedBitSet#insertBlock()
1.98%         755M          org.apache.lucene.util.LongHeap#<init>()
1.51%         578M          java.util.logging.LogManager#reset()
1.51%         578M          java.util.concurrent.FutureTask#runAndReset()
1.51%         578M          jdk.jfr.internal.ShutdownHook#run()
1.21%         463M          jdk.internal.foreign.MappedMemorySegmentImpl#dup()
0.90%         343M          java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject#newConditionNode()
0.83%         315M          org.apache.lucene.util.SparseFixedBitSet#<init>()
0.60%         229M          org.apache.lucene.util.hnsw.FloatHeap#<init>()
0.52%         200M          org.apache.lucene.util.hnsw.FloatHeap#getHeap()
0.45%         171M          jdk.internal.misc.Unsafe#allocateUninitializedArray()
0.45%         171M          org.apache.lucene.util.packed.DirectMonotonicReader#getInstance()
0.45%         171M          org.apache.lucene.store.DataInput#readString()
0.22%         85M           org.apache.lucene.search.knn.TopKnnCollectorManager#newCollector()
0.22%         85M           org.apache.lucene.search.knn.MultiLeafKnnCollector#<init>()
0.15%         57M           org.apache.lucene.store.MemorySegmentIndexInput#buildSlice()
0.08%         28M           perf.TaskParser$TaskBuilder#parseVectorQuery()
0.07%         28M           java.util.regex.Pattern#matcher()
0.07%         28M           org.apache.lucene.search.TaskExecutor$TaskGroup#createTask()

Why are we reading individual bytes so intensively? And why is lock acquisition the top HEAP object creator!?

mikemccand avatar Jun 10 '24 15:06 mikemccand

Here's the perf.py I ran (just A/A):

import sys
sys.path.insert(0, '/l/util/src/python')

import competition

if __name__ == '__main__':
  sourceData = competition.sourceData('wikimediumall')

  sourceData.tasksFile = '/l/util/just-vector-search.tasks'
  comp = competition.Competition(taskRepeatCount=200)
  #comp.addTaskPattern('HighTerm$')                                                                                                                                                                    

  checkout = 'trunk'

  index = comp.newIndex(checkout, sourceData, numThreads=36, addDVFields=True,
                        grouping=False, useCMS=True,
                        #javaCommand='/opt/jdk-18-ea-28/bin/java --add-modules jdk.incubator.foreign -Xmx32g -Xms32g -server -XX:+UseParallelGC -Djava.io.tmpdir=/l/tmp',                              
                        ramBufferMB=256,
                        analyzer = 'StandardAnalyzerNoStopWords',
                        vectorFile = '/lucenedata/enwiki/cohere-wikipedia-768.vec',
                        vectorDimension = 768,
                        hnswThreadsPerMerge = 4,
                        hnswThreadPoolCount = 16,
                        vectorEncoding = 'FLOAT32',
                        verbose = True,
                        name = 'mikes-vector-test',
                        facets = (('taxonomy:Date', 'Date'),
                                  ('taxonomy:Month', 'Month'),
                                  ('taxonomy:DayOfYear', 'DayOfYear'),
                                  ('taxonomy:RandomLabel.taxonomy', 'RandomLabel'),
                                  ('sortedset:Date', 'Date'),
                                  ('sortedset:Month', 'Month'),
                                  ('sortedset:DayOfYear', 'DayOfYear'),
                                  ('sortedset:RandomLabel.sortedset', 'RandomLabel')))

  comp.competitor('base', checkout, index=index, vectorFileName='/lucenedata/enwiki/cohere-wikipedia-queries-768.vec', vectorDimension=768,
                  #javacCommand='/opt/jdk-18-ea-28/bin/javac',                                                                                                                                         
                  #javaCommand='/opt/jdk-18-ea-28/bin/java --add-modules jdk.incubator.foreign -Xmx32g -Xms32g -server -XX:+UseParallelGC -Djava.io.tmpdir=/l/tmp')                                    
                  )
  comp.competitor('comp', checkout, index=index, vectorFileName='/lucenedata/enwiki/cohere-wikipedia-queries-768.vec', vectorDimension=768,
                  #javacCommand='/opt/jdk-18-ea-28/bin/javac',                                                                                                                                         
                  #javaCommand='/opt/jdk-18-ea-28/bin/java --add-modules jdk.incubator.foreign -Xmx32g -Xms32g -server -XX:+UseParallelGC -Djava.io.tmpdir=/l/tmp')                                    
                  )
  comp.benchmark('atoa')

mikemccand avatar Jun 10 '24 15:06 mikemccand

More thread context for the CPU profiling:

PROFILE SUMMARY from 10264 events (total: 10264)
  tests.profile.mode=cpu
  tests.profile.count=50
  tests.profile.stacksize=8
  tests.profile.linenumbers=false
PERCENT       CPU SAMPLES   STACK
12.59%        1292          jdk.incubator.vector.FloatVector#reduceLanesTemplate()
                              at jdk.incubator.vector.Float256Vector#reduceLanes()
                              at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#dotProductBody()
                              at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#dotProduct()
                              at org.apache.lucene.util.VectorUtil#dotProduct()
                              at org.apache.lucene.index.VectorSimilarityFunction$2#compare()
                              at org.apache.lucene.codecs.hnsw.DefaultFlatVectorScorer$FloatVectorScorer#score()
                              at org.apache.lucene.util.hnsw.HnswGraphSearcher#searchLevel()
7.16%         735           org.apache.lucene.store.DataInput#readVInt()
                              at org.apache.lucene.store.MemorySegmentIndexInput#readVInt()
                              at org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsReader$OffHeapHnswGraph#seek()
                              at org.apache.lucene.util.hnsw.HnswGraphSearcher#graphSeek()
                              at org.apache.lucene.util.hnsw.HnswGraphSearcher#searchLevel()
                              at org.apache.lucene.util.hnsw.HnswGraphSearcher#search()
                              at org.apache.lucene.util.hnsw.HnswGraphSearcher#search()
                              at org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsReader#search()
4.25%         436           jdk.incubator.vector.FloatVector#lanewiseTemplate()
                              at jdk.incubator.vector.Float256Vector#lanewise()
                              at jdk.incubator.vector.Float256Vector#lanewise()
                              at jdk.incubator.vector.FloatVector#fma()
                              at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#fma()
                              at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#dotProductBody()
                              at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#dotProduct()
                              at org.apache.lucene.util.VectorUtil#dotProduct()
4.21%         432           jdk.internal.misc.ScopedMemoryAccess#getByteInternal()
                              at jdk.internal.misc.ScopedMemoryAccess#getByte()
                              at java.lang.invoke.VarHandleSegmentAsBytes#get()
                              at java.lang.invoke.VarHandleGuards#guard_LJ_I()
                              at java.lang.foreign.MemorySegment#get()
                              at org.apache.lucene.store.MemorySegmentIndexInput#readByte()
                              at org.apache.lucene.store.DataInput#readVInt()
                              at org.apache.lucene.store.MemorySegmentIndexInput#readVInt()
4.08%         419           org.apache.lucene.util.SparseFixedBitSet#insertLong()
                              at org.apache.lucene.util.SparseFixedBitSet#getAndSet()
                              at org.apache.lucene.util.hnsw.HnswGraphSearcher#searchLevel()
                              at org.apache.lucene.util.hnsw.HnswGraphSearcher#search()
                              at org.apache.lucene.util.hnsw.HnswGraphSearcher#search()
                              at org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsReader#search()
                              at org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat$FieldsReader#search()
                              at org.apache.lucene.index.CodecReader#searchNearestVectors()
2.84%         292           org.apache.lucene.util.LongHeap#downHeap()
                              at org.apache.lucene.util.LongHeap#pop()
                              at org.apache.lucene.util.hnsw.NeighborQueue#pop()
                              at org.apache.lucene.search.TopKnnCollector#topDocs()
                              at org.apache.lucene.search.knn.MultiLeafKnnCollector#topDocs()
                              at org.apache.lucene.search.KnnFloatVectorQuery#approximateSearch()
                              at org.apache.lucene.search.AbstractKnnVectorQuery#getLeafResults()
                              at org.apache.lucene.search.AbstractKnnVectorQuery#searchLeaf()
2.74%         281           org.apache.lucene.index.VectorSimilarityFunction$2#compare()
                              at org.apache.lucene.codecs.hnsw.DefaultFlatVectorScorer$FloatVectorScorer#score()
                              at org.apache.lucene.util.hnsw.HnswGraphSearcher#searchLevel()
                              at org.apache.lucene.util.hnsw.HnswGraphSearcher#search()
                              at org.apache.lucene.util.hnsw.HnswGraphSearcher#search()
                              at org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsReader#search()
                              at org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat$FieldsReader#search()
                              at org.apache.lucene.index.CodecReader#searchNearestVectors()
2.58%         265           org.apache.lucene.util.compress.LZ4#decompress()
                              at org.apache.lucene.codecs.lucene90.LZ4WithPresetDictCompressionMode$LZ4WithPresetDictDecompressor#decompress()
                              at org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader$BlockState#document()
                              at org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader#serializedDocument()
                              at org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader#document()
                              at org.apache.lucene.index.CodecReader$1#document()
                              at org.apache.lucene.index.BaseCompositeReader$2#document()
                              at org.apache.lucene.index.StoredFields#document()
2.48%         255           org.apache.lucene.util.SparseFixedBitSet#getAndSet()
                              at org.apache.lucene.util.hnsw.HnswGraphSearcher#searchLevel()
                              at org.apache.lucene.util.hnsw.HnswGraphSearcher#search()
                              at org.apache.lucene.util.hnsw.HnswGraphSearcher#search()
                              at org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsReader#search()
                              at org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat$FieldsReader#search()
                              at org.apache.lucene.index.CodecReader#searchNearestVectors()
                              at org.apache.lucene.search.KnnFloatVectorQuery#approximateSearch()

Curious that readVInt, when seeking to load a vector (?), is the 2nd hotspot?

mikemccand avatar Jun 10 '24 18:06 mikemccand

VInts are used to encode the HNSW graph, so it looks like decoding the graph is where that is happening (via `at org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsReader$OffHeapHnswGraph#seek()`).
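For context, a small Python sketch of this style of VInt decoding (7 data bits per byte, high bit as a continuation flag); it shows why each readVInt turns into a sequence of readByte calls -- the per-byte MemorySegment reads in the profile above:

def read_vint(read_byte):
    """Decode one little-endian base-128 VInt: 7 data bits per byte,
    high bit set means another byte follows."""
    b = read_byte()
    value = b & 0x7F
    shift = 7
    while b & 0x80:
        b = read_byte()
        value |= (b & 0x7F) << shift
        shift += 7
    return value

data = iter([0xAC, 0x02])             # the two-byte VInt encoding of 300
print(read_vint(lambda: next(data)))  # -> 300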


msokolov avatar Jun 10 '24 18:06 msokolov

I might be missing it, but where is the similarity defined for using the Cohere vectors? They are designed for max inner product; if we use euclidean, I would expect graph building and indexing to be poor, as we might get stuck in local minima.

benwtrent avatar Oct 17 '24 19:10 benwtrent

The benchmark tools are hard-coded to use DOT_PRODUCT; see https://github.com/mikemccand/luceneutil/blob/main/src/main/perf/LineFileDocs.java#L454

Maybe this is why we get such poor results w/Cohere?

msokolov avatar Oct 18 '24 18:10 msokolov

@msokolov using dot_product likely doesn't work with the 768-dim Cohere vectors unless they are manually normalized. If they aren't normalized, we will be getting some wacky scores, and we likely lose a bunch of information by snapping scores to be greater than 0.

I could maybe see cosine working.

But I would suggest we switch to max-inner-product for Cohere 768 for a true test with those vectors as they were designed to be used.
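(A hypothetical way to check whether the stored vectors are unit-length, reading the flat float32 file generated above:)

import numpy as np

dims = 768
docs = np.memmap("/lucenedata/enwiki/cohere-wikipedia-768.vec",
                 dtype=np.float32, mode="r").reshape(-1, dims)

# DOT_PRODUCT behaves like cosine only on unit-length vectors:
norms = np.linalg.norm(docs[:10000], axis=1)
print(f"norms over first 10k docs: min={norms.min():.4f} max={norms.max():.4f}")

# if these are not ~1.0, either normalize before indexing with
# DOT_PRODUCT, or index with MAXIMUM_INNER_PRODUCT instead:
v = np.asarray(docs[0])
v_unit = v / np.linalg.norm(v)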

benwtrent avatar Oct 18 '24 18:10 benwtrent

I ran a test comparing mip and angular over Cohere Wikipedia vectors (what KnnGraphTester calls MAXIMUM_INNER_PRODUCT and DOT_PRODUCT) and the results were surprising:

mainline, Cohere, angular

recall  latency (ms)     nDoc  topK  fanout  maxConn  beamWidth  quantized  index s  force merge s  num segments  index size (MB)
 0.631         0.496  1500000    10       6       32         50         no   330.94         213.55             1          4436.82
 0.617         0.439  1500000    10       6       32         50     7 bits   352.52         217.64             1          5543.35
 0.408         0.422  1500000    10       6       32         50     4 bits   340.32         151.22             1          5544.56

mainline, Cohere, mip

recall  latency (ms)     nDoc  topK  fanout  maxConn  beamWidth  quantized  index s  force merge s  num segments  index size (MB)
 0.593         0.475  1500000    10       6       32         50         no   325.19         210.78             1          4436.81
 0.601         0.454  1500000    10       6       32         50     7 bits   346.48         218.88             1          5543.35
 0.405         0.307  1500000    10       6       32         50     4 bits   345.31         144.83             1          5544.56

msokolov avatar Oct 29 '24 14:10 msokolov

I think we pretty much use Cohere embeddings for most knn benchmarks, including nightlies? Should we change the default similarity to mip in KnnGraphTester?

vigyasharma avatar Aug 23 '25 04:08 vigyasharma