Warnings while creating cosine based index

Open shriyog opened this issue 3 years ago • 5 comments

While building an NGT index using the cosine distance metric, I see a lot of warnings like the one below.

createIndex: Warning. The specified number of edges could not be acquired, because the pruned parameter [-S] might be set.
  The node id=6651608
  The number of edges for the node=7
  The pruned parameter (edgeSizeForSearch [-S])=40

I created the index using these commands, without specifying the -S parameter (the default is 40):

ngt create -d 40 -D c cosine-index
ngt append -d 40 cosine-index vectors.ssv

I find this suspicious because there are differences compared to another index built with the L2 (Euclidean) distance metric from the same input vectors.

  1. Index build time - 4 Mins (cosine) vs 45 Mins (L2)
  2. Epsilon vs Precision (mentioned below)
  3. Index size on disk is the same though

Euclidean
# Factor (Epsilon)      # of Queries    Precision       Time(msec)      # of computations       # of visted nodes
0       100     0.436   0.293037        0       0
0.01    100     0.55    0.0437106       0       0
0.02    100     0.664   0.0645273       0       0
0.03    100     0.802   100.782         0       0
0.04    100     0.889   728.165         0       0
0.05    100     0.932   2077.52         0       0
0.06    100     0.958   3091.21         0       0
0.07    100     0.973   4509.79         0       0
0.08    100     0.985   5053.05         0       0
0.09    100     0.988   5463.39         0       0
0.1     100     0.993   5964.26         0       0

Cosine
# Factor (Epsilon)      # of Queries    Precision       Time(msec)      # of computations       # of visted nodes
0       100     0.256   0.0588535       0       0
0.01    100     0.273   0.033929        0       0
0.02    100     0.278   0.0337207       0       0
0.03    100     0.286   0.0346833       0       0
0.04    100     0.295   0.0367112       0       0
0.05    100     0.318   0.0401136       0       0
0.06    100     0.355   0.0426844       0       0
0.07    100     0.384   0.0472755       0       0
0.08    100     0.394   0.0479118       0       0
0.09    100     0.415   0.0516687       0       0
0.1     100     0.441   0.057455        0       0

The warning seems to originate from here, which makes me think the cosine-based index is not being built properly, hence the impact on accuracy. Any thoughts on this, or is it expected?

shriyog avatar Dec 21 '21 14:12 shriyog

Could you run the command below to get your index's information?

ngt info [your cosine index path]

masajiro avatar Dec 22 '21 06:12 masajiro

I tried to reproduce your problem with the datasets I have, but I could not. Since the problem might depend on the dataset, could you provide yours, if possible?

masajiro avatar Dec 23 '21 22:12 masajiro

Hey @masajiro, thanks for the command. It shows the index metadata in detail, which is quite helpful.

This is the output for the index that was created with the above-mentioned warnings.

> ngt info catalog-mod-0-cosine/
NGT version: 1.13.7
Processed 1000000
Processed 2000000
Processed 3000000
Processed 4000000
Processed 5000000
Processed 6000000
The size of the object repository (not the number of the objects):	6652051
The number of the removed objects:	0/6652051
The number of the nodes:	6652051
The number of the edges:	130936766
The mean of the edge lengths:	-nan
The mean of the number of the edges per node:	19.68366839
The number of the nodes without edges:	0
The maximum of the outdegrees:	139690
The minimum of the outdegrees:	10
The number of the nodes where indegree is 0:	0
The maximum of the indegrees:	139690
The minimum of the indegrees:	10
#-nodes,#-edges,#-no-indegree,avg-edges,avg-dist,max-out,min-out,v-out,max-in,min-in,v-in,med-out,med-in,mode-out,mode-in,c95,c5,o-distance(10),o-skip,i-distance(10),i-skip:6652051:130936766:0:19.68366839:-nan:139690:10:1432.146574:139690:10:1432.146574:10:10:10:10:136.0223814:10:198.2695696:10:-nan:0:-nan:0

The dataset had empty vectors, which may or may not be the reason for the warnings. I created another index with a clean set of 1 Mn vectors, and it didn't give any warnings this time. Here's the command output for it.

> ngt info catalog-1m-clean-cosine/
NGT version: 1.13.7
Processed 1000000
The size of the object repository (not the number of the objects):      1000000
The number of the removed objects:      0/1000000
The number of the nodes:        1000000
The number of the edges:        19999890
The mean of the edge lengths:   0.2193799515
The mean of the number of the edges per node:   19.99989
The number of the nodes without edges:  0
The maximum of the outdegrees:  3598
The minimum of the outdegrees:  10
The number of the nodes where indegree is 0:    0
The maximum of the indegrees:   3598
The minimum of the indegrees:   10
#-nodes,#-edges,#-no-indegree,avg-edges,avg-dist,max-out,min-out,v-out,max-in,min-in,v-in,med-out,med-in,mode-out,mode-in,c95,c5,o-distance(10),o-skip,i-distance(10),i-skip:1000000:19999890:0:19.99989:0.2193799515:3598:10:29.96648422:3598:10:29.96648422:13:13:10:10:92.58104:10:177.7591:10:0.2021325693:0:0.2021325693:0
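
For anyone hitting the same warnings, here is a minimal sketch of how the empty vectors could be filtered out before running ngt append. It assumes that the input file holds one whitespace-separated vector per line and that "empty" means a blank line, a line with the wrong number of fields, or an all-zero vector; none of this is confirmed in the thread. The cosine_distance function is a plain illustration, not NGT's implementation. It only shows why a zero-norm vector has no defined cosine distance, which would be consistent with the -nan mean edge length in the first output. All function names are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cos(a, b), or NaN when either vector has zero norm.

    Illustration only; this is not how NGT computes the distance.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # The cosine is 0/0 here, so the distance is undefined.
        return float("nan")
    return 1.0 - dot / (norm_a * norm_b)


def is_usable(line, dim):
    """True if the line parses to `dim` finite floats that are not all zero."""
    fields = line.split()
    if len(fields) != dim:
        return False
    try:
        values = [float(f) for f in fields]
    except ValueError:
        return False
    return all(math.isfinite(v) for v in values) and any(v != 0.0 for v in values)


def filter_vectors(lines, dim):
    """Return (kept_lines, number_of_dropped_lines)."""
    kept = [ln for ln in lines if is_usable(ln, dim)]
    return kept, len(lines) - len(kept)


def clean_file(src_path, dst_path, dim):
    """Copy only the usable vectors from src_path to dst_path; return the drop count."""
    dropped = 0
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            if is_usable(line, dim):
                dst.write(line if line.endswith("\n") else line + "\n")
            else:
                dropped += 1
    return dropped
```

With the commands from the first post, this would be called as clean_file("vectors.ssv", "vectors.clean.ssv", 40), and the cleaned file would then be passed to ngt append. Note that dropping lines shifts the object IDs that NGT assigns, so any external ID mapping would need to be rebuilt from the cleaned file.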

Also, I want to mention that the optimization guide helped me a lot in achieving the desired accuracy and performance with the ONNG index. Thanks a lot for putting it together.

shriyog avatar Dec 24 '21 06:12 shriyog

The dataset has 6.6 Mn vectors. I'll try to reproduce the issue with a minimal dataset and share it with you. Let me get back to you on this by Monday.

shriyog avatar Dec 24 '21 06:12 shriyog

Did you solve this issue?

masajiro avatar Jan 24 '22 22:01 masajiro