[Bug]: vector index benchmark performance is lower than pgvector

Open heni02 opened this issue 1 year ago • 14 comments

Is there an existing issue for the same bug?

  • [X] I have checked the existing issues.

Branch Name

main

Commit ID

b85d7579eec4ca041b63ecf8cf9ceef325f59071

Other Environment Information

- Hardware parameters:
- OS type:
- Others:

Actual Behavior

MO vs. pgvector vector index benchmark results: [image]. See also the spreadsheet: https://doc.weixin.qq.com/sheet/e3_AYYAgwazACsB4L72hnoQ1eAEU31N6?scode=AJsA6gc3AA8QPAY9JuAYYAgwazACs

The test data shows the following performance issues:

  1. Query performance with and without an index: on SIFT 128-dim with 1 million rows, MO shows almost no QPS difference with and without the index, whereas pgvector shows a 6x difference. On the 960-dim dataset, MO's indexed performance is significantly better than its unindexed performance.

  2. QPS comparison: on SIFT 128-dim and GIST 960-dim with 1 million rows, MO recall is essentially the same as pgvector's, but MO QPS is 30-40 times lower.

  3. Effect of the lists setting: in MO, recall with lists=500 is lower than with lists=1000, but QPS does not improve significantly between the two settings. In pgvector, QPS with lists=1000 is significantly better than with lists=500.

  4. Index creation: on SIFT 128-dim with 1 million rows, MO takes 10 times longer than pgvector to create the index. On GIST 960-dim with 1 million rows, MO takes 25 times longer.
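For readers comparing the numbers above, here is a minimal sketch of how recall@k and QPS are commonly computed, plus a rough estimate of how the lists setting changes the work done per query. It is not taken from the benchmark tool. The function names are hypothetical, the `probes` parameter is an assumption (the issue does not mention it), and the scan estimate assumes an IVF-style index with evenly sized lists.

```python
import time
from typing import Callable, Sequence


def recall_at_k(retrieved: Sequence[int], ground_truth: Sequence[int], k: int) -> float:
    """Fraction of the true k nearest neighbours found in the top-k results."""
    truth = set(ground_truth[:k])
    found = set(retrieved[:k])
    return len(found & truth) / k


def measure_qps(run_query: Callable[[int], object], num_queries: int) -> float:
    """Run queries back to back on one connection and return queries per second."""
    start = time.perf_counter()
    for i in range(num_queries):
        run_query(i)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very coarse clocks.
    return num_queries / elapsed if elapsed > 0 else float("inf")


def expected_vectors_scanned(num_rows: int, lists: int, probes: int = 1) -> float:
    """Average vectors compared per query, assuming balanced lists (an assumption)."""
    return num_rows * probes / lists


if __name__ == "__main__":
    # Hypothetical ids: 3 of the 5 true neighbours were returned.
    print(recall_at_k([1, 2, 3, 9, 8], [1, 2, 3, 4, 5], k=5))
    # Doubling lists halves the expected scan when probes stays fixed.
    print(expected_vectors_scanned(1_000_000, lists=500))
    print(expected_vectors_scanned(1_000_000, lists=1000))
```

Under these assumptions, going from lists=500 to lists=1000 should roughly halve the vectors compared per query, which is why a QPS gain would normally be expected from that change.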

Expected Behavior

No response

Steps to Reproduce

1. Clone the benchmark tool: git clone -b sift128 https://github.com/arjunsk/mo-benchmark-test.git
2. Run the benchmark tool with the SIFT 128-dim and GIST 960-dim datasets.

Additional information

No response

heni02 avatar Mar 27 '24 07:03 heni02

Currently tracked in https://github.com/matrixorigin/matrixone/issues/14610 . Will update this issue once the provided issue is resolved.

arjunsk avatar Mar 27 '24 09:03 arjunsk

No progress.

arjunsk avatar Apr 01 '24 10:04 arjunsk

  1. KNN QPS (OK): Waiting for this issue to be resolved: https://github.com/matrixorigin/matrixone/issues/15196 Once that is fixed, we should get around 50-60 QPS for SIFT128.

  2. Create Index duration (needs work): Made some modifications that bring the duration down to 5-6 minutes for SIFT128. Doing some more analysis.

  3. Insert before the index is created (OK): Inserts the SIFT128 data in 330 seconds.

  4. Insert after the index is created (needs work): This benchmark is required. In my local benchmark, it takes 30 minutes to reinsert the same SIFT128 data after the index has been created on the table. More analysis is needed.

arjunsk avatar Apr 10 '24 19:04 arjunsk

KNN can be improved after solving this: https://github.com/matrixorigin/matrixone/issues/15572

Create Index is improved after merging this: https://github.com/matrixorigin/matrixone/pull/15573

arjunsk avatar Apr 17 '24 09:04 arjunsk

Note: Please take the latest pull of https://github.com/arjunsk/mo-benchmark-test/tree/master_index

  1. SIFT 128 dataset
  • With PK
Load (1 million) without index: 30 sec
Create index, lists = 500: 3 min 41 sec
KNN QPS (k = 5) with index: 52
KNN recall (k = 5) with index: 0.7013
Reinsert (2 million) with index: 2799 sec
Insert (1 million) without index: 1292 sec
  • Without PK
Load (1 million) without index: 30 sec
Create index, lists = 500: 3 min 59 sec
KNN QPS (k = 5) with index: 32
KNN recall (k = 5) with index: 0.7162
Reinsert (2 million) with index: 2504 sec
Insert (1 million) without index: 338 sec
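To make the durations above easier to compare, the following sketch converts them into rows per second. It assumes the count in parentheses is the number of rows written in each step, which the comment does not state explicitly. The dictionary keys are hypothetical labels chosen for illustration.

```python
def rows_per_second(rows: int, seconds: float) -> float:
    """Average write throughput for one benchmark step."""
    return rows / seconds


# (rows, seconds) pairs copied from the results above.
# Assumption: the parenthesised count is the number of rows written.
MEASUREMENTS = {
    "with_pk/load_without_index": (1_000_000, 30),
    "with_pk/reinsert_with_index": (2_000_000, 2799),
    "with_pk/insert_without_index": (1_000_000, 1292),
    "without_pk/load_without_index": (1_000_000, 30),
    "without_pk/reinsert_with_index": (2_000_000, 2504),
    "without_pk/insert_without_index": (1_000_000, 338),
}

THROUGHPUT = {
    name: rows_per_second(rows, secs) for name, (rows, secs) in MEASUREMENTS.items()
}

if __name__ == "__main__":
    for name, rate in THROUGHPUT.items():
        print(f"{name}: {rate:,.0f} rows/sec")
```

Read this way, bulk load runs at tens of thousands of rows per second, while reinserting into an indexed table runs at several hundred rows per second. That gap matches the "needs work" status given earlier in the thread for inserts after index creation.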

arjunsk avatar Apr 18 '24 20:04 arjunsk

Hi @heni02, most of the optimizations have been merged into master. Kindly verify the performance change.

arjunsk avatar Apr 18 '24 20:04 arjunsk

Retested after the vector index improvements on main, commit ad5d8c6c43a021760896df846fcf35dff93cfd8f. Performance results:

  1. Good news: compared to previous versions, index creation and QPS have improved significantly, by roughly 1 to 29 times. [image]
  2. Minor bad news: compared to PG (pgvector), index creation is 3-10 times slower, and GIST960 QPS is 2-10 times lower. [image]

Performance results: https://doc.weixin.qq.com/sheet/e3_AYYAgwazACsB4L72hnoQ1eAEU31N6?scode=AJsA6gc3AA8QPAY9JuAYYAgwazACs

heni02 avatar May 09 '24 11:05 heni02

GIST960 performance is still poor. Arjun has created an issue to track it: https://github.com/matrixorigin/matrixone/issues/16001

heni02 avatar May 13 '24 08:05 heni02

Performance optimization is still in progress; moving this issue to the next version.

heni02 avatar May 17 '24 03:05 heni02

Updated pgvector to version 0.7.2 (Postgres server version 14.4) and reran the pgvector benchmark. The new version performs better than the previous 0.4.2 version. See the results: https://doc.weixin.qq.com/sheet/e3_AYYAgwazACsB4L72hnoQ1eAEU31N6?scode=AJsA6gc3AA8QPAY9JuAYYAgwazACs&tab=jcbr8o [image]

pgvector 0.7.2 download: https://pgxn.org/dist/vector/#query-options

Postgres server version: [image: WeCom screenshot b36e0172-7456-443f-b126-32f2b66ad1f1]

heni02 avatar Jul 19 '24 07:07 heni02