pgvector dim size

Open nathan-vo810 opened this issue 1 year ago • 9 comments

The limit for the vector type is 16,000 dimensions (docs). 2,000 is the limit for indexing it (you'll see an error if you try).
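As a minimal sketch of the distinction above, the two limits can be treated as separate checks. The constant and function names are hypothetical, and the values are the ones quoted in this comment rather than read from pgvector itself.

```python
# Limits as quoted in this thread; hypothetical constants, not a pgvector API.
MAX_STORAGE_DIMS = 16_000  # largest `vector` value that can be stored
MAX_INDEX_DIMS = 2_000     # largest `vector` column that can be indexed


def pgvector_support(dims: int) -> dict:
    """Report whether a vector with `dims` dimensions can be stored and indexed."""
    if dims < 1:
        raise ValueError("dims must be a positive integer")
    return {
        "storable": dims <= MAX_STORAGE_DIMS,
        "indexable": dims <= MAX_INDEX_DIMS,
    }


if __name__ == "__main__":
    for dims in (2_000, 3_072, 16_001):
        print(dims, pgvector_support(dims))
```

A 3,072-dimensional embedding falls in the middle case: it can be stored, but not indexed.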

nathan-vo810 avatar Sep 30 '24 20:09 nathan-vo810

Can you create a PR changing it in this file? Update the vector_dims property and add the docs as the source URL. Then @dhruv-anand-aintech can review it.

AruneshSingh avatar Oct 01 '24 09:10 AruneshSingh

Hi @nathan-vo810,

Thanks for updating here.

> 2,000 is the limit for indexing it

I think it would be reasonable to keep the primary field as 2000 then. A viewer of this table should not have to try vector search with a 2k+ vector on pgvector and then see the error after we listed 16k.

At most, the 16k limit can be added as a note in the comment section of the cell.

dhruv-anand-aintech avatar Oct 01 '24 09:10 dhruv-anand-aintech

As a user, when I first saw the vector size limit, I assumed it wouldn’t be possible to store embeddings larger than 2000 dimensions. However, it turns out that it is possible, and the vector search (nearest neighbor search) works just fine.

The only aspect affected by the 2000-dimension limit is indexing.

In any case, it’s up to you to decide how to handle this. Thanks for providing such a comprehensive tool for comparison!

nathan-vo810 avatar Oct 01 '24 17:10 nathan-vo810

@nathan-vo810 so you are saying that full-scan search works with longer vectors, just the approximate nearest neighbor search doesn't?

I'd be curious what your use case for larger vectors is and what full-scan does to your latency, if you are open to sharing!

svonava-superlinked avatar Oct 01 '24 19:10 svonava-superlinked

In my current use case, I’m using LangChain with pgvector, specifically with the text-embedding-3-large model, which has a vector dimension of 3072. I’m able to perform similarity searches on the database without any issues.

The 2,000 limit only applies to indexing, so it doesn't prevent vector searches. What I'm trying to clarify is that the vector dims column in the table should list 16,000, to match the limit of the vector type in pgvector.
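To illustrate why the indexing limit does not block search: an exact (full-scan) nearest neighbor search compares the query against every stored vector, so no index is involved. The following is a pure-Python sketch of that idea under assumed names (`cosine_distance`, `full_scan_search`); it is not how pgvector or LangChain implement it.

```python
import math
import random


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def full_scan_search(query, rows, k=3):
    """Exact k-nearest-neighbor search: score every row, then sort by distance."""
    scored = [(cosine_distance(query, vec), row_id) for row_id, vec in rows]
    scored.sort()
    return [row_id for _, row_id in scored[:k]]


if __name__ == "__main__":
    random.seed(0)
    dims = 3072  # same dimensionality as the embeddings discussed here
    rows = [(i, [random.gauss(0, 1) for _ in range(dims)]) for i in range(100)]
    query = rows[42][1]  # a stored vector should be its own nearest neighbor
    print(full_scan_search(query, rows, k=3))
```

Since every row is scored, the cost of this approach grows linearly with the number of rows.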

nathan-vo810 avatar Oct 01 '24 19:10 nathan-vo810

Got it! How many vectors do you have and what search latency do you observe?

svonava-superlinked avatar Oct 01 '24 19:10 svonava-superlinked

I haven’t actually measured the latency, but with my current DB of 15,000 records, I think it’s less than 1s.

nathan-vo810 avatar Oct 01 '24 20:10 nathan-vo810

@nathan-vo810 got it - and probably quite low queries per second? As in, sub-1 QPS?

@dhruv-anand-aintech I think the proper way to handle this would be to have a dim limit per indexing algorithm (since we now have a list of supported algos), but that sounds like a nightmare to maintain.

Alternatively, we could add a comment for the dims column to clarify that the limit applies to ANN-type indexes.
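For comparison, the first option could look roughly like the sketch below: a per-algorithm mapping instead of a single number. The field names and the `max_dims_for` helper are hypothetical and do not reflect the actual VectorHub schema. The values are the ones quoted in this thread, and listing HNSW and IVFFlat as the ANN indexes in question is an assumption.

```python
# Hypothetical shape for per-index dimension limits; not the real VectorHub schema.
PGVECTOR_DIM_LIMITS = {
    "storage": 16_000,  # vector type limit quoted in this thread
    "indexes": {
        "hnsw": 2_000,     # assumed to share the 2,000 indexing limit
        "ivfflat": 2_000,  # assumed to share the 2,000 indexing limit
    },
}


def max_dims_for(limits, index=None):
    """Return the storage limit, or the limit for a named index type."""
    if index is None:
        return limits["storage"]
    return limits["indexes"][index]


if __name__ == "__main__":
    print(max_dims_for(PGVECTOR_DIM_LIMITS))          # storage-only limit
    print(max_dims_for(PGVECTOR_DIM_LIMITS, "hnsw"))  # limit when indexed
```

Every vendor entry would need such a mapping kept up to date, which is the maintenance burden mentioned above.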

What do you think?

svonava-superlinked avatar Oct 01 '24 20:10 svonava-superlinked

Yeah I would prefer the latter suggestion (clarify further in column description), as this kind of case is not common.

dhruv-anand-aintech avatar Oct 01 '24 20:10 dhruv-anand-aintech

Closing this issue as the latest commit adds the clarification for ANN-type indexes.

AruneshSingh avatar Feb 05 '25 14:02 AruneshSingh