kernel-memory icon indicating copy to clipboard operation
kernel-memory copied to clipboard

PosgreSQL hybrid search

Open SignalRT opened this issue 11 months ago • 9 comments

This PR Includes:

  • Text search index
  • Hybrid Search configuration
  • Vector or hybrid search

Motivation and Context (Why the change? What's the scenario?)

Vector search do not provide the best results in an important number of scenarios. Hybrid search provides better results.

High level description (Approach, Design)

This change includes a new parameter for activate hybrid search on PostgreSQL extension. This parameters defaults to the previous implementation (vector search). The search will use vector search or hybrid search depending on the parameter.

SignalRT avatar Dec 29 '24 11:12 SignalRT

@microsoft-github-policy-service agree

SignalRT avatar Dec 29 '24 11:12 SignalRT

Looks like the PR got stale, with some unsolved errors and comments. We might have to archive it unless someone can kindly complete the task.

dluc avatar Jan 29 '25 09:01 dluc

Looks like the PR got stale, with some unsolved errors and comments. We might have to archive it unless someone can kindly complete the task.

I will resolve the comments this weekend.

SignalRT avatar Jan 29 '25 12:01 SignalRT

TODO:

  1. when hybrid search is disabled (default) the code needs to be backward compatible, otherwise people using Postgres and upgrading KM will get a broken deployment.
  2. We need to define what happens if one enables hybrid search on an existing deployment, considering that KM will have to deal with tables with different schemas

dluc avatar Feb 11 '25 07:02 dluc

Suggestions for improvement:

-- Add a setting to specify normalization in ts_rank_cd(): https://www.postgresql.org/docs/current/textsearch-controls.html#TEXTSEARCH-RANKING

ts_rank_cd([ weights float4[], ] vector tsvector, query tsquery [, normalization integer ]) returns float4

normalization - this is an important parameter (however, then the calculation of the relevance value will change.).

0 (the default) ignores the document length 1 divides the rank by 1 + the logarithm of the document length 2 divides the rank by the document length 4 divides the rank by the mean harmonic distance between extents (this is implemented only by ts_rank_cd) 8 divides the rank by the number of unique words in document 16 divides the rank by 1 + the logarithm of the number of unique words in document 32 divides the rank by itself + 1

-- Add a setting to specify the Parsing query (websearch_to_tsquery, plainto_tsquery, phraseto_tsquery): https://www.postgresql.org/docs/current/textsearch-controls.html#TEXTSEARCH-PARSING-QUERIES

-- Add minRelevance value settings separately for each search mode

-- I suggest using my example of creating a table:

  • add a column for quick search: content_vector (Convert the search to column usage!)

  • add an index for embedding search: idx_embedding

        BEGIN;
        SELECT pg_advisory_xact_lock(%%lock_id%%);
    
        CREATE TABLE IF NOT EXISTS %%table_name%% (
          id                TEXT NOT NULL PRIMARY KEY,
          tags              TEXT[] DEFAULT '{{}}'::TEXT[] NOT NULL,
          payload           JSONB DEFAULT '{{}}'::JSONB NOT NULL,
          content           TEXT DEFAULT '' NOT NULL,
          embedding         vector(%%vector_size%%),
          content_vector    tsvector GENERATED ALWAYS AS (to_tsvector('{TextSearchLanguage}', coalesce(content, ''))) STORED
        );
    
        CREATE INDEX IF NOT EXISTS idx_tags ON %%table_name%% USING gin (tags);
        CREATE INDEX IF NOT EXISTS idx_content_vector ON %%table_name%% USING gin (content_vector);
        CREATE INDEX IF NOT EXISTS idx_embedding ON %%table_name%% USING hnsw (embedding vector_cosine_ops);
    
        COMMIT;
    

-- It is important that the new hybrid search does not interfere with work without it. Including with a custom SQL script for creating a table!

The relevance values of different searches are calculated differently. They should not be mixed! The FTS results should come first, followed by the semantic search results.

And it is also possible that there will be several Text Search Languages!

aropb avatar Feb 11 '25 08:02 aropb

TODO:

  1. when hybrid search is disabled (default) the code needs to be backward compatible, otherwise people using Postgres and upgrading KM will get a broken deployment.
  2. We need to define what happens if one enables hybrid search on an existing deployment, considering that KM will have to deal with tables with different schemas

@dluc ,

1.In my opinion the code is backward compatible. 2. I add a note in the documentation to explain how to create the index manually. This should also be done if you change the language of the index.

SignalRT avatar Mar 16 '25 16:03 SignalRT

Parking this for now, while we wait for SK Vector Store release. This feature should be available for free.

dluc avatar Mar 22 '25 22:03 dluc

@dluc ,

I was unable to find hybrid search support in Semantic Kernel. The only reference I came across is this issue, which includes the following comment:

image

I will move this PR to an Extensions Package to be able use it.

SignalRT avatar Mar 23 '25 21:03 SignalRT

@SignalRT yup, the work is still in progress, it hasn't started for PG but you can see how it will work checking the Azure AI Search connector:

https://github.com/microsoft/semantic-kernel/blob/e96c693851a2e4be14a9e7fd835c4087b063dc58/dotnet/src/Connectors/Connectors.Memory.AzureAISearch/AzureAISearchVectorStoreRecordCollection.cs#L440-L489

The work remaining for PG is implementing IKeywordHybridSearch in PostgresVectorStoreRecordCollection

https://github.com/microsoft/semantic-kernel/blob/e96c693851a2e4be14a9e7fd835c4087b063dc58/dotnet/src/Connectors/Connectors.Memory.Postgres/PostgresVectorStoreRecordCollection.cs#L20C14-L20C49

dluc avatar Mar 23 '25 23:03 dluc