kernel-memory
kernel-memory copied to clipboard
PosgreSQL hybrid search
This PR Includes:
- Text search index
- Hybrid Search configuration
- Vector or hybrid search
Motivation and Context (Why the change? What's the scenario?)
Vector search do not provide the best results in an important number of scenarios. Hybrid search provides better results.
High level description (Approach, Design)
This change includes a new parameter for activate hybrid search on PostgreSQL extension. This parameters defaults to the previous implementation (vector search). The search will use vector search or hybrid search depending on the parameter.
@microsoft-github-policy-service agree
Looks like the PR got stale, with some unsolved errors and comments. We might have to archive it unless someone can kindly complete the task.
Looks like the PR got stale, with some unsolved errors and comments. We might have to archive it unless someone can kindly complete the task.
I will resolve the comments this weekend.
TODO:
- when hybrid search is disabled (default) the code needs to be backward compatible, otherwise people using Postgres and upgrading KM will get a broken deployment.
- We need to define what happens if one enables hybrid search on an existing deployment, considering that KM will have to deal with tables with different schemas
Suggestions for improvement:
-- Add a setting to specify normalization in ts_rank_cd(): https://www.postgresql.org/docs/current/textsearch-controls.html#TEXTSEARCH-RANKING
ts_rank_cd([ weights float4[], ] vector tsvector, query tsquery [, normalization integer ]) returns float4
normalization - this is an important parameter (however, then the calculation of the relevance value will change.).
0 (the default) ignores the document length 1 divides the rank by 1 + the logarithm of the document length 2 divides the rank by the document length 4 divides the rank by the mean harmonic distance between extents (this is implemented only by ts_rank_cd) 8 divides the rank by the number of unique words in document 16 divides the rank by 1 + the logarithm of the number of unique words in document 32 divides the rank by itself + 1
-- Add a setting to specify the Parsing query (websearch_to_tsquery, plainto_tsquery, phraseto_tsquery): https://www.postgresql.org/docs/current/textsearch-controls.html#TEXTSEARCH-PARSING-QUERIES
-- Add minRelevance value settings separately for each search mode
-- I suggest using my example of creating a table:
-
add a column for quick search: content_vector (Convert the search to column usage!)
-
add an index for embedding search: idx_embedding
BEGIN; SELECT pg_advisory_xact_lock(%%lock_id%%); CREATE TABLE IF NOT EXISTS %%table_name%% ( id TEXT NOT NULL PRIMARY KEY, tags TEXT[] DEFAULT '{{}}'::TEXT[] NOT NULL, payload JSONB DEFAULT '{{}}'::JSONB NOT NULL, content TEXT DEFAULT '' NOT NULL, embedding vector(%%vector_size%%), content_vector tsvector GENERATED ALWAYS AS (to_tsvector('{TextSearchLanguage}', coalesce(content, ''))) STORED ); CREATE INDEX IF NOT EXISTS idx_tags ON %%table_name%% USING gin (tags); CREATE INDEX IF NOT EXISTS idx_content_vector ON %%table_name%% USING gin (content_vector); CREATE INDEX IF NOT EXISTS idx_embedding ON %%table_name%% USING hnsw (embedding vector_cosine_ops); COMMIT;
-- It is important that the new hybrid search does not interfere with work without it. Including with a custom SQL script for creating a table!
The relevance values of different searches are calculated differently. They should not be mixed! The FTS results should come first, followed by the semantic search results.
And it is also possible that there will be several Text Search Languages!
TODO:
- when hybrid search is disabled (default) the code needs to be backward compatible, otherwise people using Postgres and upgrading KM will get a broken deployment.
- We need to define what happens if one enables hybrid search on an existing deployment, considering that KM will have to deal with tables with different schemas
@dluc ,
1.In my opinion the code is backward compatible. 2. I add a note in the documentation to explain how to create the index manually. This should also be done if you change the language of the index.
Parking this for now, while we wait for SK Vector Store release. This feature should be available for free.
@dluc ,
I was unable to find hybrid search support in Semantic Kernel. The only reference I came across is this issue, which includes the following comment:
I will move this PR to an Extensions Package to be able use it.
@SignalRT yup, the work is still in progress, it hasn't started for PG but you can see how it will work checking the Azure AI Search connector:
https://github.com/microsoft/semantic-kernel/blob/e96c693851a2e4be14a9e7fd835c4087b063dc58/dotnet/src/Connectors/Connectors.Memory.AzureAISearch/AzureAISearchVectorStoreRecordCollection.cs#L440-L489
The work remaining for PG is implementing IKeywordHybridSearch in PostgresVectorStoreRecordCollection
https://github.com/microsoft/semantic-kernel/blob/e96c693851a2e4be14a9e7fd835c4087b063dc58/dotnet/src/Connectors/Connectors.Memory.Postgres/PostgresVectorStoreRecordCollection.cs#L20C14-L20C49