langchain4j
PgVector: support hybrid search
Issue
Closes #1599
Change
Implement full-text search and hybrid search in dev.langchain4j.store.embedding.pgvector.PgVectorEmbeddingStore
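For readers unfamiliar with hybrid search: it combines a ranking from vector similarity with a ranking from full-text search. The PR description does not say how the two rankings are merged, so the sketch below uses reciprocal rank fusion purely as an illustrative assumption. The class `HybridFusionSketch`, its method names, and the constant `k` are hypothetical and are not part of LangChain4j.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not the PR's implementation.
class HybridFusionSketch {

    // Reciprocal rank fusion: each id scores the sum of 1 / (k + rank)
    // over the lists it appears in, with rank starting at 1.
    static List<String> fuse(List<String> vectorRanked, List<String> fullTextRanked, int k) {
        Map<String, Double> scores = new HashMap<>();
        accumulate(scores, vectorRanked, k);
        accumulate(scores, fullTextRanked, k);

        List<String> ids = new ArrayList<>(scores.keySet());
        // Highest fused score first; ties are broken by id so the order is deterministic.
        ids.sort((a, b) -> {
            int byScore = Double.compare(scores.get(b), scores.get(a));
            return byScore != 0 ? byScore : a.compareTo(b);
        });
        return ids;
    }

    private static void accumulate(Map<String, Double> scores, List<String> ranked, int k) {
        for (int i = 0; i < ranked.size(); i++) {
            // i is zero-based, so the one-based rank is i + 1.
            scores.merge(ranked.get(i), 1.0 / (k + i + 1), Double::sum);
        }
    }
}
```

An id that appears in both lists accumulates two contributions, so it tends to rise above ids found by only one of the two searches.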
General checklist
- [X] There are no breaking changes
- [X] I have added unit and integration tests for my change
- [X] I have manually run all the unit and integration tests in the module I have added/changed, and they are all green
- [X] I have manually run all the unit and integration tests in the core and main modules, and they are all green
Checklist for changing existing embedding store integration
- [X] I have manually verified that the {NameOfIntegration}EmbeddingStore works correctly with the data persisted using the latest released version of LangChain4j
Hi @hrhrng, thanks a lot and sorry for the delay! You were right, I guess it is better to have it as a separate ContentRetriever implementation (the same way Azure AI search is done and the same way Elasticsearch is being implemented (in progress)).
BTW I've noticed there are a few other PostgreSQL extensions for BM25/full-text search (e.g. pg_search, pg_bestmatch.rs, etc). Did you check/compare them? Did you also have a chance to use this implementation in real life?
Thank you!
@langchain4j Sure, I think full-text search can be abstracted. I'll implement it with a GIN index first (GIN is the default choice for full-text search in PostgreSQL). I do run several apps on PGVector with a GIN index, but the amount of data is not large, so they work fine.
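As a rough illustration of the GIN-based approach mentioned above, the sketch below builds the two SQL statements such a setup typically needs: an expression index over `to_tsvector` and a ranked query using `plainto_tsquery` and `ts_rank`, all of which are standard PostgreSQL functions. The class `FullTextSqlSketch`, the index name suffix, and the `embedding_id` column are assumptions made for the example, not the code in this PR.

```java
// Hypothetical helper for illustration only; not part of LangChain4j.
// Identifiers are concatenated directly, so pass only trusted table/column names.
class FullTextSqlSketch {

    // Builds a GIN expression index over the text column's tsvector.
    static String createGinIndex(String table, String textColumn, String regconfig) {
        return "CREATE INDEX IF NOT EXISTS " + table + "_fts_idx ON " + table
                + " USING gin (to_tsvector('" + regconfig + "', " + textColumn + "))";
    }

    // Builds a ranked full-text query with three JDBC parameters:
    // the query text (twice) and the result limit.
    static String searchQuery(String table, String textColumn, String regconfig) {
        // The expression must match the indexed expression for the index to be usable.
        String tsvector = "to_tsvector('" + regconfig + "', " + textColumn + ")";
        String tsquery = "plainto_tsquery('" + regconfig + "', ?)";
        return "SELECT embedding_id, " + textColumn
                + ", ts_rank(" + tsvector + ", " + tsquery + ") AS score"
                + " FROM " + table
                + " WHERE " + tsvector + " @@ " + tsquery
                + " ORDER BY score DESC LIMIT ?";
    }
}
```

With JDBC, the query text would be bound to the first two parameters and the limit to the third. Keeping the text search configuration (for example `english`) identical in the index and the query is what lets the planner use the index.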
Hey @hrhrng, how is it going? This is quite an important feature, I hope we could support it soon. Thanks a lot for your help! 🙏
@langchain4j Sorry, I was a bit busy earlier. I'll work on this feature next week!
@hrhrng great, thank you a lot!
Hi @langchain4j. I have changed the code according to the previous discussion. Please check whether it meets expectations.
@hrhrng thanks a lot and sorry for the late reply! I will try to review and merge it ASAP
@hrhrng BTW did you use this feature already on real data? How do you find it?
@hrhrng @dliubarskyi any updates on this? It's such a powerful feature that I'd like to start using soon.
@hrhrng @dliubarskyi any updates on this, please? It's a great feature.
@dliubarskyi Sorry for the late reply. I've been a bit busy recently. Actually, I haven't tested this feature on real data yet; I only did some initial testing. Maybe you could invite someone else to try it with real data. Also, I've resolved the conflict, so please have a look. Let me know if you have any other suggestions.