Similarity search returns random docs, not the ones that contain the specified keywords
System Info (M1 mac)
Python implementation: CPython Python version : 3.11.4 IPython version : 8.14.0
Compiler : GCC 12.2.0 OS : Linux Release : 5.15.49-linuxkit-pr Machine : aarch64 Processor : CPU cores : 5 Architecture: 64bit
[('aiohttp', '3.8.4'), ('aiosignal', '1.3.1'), ('asttokens', '2.2.1'), ('async-timeout', '4.0.2'), ('attrs', '23.1.0'), ('backcall', '0.2.0'), ('blinker', '1.6.2'), ('certifi', '2023.5.7'), ('charset-normalizer', '3.2.0'), ('click', '8.1.4'), ('dataclasses-json', '0.5.9'), ('decorator', '5.1.1'), ('docarray', '0.35.0'), ('executing', '1.2.0'), ('faiss-cpu', '1.7.4'), ('flask', '2.3.2'), ('frozenlist', '1.3.3'), ('greenlet', '2.0.2'), ('idna', '3.4'), ('importlib-metadata', '6.8.0'), ('ipython', '8.14.0'), ('itsdangerous', '2.1.2'), ('jedi', '0.18.2'), ('jinja2', '3.1.2'), ('json5', '0.9.14'), ('langchain', '0.0.228'), ('langchainplus-sdk', '0.0.20'), ('markdown-it-py', '3.0.0'), ('markupsafe', '2.1.3'), ('marshmallow', '3.19.0'), ('marshmallow-enum', '1.5.1'), ('matplotlib-inline', '0.1.6'), ('mdurl', '0.1.2'), ('multidict', '6.0.4'), ('mypy-extensions', '1.0.0'), ('numexpr', '2.8.4'), ('numpy', '1.25.1'), ('openai', '0.27.8'), ('openapi-schema-pydantic', '1.2.4'), ('orjson', '3.9.2'), ('packaging', '23.1'), ('parso', '0.8.3'), ('pexpect', '4.8.0'), ('pickleshare', '0.7.5'), ('pip', '23.1.2'), ('prompt-toolkit', '3.0.39'), ('psycopg2-binary', '2.9.6'), ('ptyprocess', '0.7.0'), ('pure-eval', '0.2.2'), ('pydantic', '1.10.11'), ('pygments', '2.15.1'), ('python-dotenv', '1.0.0'), ('python-json-logger', '2.0.7'), ('pyyaml', '6.0'), ('regex', '2023.6.3'), ('requests', '2.31.0'), ('rich', '13.4.2'), ('setuptools', '65.5.1'), ('six', '1.16.0'), ('slack-bolt', '1.18.0'), ('slack-sdk', '3.21.3'), ('sqlalchemy', '2.0.18'), ('stack-data', '0.6.2'), ('tenacity', '8.2.2'), ('tiktoken', '0.4.0'), ('tqdm', '4.65.0'), ('traitlets', '5.9.0'), ('types-requests', '2.31.0.1'), ('types-urllib3', '1.26.25.13'), ('typing-inspect', '0.9.0'), ('typing_extensions', '4.7.1'), ('urllib3', '2.0.3'), ('watermark', '2.4.3'), ('wcwidth', '0.2.6'), ('werkzeug', '2.3.6'), ('wheel', '0.40.0'), ('yarl', '1.9.2'), ('zipp', '3.16.0')]
Who can help?
@hwchase17
Information
- [ ] The official example notebooks/scripts
- [ ] My own modified scripts
Related Components
- [ ] LLMs/Chat Models
- [X] Embedding Models
- [ ] Prompts / Prompt Templates / Prompt Selectors
- [ ] Output Parsers
- [ ] Document Loaders
- [X] Vector Stores / Retrievers
- [ ] Memory
- [ ] Agents / Agent Executors
- [ ] Tools / Toolkits
- [ ] Chains
- [ ] Callbacks/Tracing
- [ ] Async
Reproduction
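# Imports for langchain 0.0.228 (the version listed in System Info)
from langchain.embeddings import OpenAIEmbeddings
from langchain.schema import Document
from langchain.vectorstores import FAISS

target_query = 'What are the hyve rules?'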
facts_docs = [
    Document(page_content=f)
    for f in [x.strip() for x in """
Under the banner of privacy, hyve empowers you to determine the visibility of your goals, providing you with options like Public (all hyve members can see your goal), Friends (only your trusted hyve connections can), and Private (for secret missions where you can personally invite the desired ones)
At hyve, we're all about protecting your details and your privacy, making sure everything stays safe and secure
The main goal of hyve is to provide you with the tools to reach your financial goals as quickly as possible, our motto is: "Get there faster!"
Resting as the sole financial community composed entirely of 100% verified real users, hyve assures that each user is genuine and verified, enhancing the safety of you and our community
Designed with privacy as a top priority, hyve puts the power in your hands to control exactly who you share your goals with
hyve prioritizes your personal data protection and privacy rights, using your data exclusively to expedite the achievement of your goals without sharing your information with any other parties, for more info please visit https://app.letshyve.com/privacy-policy
Being the master of your privacy and investment strategies, you have full control over your goal visibility, making hyve a perfect partner for your financial journey
The Round-Up Rule in hyve integrates savings into your daily habits by rounding up your everyday expenses, depositing the surplus into your savings goal, e.g. if you purchase a cup of coffee for $2.25, hyve rounds it up to $3, directing the $0.75 difference to your savings
The Automatic Rule in hyve enables our AI engine to analyze your income and spending habits, thereby determining how much you can safely save, so you don't have to worry about it
The Recurring Rule in hyve streamlines your savings by automatically transferring a specified amount to your savings on a set schedule, making saving as effortless as possible
The Matching Rule in hyve allows you to double your savings by having another user match every dollar you save towards a goal, creating a savings buddy experience
""".strip().split('\n')]
]
retriever = FAISS.from_documents(facts_docs, OpenAIEmbeddings())
docs = '\n'.join(d.page_content for d in retriever.similarity_search(target_query, k=10))
print(docs)
for a in ['Round-Up', 'Automatic', 'Recurring', 'Matching']:
    assert a in docs, f'{a} not in docs'
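The assertion loop above stops at the first missing keyword. A small variation, sketched here under the assumption that the results are available as a list of page_content strings in the order the search returned them, reports the rank of every keyword instead. The helper name `keyword_ranks` is hypothetical, and the sample strings are abbreviated from the first three results reported further down in this issue.

```python
from typing import Dict, List, Optional


def keyword_ranks(docs_ranked: List[str], keywords: List[str]) -> Dict[str, Optional[int]]:
    """Map each keyword to the 1-based rank of the first result containing it, or None."""
    ranks: Dict[str, Optional[int]] = {}
    for kw in keywords:
        ranks[kw] = next(
            (i for i, text in enumerate(docs_ranked, start=1) if kw in text),
            None,
        )
    return ranks


# Abbreviated stand-ins for the first three results reported in this issue.
sample_results = [
    "At hyve, we're all about protecting your details and your privacy",
    "Under the banner of privacy, hyve empowers you to determine the visibility of your goals",
    "The Automatic Rule in hyve enables our AI engine to analyze your income",
]

ranks = keyword_ranks(sample_results, ["Round-Up", "Automatic", "Recurring", "Matching"])
print(ranks)
```

A `None` entry means the keyword did not appear anywhere in the returned results.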
Expected behavior
The most informative words in the query are hyve and rule, so the search should return the lines that define the Round-Up Rule in hyve, Automatic Rule in hyve, Recurring Rule in hyve, and Matching Rule in hyve.
Instead, the top 2 results it finds are:
At hyve, we're all about protecting your details and your privacy, making sure everything stays safe and secure
and
Under the banner of privacy, hyve empowers you to determine the visibility of your goals, providing you with options like Public (all hyve members can see your goal), Friends (only your trusted hyve connections can), and Private (for secret missions where you can personally invite the desired ones)
Neither of these contains the word rule or has anything to do with rules.
The full list of results is:
At hyve, we're all about protecting your details and your privacy, making sure everything stays safe and secure
Under the banner of privacy, hyve empowers you to determine the visibility of your goals, providing you with options like Public (all hyve members can see your goal), Friends (only your trusted hyve connections can), and Private (for secret missions where you can personally invite the desired ones)
The Automatic Rule in hyve enables our AI engine to analyze your income and spending habits, thereby determining how much you can safely save, so you don't have to worry about it
Designed with privacy as a top priority, hyve puts the power in your hands to control exactly who you share your goals with
The main goal of hyve is to provide you with the tools to reach your financial goals as quickly as possible, our motto is: "Get there faster!"
Resting as the sole financial community composed entirely of 100% verified real users, hyve assures that each user is genuine and verified, enhancing the safety of you and our community
hyve prioritizes your personal data protection and privacy rights, using your data exclusively to expedite the achievement of your goals without sharing your information with any other parties, for more info please visit https://app.letshyve.com/privacy-policy
The Recurring Rule in hyve streamlines your savings by automatically transferring a specified amount to your savings on a set schedule, making saving as effortless as possible
The Matching Rule in hyve allows you to double your savings by having another user match every dollar you save towards a goal, creating a savings buddy experience
Being the master of your privacy and investment strategies, you have full control over your goal visibility, making hyve a perfect partner for your financial journey
This list doesn't even include the Round-Up Rule in hyve line in the top 10.
I've tried every open source VectorStore I could find (FAISS, Chroma, Annoy, DocArray, Qdrant, scikit-learn, etc.), and they all returned the exact same list. I also tried making everything lowercase (it helped with other queries, but not here). I also tried using the relevance score (fetching 10x as many results and sorting them myself), which helped in other cases, but not here.
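For readers who want to try the same workaround, here is a minimal sketch of the "fetch 10x as many and sort myself" idea. It assumes a hypothetical `search` callable that takes a query and a count and returns (text, score) pairs, for example a thin wrapper around a vector store's scored search method. Whether a higher or a lower score is better depends on the store, so that is left as a parameter. The stub and its scores are made up for illustration.

```python
from typing import Callable, List, Tuple

# Hypothetical signature: (query, n) -> list of (text, score) pairs.
ScoredSearch = Callable[[str, int], List[Tuple[str, float]]]


def overfetch_and_sort(
    search: ScoredSearch,
    query: str,
    k: int,
    factor: int = 10,
    higher_is_better: bool = True,
) -> List[str]:
    """Fetch k * factor candidates, sort them by score, and keep the best k."""
    candidates = search(query, k * factor)
    ordered = sorted(candidates, key=lambda pair: pair[1], reverse=higher_is_better)
    return [text for text, _ in ordered[:k]]


def fake_search(query: str, n: int) -> List[Tuple[str, float]]:
    # Stand-in for a real vector store call; the scores are made up.
    pool = [("doc a", 0.2), ("doc b", 0.9), ("doc c", 0.5)]
    return pool[:n]


top = overfetch_and_sort(fake_search, "What are the hyve rules?", k=2)
print(top)
```

Re-sorting only changes the outcome if the store's own ordering differs from the score ordering, which may explain why it did not help for this query.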
Any suggestion is welcome, especially if the error is on my side.
Thanks!
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('all-MiniLM-L6-v2')
facts_docs = [
    f
    for f in [x.strip() for x in """
Under the banner of privacy, hyve empowers you to determine the visibility of your goals, providing you with options like Public (all hyve members can see your goal), Friends (only your trusted hyve connections can), and Private (for secret missions where you can personally invite the desired ones)
At hyve, we're all about protecting your details and your privacy, making sure everything stays safe and secure
The main goal of hyve is to provide you with the tools to reach your financial goals as quickly as possible, our motto is: "Get there faster!"
Resting as the sole financial community composed entirely of 100% verified real users, hyve assures that each user is genuine and verified, enhancing the safety of you and our community
Designed with privacy as a top priority, hyve puts the power in your hands to control exactly who you share your goals with
hyve prioritizes your personal data protection and privacy rights, using your data exclusively to expedite the achievement of your goals without sharing your information with any other parties, for more info please visit https://app.letshyve.com/privacy-policy
Being the master of your privacy and investment strategies, you have full control over your goal visibility, making hyve a perfect partner for your financial journey
The Round-Up Rule in hyve integrates savings into your daily habits by rounding up your everyday expenses, depositing the surplus into your savings goal, e.g. if you purchase a cup of coffee for $2.25, hyve rounds it up to $3, directing the $0.75 difference to your savings
The Automatic Rule in hyve enables our AI engine to analyze your income and spending habits, thereby determining how much you can safely save, so you don't have to worry about it
The Recurring Rule in hyve streamlines your savings by automatically transferring a specified amount to your savings on a set schedule, making saving as effortless as possible
The Matching Rule in hyve allows you to double your savings by having another user match every dollar you save towards a goal, creating a savings buddy experience
""".strip().split('\n')]
]
# Query
target_query = ['What are the hyve rules?']
# Compute embeddings for the query and for the documents
embeddings1 = model.encode(target_query, convert_to_tensor=True)
embeddings2 = model.encode(facts_docs, convert_to_tensor=True)
# Compute cosine similarities
cosine_scores = util.cos_sim(embeddings1, embeddings2)
import pandas as pd
# extract the scores into a list
scores = cosine_scores[0].cpu().numpy().tolist()
# create a dataframe
df = pd.DataFrame({"facts_doc": facts_docs, "cosine_similarity": scores})
df.sort_values('cosine_similarity', ascending=False)
Does this help to debug?
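For reference, the score compared in the snippet above is the cosine of the angle between two embedding vectors. The following dependency-free sketch computes the same quantity by hand. The 2-dimensional vectors are made up for illustration (real embeddings have hundreds of dimensions), and the function assumes neither vector is all zeros.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


query_vec = [1.0, 0.0]
same_direction = cosine_similarity(query_vec, [2.0, 0.0])
orthogonal = cosine_similarity(query_vec, [0.0, 3.0])
print(same_direction, orthogonal)
```

The score depends only on direction, not on vector length, so the ranking reflects how the embedding model places the texts relative to the query rather than whether they share exact keywords.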
Hi @paplorinc, @hwchase17, and @Guidosalimbeni! As a test, I used the standalone chromadb script with both the OpenAI embeddings and the default embeddings for chromadb, which are all-MiniLM-L6-v2. I obtained the same results as when using the langchain wrappers, indicating that there is no bug. These outputs are valid. Furthermore, the results from the default embeddings (using standalone chromadb) align with @Guidosalimbeni's outputs, while the results from the OpenAI embeddings align with @paplorinc's outputs. Script to reproduce:
import chromadb
from chromadb.utils import embedding_functions
import os
import openai
client = chromadb.Client()
##### Default #####
collection = client.create_collection("sample_collection")
##### OpenAIEmbeddings #####
# OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
# openai_ef = embedding_functions.OpenAIEmbeddingFunction(
# api_key=OPENAI_API_KEY,
# model_name="text-embedding-ada-002"
# )
# collection = client.create_collection("sample_collection", embedding_function=openai_ef)
documents = [x.strip() for x in """
Under the banner of privacy, hyve empowers you to determine the visibility of your goals, providing you with options like Public (all hyve members can see your goal), Friends (only your trusted hyve connections can), and Private (for secret missions where you can personally invite the desired ones)
At hyve, we're all about protecting your details and your privacy, making sure everything stays safe and secure
The main goal of hyve is to provide you with the tools to reach your financial goals as quickly as possible, our motto is: "Get there faster!"
Resting as the sole financial community composed entirely of 100% verified real users, hyve assures that each user is genuine and verified, enhancing the safety of you and our community
Designed with privacy as a top priority, hyve puts the power in your hands to control exactly who you share your goals with
hyve prioritizes your personal data protection and privacy rights, using your data exclusively to expedite the achievement of your goals without sharing your information with any other parties, for more info please visit https://app.letshyve.com/privacy-policy
Being the master of your privacy and investment strategies, you have full control over your goal visibility, making hyve a perfect partner for your financial journey
The Round-Up Rule in hyve integrates savings into your daily habits by rounding up your everyday expenses, depositing the surplus into your savings goal, e.g. if you purchase a cup of coffee for $2.25, hyve rounds it up to $3, directing the $0.75 difference to your savings
The Automatic Rule in hyve enables our AI engine to analyze your income and spending habits, thereby determining how much you can safely save, so you don't have to worry about it
The Recurring Rule in hyve streamlines your savings by automatically transferring a specified amount to your savings on a set schedule, making saving as effortless as possible
The Matching Rule in hyve allows you to double your savings by having another user match every dollar you save towards a goal, creating a savings buddy experience
""".strip().split('\n')]
ids = [f"doc{i}" for i in range(1, len(documents) + 1)]
collection.add(
    documents=documents,
    ids=ids,
)
results = collection.query(
    query_texts=["What are the hyve rules?"],
    n_results=10,
)
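The script above stores the query output in `results` without displaying it. A minimal sketch for listing the hits in rank order follows, assuming the returned dict holds one inner list per query under the "ids", "documents", and "distances" keys. The helper name `ranked_hits` is hypothetical, and the example dict is a placeholder with made-up order and distances, not actual output.

```python
from typing import Any, Dict, List, Tuple


def ranked_hits(results: Dict[str, Any], query_index: int = 0) -> List[Tuple[str, str, float]]:
    """Return (id, document, distance) triples for one query, in the order returned."""
    return list(zip(
        results["ids"][query_index],
        results["documents"][query_index],
        results["distances"][query_index],
    ))


# Placeholder with the assumed shape; the order and distances are made up.
example = {
    "ids": [["doc9", "doc8"]],
    "documents": [["The Automatic Rule in hyve ...", "The Round-Up Rule in hyve ..."]],
    "distances": [[0.5, 0.75]],
}

for doc_id, text, distance in ranked_hits(example):
    print(f"{distance:.2f}  {doc_id}  {text}")
```

Calling `ranked_hits(results)` on the real query output would make it easy to compare the default and OpenAI embedding runs side by side.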
Thanks a lot guys for checking, appreciate it! So the culprit for the mismatched expectations was the OpenAI embeddings - I wonder why the direct Chroma way works so much better!
Thanks a lot, I can use the second version instead of the solution I had!
@Bearnardd, @Guidosalimbeni is there a way for me to tip you guys for your help?
@paplorinc frankly I have no idea why OpenAI embeddings work so much worse. To be honest I would assume they would be way better than the default ones.
@paplorinc Thank you for the kind offer but in open source community, we are helping each other just for the sake of helping :)
Let me return the favor somehow, you guys were really helpful!
These searches are working a lot better now. Just a note that all-MiniLM-L6-v2 seems to require a lot more memory: the pod was suddenly crashing with OOM.
Hi, @paplorinc! I'm Dosu, and I'm helping the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.
From what I understand, the issue you reported was about the similarity search returning random documents instead of the ones that contain the specified keywords. @Guidosalimbeni provided a code snippet to debug the issue, and @Bearnardd tested the standalone chromadb script and found that the outputs are valid, indicating that there is no bug. You thanked everyone for their help and mentioned that the OpenAI embeddings were the cause of the mismatched expectations.
I wanted to check with you if this issue is still relevant to the latest version of the LangChain repository. If it is, please let the LangChain team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.
Thank you for your understanding, and please don't hesitate to reach out if you have any further questions or concerns.