Similarity search returns random docs, not the ones that contain the specified keywords

Open l0rinc opened this issue 1 year ago • 7 comments

System Info (M1 mac)

Python implementation: CPython
Python version       : 3.11.4
IPython version      : 8.14.0

Compiler    : GCC 12.2.0
OS          : Linux
Release     : 5.15.49-linuxkit-pr
Machine     : aarch64
Processor   :
CPU cores   : 5
Architecture: 64bit

[('aiohttp', '3.8.4'), ('aiosignal', '1.3.1'), ('asttokens', '2.2.1'), ('async-timeout', '4.0.2'), ('attrs', '23.1.0'), ('backcall', '0.2.0'), ('blinker', '1.6.2'), ('certifi', '2023.5.7'), ('charset-normalizer', '3.2.0'), ('click', '8.1.4'), ('dataclasses-json', '0.5.9'), ('decorator', '5.1.1'), ('docarray', '0.35.0'), ('executing', '1.2.0'), ('faiss-cpu', '1.7.4'), ('flask', '2.3.2'), ('frozenlist', '1.3.3'), ('greenlet', '2.0.2'), ('idna', '3.4'), ('importlib-metadata', '6.8.0'), ('ipython', '8.14.0'), ('itsdangerous', '2.1.2'), ('jedi', '0.18.2'), ('jinja2', '3.1.2'), ('json5', '0.9.14'), ('langchain', '0.0.228'), ('langchainplus-sdk', '0.0.20'), ('markdown-it-py', '3.0.0'), ('markupsafe', '2.1.3'), ('marshmallow', '3.19.0'), ('marshmallow-enum', '1.5.1'), ('matplotlib-inline', '0.1.6'), ('mdurl', '0.1.2'), ('multidict', '6.0.4'), ('mypy-extensions', '1.0.0'), ('numexpr', '2.8.4'), ('numpy', '1.25.1'), ('openai', '0.27.8'), ('openapi-schema-pydantic', '1.2.4'), ('orjson', '3.9.2'), ('packaging', '23.1'), ('parso', '0.8.3'), ('pexpect', '4.8.0'), ('pickleshare', '0.7.5'), ('pip', '23.1.2'), ('prompt-toolkit', '3.0.39'), ('psycopg2-binary', '2.9.6'), ('ptyprocess', '0.7.0'), ('pure-eval', '0.2.2'), ('pydantic', '1.10.11'), ('pygments', '2.15.1'), ('python-dotenv', '1.0.0'), ('python-json-logger', '2.0.7'), ('pyyaml', '6.0'), ('regex', '2023.6.3'), ('requests', '2.31.0'), ('rich', '13.4.2'), ('setuptools', '65.5.1'), ('six', '1.16.0'), ('slack-bolt', '1.18.0'), ('slack-sdk', '3.21.3'), ('sqlalchemy', '2.0.18'), ('stack-data', '0.6.2'), ('tenacity', '8.2.2'), ('tiktoken', '0.4.0'), ('tqdm', '4.65.0'), ('traitlets', '5.9.0'), ('types-requests', '2.31.0.1'), ('types-urllib3', '1.26.25.13'), ('typing-inspect', '0.9.0'), ('typing_extensions', '4.7.1'), ('urllib3', '2.0.3'), ('watermark', '2.4.3'), ('wcwidth', '0.2.6'), ('werkzeug', '2.3.6'), ('wheel', '0.40.0'), ('yarl', '1.9.2'), ('zipp', '3.16.0')]

Who can help?

@hwchase17

Information

  • [ ] The official example notebooks/scripts
  • [ ] My own modified scripts

Related Components

  • [ ] LLMs/Chat Models
  • [X] Embedding Models
  • [ ] Prompts / Prompt Templates / Prompt Selectors
  • [ ] Output Parsers
  • [ ] Document Loaders
  • [X] Vector Stores / Retrievers
  • [ ] Memory
  • [ ] Agents / Agent Executors
  • [ ] Tools / Toolkits
  • [ ] Chains
  • [ ] Callbacks/Tracing
  • [ ] Async

Reproduction

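# Imports needed by the snippet (module paths as of langchain 0.0.228)
from langchain.embeddings import OpenAIEmbeddings
from langchain.schema import Document
from langchain.vectorstores import FAISS

target_query = 'What are the hyve rules?'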
facts_docs = [
    Document(page_content=f)
    for f in [x.strip() for x in """
        Under the banner of privacy, hyve empowers you to determine the visibility of your goals, providing you with options like Public (all hyve members can see your goal), Friends (only your trusted hyve connections can), and Private (for secret missions where you can personally invite the desired ones)
        At hyve, we're all about protecting your details and your privacy, making sure everything stays safe and secure
        The main goal of hyve is to provide you with the tools to reach your financial goals as quickly as possible, our motto is: "Get there faster!"
        Resting as the sole financial community composed entirely of 100% verified real users, hyve assures that each user is genuine and verified, enhancing the safety of you and our community
        Designed with privacy as a top priority, hyve puts the power in your hands to control exactly who you share your goals with
        hyve prioritizes your personal data protection and privacy rights, using your data exclusively to expedite the achievement of your goals without sharing your information with any other parties, for more info please visit https://app.letshyve.com/privacy-policy
        Being the master of your privacy and investment strategies, you have full control over your goal visibility, making hyve a perfect partner for your financial journey
        The Round-Up Rule in hyve integrates savings into your daily habits by rounding up your everyday expenses, depositing the surplus into your savings goal, e.g. if you purchase a cup of coffee for $2.25, hyve rounds it up to $3, directing the $0.75 difference to your savings
        The Automatic Rule in hyve enables our AI engine to analyze your income and spending habits, thereby determining how much you can safely save, so you don't have to worry about it
        The Recurring Rule in hyve streamlines your savings by automatically transferring a specified amount to your savings on a set schedule, making saving as effortless as possible
        The Matching Rule in hyve allows you to double your savings by having another user match every dollar you save towards a goal, creating a savings buddy experience
    """.strip().split('\n')]
]
retriever = FAISS.from_documents(facts_docs, OpenAIEmbeddings())
docs = '\n'.join(d.page_content for d in retriever.similarity_search(target_query, k=10))
print(docs)
for a in ['Round-Up', 'Automatic', 'Recurring', 'Matching']:
    assert a in docs, f'{a} not in docs'

Expected behavior

The words that carry the most information in the query are hyve and rule, so it should return the lines that define the Round-Up Rule in hyve, the Automatic Rule in hyve, the Recurring Rule in hyve, and the Matching Rule in hyve.

Instead, the two best results it finds are:

At hyve, we're all about protecting your details and your privacy, making sure everything stays safe and secure

and

Under the banner of privacy, hyve empowers you to determine the visibility of your goals, providing you with options like Public (all hyve members can see your goal), Friends (only your trusted hyve connections can), and Private (for secret missions where you can personally invite the desired ones)

which don't even have the word rule in them or have anything to do with rules.

The full list of results is:

At hyve, we're all about protecting your details and your privacy, making sure everything stays safe and secure
Under the banner of privacy, hyve empowers you to determine the visibility of your goals, providing you with options like Public (all hyve members can see your goal), Friends (only your trusted hyve connections can), and Private (for secret missions where you can personally invite the desired ones)
The Automatic Rule in hyve enables our AI engine to analyze your income and spending habits, thereby determining how much you can safely save, so you don't have to worry about it
Designed with privacy as a top priority, hyve puts the power in your hands to control exactly who you share your goals with
The main goal of hyve is to provide you with the tools to reach your financial goals as quickly as possible, our motto is: "Get there faster!"
Resting as the sole financial community composed entirely of 100% verified real users, hyve assures that each user is genuine and verified, enhancing the safety of you and our community
hyve prioritizes your personal data protection and privacy rights, using your data exclusively to expedite the achievement of your goals without sharing your information with any other parties, for more info please visit https://app.letshyve.com/privacy-policy
The Recurring Rule in hyve streamlines your savings by automatically transferring a specified amount to your savings on a set schedule, making saving as effortless as possible
The Matching Rule in hyve allows you to double your savings by having another user match every dollar you save towards a goal, creating a savings buddy experience
Being the master of your privacy and investment strategies, you have full control over your goal visibility, making hyve a perfect partner for your financial journey

which doesn't even include the Round-Up Rule in hyve line in the top 10.

I've tried every open source VectorStore I could find (FAISS, Chroma, Annoy, DocArray, Qdrant, scikit-learn, etc.), and they all returned the exact same list. I also tried making everything lowercase (it did help with other queries, but not here). I also tried using the relevance score (fetching 10x as many results and sorting them myself), which did help in other cases, but not here.
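To make the expectation above concrete, here is a minimal sketch of a purely lexical baseline. It is not how any of the vector stores rank results, and `terms` and `keyword_score` are hypothetical helpers, not LangChain APIs. It assumes naive lowercase tokenization, a tiny hand-picked stopword list, and a crude plural rule, and it uses shortened versions of three of the documents.

```python
import re

# Assumption: a tiny hand-picked stopword list, just enough for this query.
STOPWORDS = frozenset({"what", "are", "the", "in", "a", "of"})

def terms(text):
    """Lowercase, split on non-alphanumerics, drop stopwords, strip a plural 's'."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {
        t[:-1] if len(t) > 3 and t.endswith("s") else t
        for t in tokens
        if t not in STOPWORDS
    }

def keyword_score(query, doc):
    """Fraction of query terms that also occur in the document."""
    q = terms(query)
    return len(q & terms(doc)) / len(q) if q else 0.0

# Shortened versions of three of the documents above.
docs = [
    "At hyve, we're all about protecting your details and your privacy",
    "The Round-Up Rule in hyve integrates savings into your daily habits",
    "The Matching Rule in hyve allows you to double your savings",
]
query = "What are the hyve rules?"

# Documents that mention both "hyve" and "rule" sort first.
ranked = sorted(docs, key=lambda d: keyword_score(query, d), reverse=True)
for d in ranked:
    print(f"{keyword_score(query, d):.2f}  {d}")
```

Under this scoring the two rule lines match both query terms and the privacy line matches only one, which is the ordering I expected. Embedding search compares whole-sentence vectors rather than shared words, so it is not obliged to agree with this.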

Any suggestion is welcome, especially if the error is on my side.

Thanks!

l0rinc avatar Jul 09 '23 12:07 l0rinc

from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('all-MiniLM-L6-v2')


facts_docs = [
    f
    for f in [x.strip() for x in """
        Under the banner of privacy, hyve empowers you to determine the visibility of your goals, providing you with options like Public (all hyve members can see your goal), Friends (only your trusted hyve connections can), and Private (for secret missions where you can personally invite the desired ones)
        At hyve, we're all about protecting your details and your privacy, making sure everything stays safe and secure
        The main goal of hyve is to provide you with the tools to reach your financial goals as quickly as possible, our motto is: "Get there faster!"
        Resting as the sole financial community composed entirely of 100% verified real users, hyve assures that each user is genuine and verified, enhancing the safety of you and our community
        Designed with privacy as a top priority, hyve puts the power in your hands to control exactly who you share your goals with
        hyve prioritizes your personal data protection and privacy rights, using your data exclusively to expedite the achievement of your goals without sharing your information with any other parties, for more info please visit https://app.letshyve.com/privacy-policy
        Being the master of your privacy and investment strategies, you have full control over your goal visibility, making hyve a perfect partner for your financial journey
        The Round-Up Rule in hyve integrates savings into your daily habits by rounding up your everyday expenses, depositing the surplus into your savings goal, e.g. if you purchase a cup of coffee for $2.25, hyve rounds it up to $3, directing the $0.75 difference to your savings
        The Automatic Rule in hyve enables our AI engine to analyze your income and spending habits, thereby determining how much you can safely save, so you don't have to worry about it
        The Recurring Rule in hyve streamlines your savings by automatically transferring a specified amount to your savings on a set schedule, making saving as effortless as possible
        The Matching Rule in hyve allows you to double your savings by having another user match every dollar you save towards a goal, creating a savings buddy experience
    """.strip().split('\n')]
]

# query
target_query = ['What are the hyve rules?']

#Compute embedding for both lists
embeddings1 = model.encode(target_query, convert_to_tensor=True)
embeddings2 = model.encode(facts_docs, convert_to_tensor=True)

#Compute cosine-similarities
cosine_scores = util.cos_sim(embeddings1, embeddings2)

import pandas as pd

# extract the scores into a list
scores = cosine_scores[0].cpu().numpy().tolist()

# create a dataframe
df = pd.DataFrame({"facts_doc": facts_docs, "cosine_similarity": scores})

df.sort_values('cosine_similarity', ascending=False)
[Image: Screenshot 2023-07-09 at 17 04 01]

Does this help to debug?
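For reference, the score computed by `util.cos_sim` is the cosine similarity: the dot product of two vectors divided by the product of their lengths. A minimal pure-Python sketch follows; the 3-dimensional vectors and their labels are made up for illustration and are not real embeddings (all-MiniLM-L6-v2 produces 384-dimensional vectors).

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up toy vectors, only to show how the ranking is derived.
query_vec = [1.0, 0.0, 1.0]
doc_vecs = {
    "rule doc": [1.0, 0.0, 0.9],
    "privacy doc": [0.0, 1.0, 0.2],
}

# Highest cosine similarity first, like sort_values(ascending=False) above.
ranked = sorted(
    doc_vecs,
    key=lambda name: cosine_similarity(query_vec, doc_vecs[name]),
    reverse=True,
)
print(ranked)
```

The ranking depends only on the direction of the vectors, so which documents come out on top is determined entirely by the embedding model, not by the vector store.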

Guidosalimbeni avatar Jul 09 '23 16:07 Guidosalimbeni

Hi @paplorinc, @hwchase17, and @Guidosalimbeni! As a test, I used the standalone chromadb script with both the OpenAI embeddings and the default embeddings for chromadb, which are all-MiniLM-L6-v2. I obtained the same results as when using langchain wrappers, indicating that there is no bug. These outputs are valid. Furthermore, the results from the default embeddings (using standalone chromadb) align with @Guidosalimbeni's outputs, while the results from the OpenAI embeddings align with @paplorinc's outputs. Script to reproduce:

import chromadb
from chromadb.utils import embedding_functions
import os
import openai

client = chromadb.Client()

##### Default #####
collection = client.create_collection("sample_collection")

##### OpenAIEmbeddings #####
# OPENAI_API_KEY  = os.getenv("OPENAI_API_KEY")
# openai_ef = embedding_functions.OpenAIEmbeddingFunction(
#                 api_key=OPENAI_API_KEY,
#                 model_name="text-embedding-ada-002"
#             )
# collection = client.create_collection("sample_collection", embedding_function=openai_ef)

documents = [x.strip() for x in """
        Under the banner of privacy, hyve empowers you to determine the visibility of your goals, providing you with options like Public (all hyve members can see your goal), Friends (only your trusted hyve connections can), and Private (for secret missions where you can personally invite the desired ones)
        At hyve, we're all about protecting your details and your privacy, making sure everything stays safe and secure
        The main goal of hyve is to provide you with the tools to reach your financial goals as quickly as possible, our motto is: "Get there faster!"
        Resting as the sole financial community composed entirely of 100% verified real users, hyve assures that each user is genuine and verified, enhancing the safety of you and our community
        Designed with privacy as a top priority, hyve puts the power in your hands to control exactly who you share your goals with
        hyve prioritizes your personal data protection and privacy rights, using your data exclusively to expedite the achievement of your goals without sharing your information with any other parties, for more info please visit https://app.letshyve.com/privacy-policy
        Being the master of your privacy and investment strategies, you have full control over your goal visibility, making hyve a perfect partner for your financial journey
        The Round-Up Rule in hyve integrates savings into your daily habits by rounding up your everyday expenses, depositing the surplus into your savings goal, e.g. if you purchase a cup of coffee for $2.25, hyve rounds it up to $3, directing the $0.75 difference to your savings
        The Automatic Rule in hyve enables our AI engine to analyze your income and spending habits, thereby determining how much you can safely save, so you don't have to worry about it
        The Recurring Rule in hyve streamlines your savings by automatically transferring a specified amount to your savings on a set schedule, making saving as effortless as possible
        The Matching Rule in hyve allows you to double your savings by having another user match every dollar you save towards a goal, creating a savings buddy experience
    """.strip().split('\n')]

ids = [f"doc{i}" for i in range(1, len(documents) + 1)]

collection.add(
    documents=documents,
    ids=ids,
)

results = collection.query(
    query_texts=["What are the hyve rules?"],
    n_results=10,
) 
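One way to compare the rankings produced by the two embedding functions is to measure how many of the top-k ids they share. The sketch below uses a hypothetical `top_k_overlap` helper, and the id lists are made up for illustration; they are not the actual outputs of the script above. With chromadb, the ids for the first query would be read from `results["ids"][0]`.

```python
def top_k_overlap(ids_a, ids_b, k):
    """Fraction of the top-k ids that appear in both rankings."""
    if k <= 0:
        raise ValueError("k must be positive")
    return len(set(ids_a[:k]) & set(ids_b[:k])) / k

# Made-up rankings standing in for results["ids"][0] from each embedding function.
default_ids = ["doc8", "doc9", "doc11", "doc10", "doc1"]
openai_ids = ["doc2", "doc1", "doc9", "doc5", "doc3"]

overlap = top_k_overlap(default_ids, openai_ids, k=3)
print(f"top-3 overlap: {overlap:.2f}")
```

A low overlap between the two lists points at the embedding model as the source of the difference, since the store and the query are otherwise identical.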

Bearnardd avatar Jul 09 '23 20:07 Bearnardd

Thanks a lot guys for checking, appreciate it! So the culprit for the mismatched expectations was the OpenAI embeddings - I wonder why the direct Chroma way works so much better!

Thanks a lot, I can use the second version instead of the solution I had!

l0rinc avatar Jul 09 '23 20:07 l0rinc

@Bearnardd, @Guidosalimbeni is there a way for me to tip you guys for your help?

l0rinc avatar Jul 09 '23 20:07 l0rinc

@paplorinc frankly I have no idea why OpenAI embeddings work so much worse. To be honest I would assume they would be way better than the default ones.

Bearnardd avatar Jul 09 '23 21:07 Bearnardd

> @Bearnardd, @Guidosalimbeni is there a way for me to tip you guys for your help?

@paplorinc Thank you for the kind offer, but in the open source community we help each other just for the sake of helping :)

Bearnardd avatar Jul 09 '23 21:07 Bearnardd

Let me return the favor somehow, you guys were really helpful!

l0rinc avatar Jul 09 '23 21:07 l0rinc

These searches are working a lot better now. Just a note that all-MiniLM-L6-v2 seems to require a lot more memory: the pod was suddenly crashing with OOM.

l0rinc avatar Jul 14 '23 12:07 l0rinc

Hi, @paplorinc! I'm Dosu, and I'm helping the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, the issue you reported was about the similarity search returning random documents instead of the ones that contain the specified keywords. @Guidosalimbeni provided a code snippet to debug the issue, and @Bearnardd tested the standalone chromadb script and found that the outputs are valid, indicating that there is no bug. You thanked everyone for their help and mentioned that the OpenAI embeddings were the cause of the mismatched expectations.

I wanted to check with you if this issue is still relevant to the latest version of the LangChain repository. If it is, please let the LangChain team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your understanding, and please don't hesitate to reach out if you have any further questions or concerns.

dosubot[bot] avatar Oct 14 '23 20:10 dosubot[bot]