DOC: Code/twitter-the-algorithm-analysis-deeplake not working as written

Open casWVU opened this issue 1 year ago • 3 comments

Issue with current documentation:

I followed the documentation @ https://python.langchain.com/docs/use_cases/code/twitter-the-algorithm-analysis-deeplake.

I replaced 'twitter-the-algorithm' with another code base I'm analyzing and used my own credentials from OpenAI and Deep Lake.

When I run the code (on VS Code for Mac with M1 chip), I get the following error:

    ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (1435,) + inhomogeneous part.

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/Users/catherineswope/Desktop/LangChain/fromLangChain.py", line 37, in <module>
        db.add_documents(texts)
      File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/langchain/vectorstores/base.py", line 91, in add_documents
        return self.add_texts(texts, metadatas, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/langchain/vectorstores/deeplake.py", line 184, in add_texts
        return self.vectorstore.add(
               ^^^^^^^^^^^^^^^^^^^^^
      File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/deeplake/core/vectorstore/deeplake_vectorstore.py", line 271, in add
        dataset_utils.extend_or_ingest_dataset(
      File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/deeplake/core/vectorstore/vector_search/dataset/dataset.py", line 409, in extend_or_ingest_dataset
        raise IncorrectEmbeddingShapeError()
    deeplake.util.exceptions.IncorrectEmbeddingShapeError: The embedding function returned embeddings of different shapes. Please either use different embedding function or exclude invalid files that are not supported by the embedding function.

This is the code snippet from my actual code:

    import os
    import getpass

    from langchain.embeddings.openai import OpenAIEmbeddings
    from langchain.vectorstores import DeepLake
    from langchain.document_loaders import TextLoader

    # get OPENAI API KEY and ACTIVELOOP_TOKEN
    os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
    os.environ["ACTIVELOOP_TOKEN"] = getpass.getpass("Activeloop Token:")

    embeddings = OpenAIEmbeddings(disallowed_special=())

    # clone from chattydocs git hub repo removedcomments branch and copy/paste path
    root_dir = "/Users/catherineswope/chattydocs/incubator-baremaps-0.7.1-removedcomments"
    docs = []
    for dirpath, dirnames, filenames in os.walk(root_dir):
        for file in filenames:
            try:
                loader = TextLoader(os.path.join(dirpath, file), encoding="utf-8")
                docs.extend(loader.load_and_split())
            except Exception as e:
                pass

    from langchain.text_splitter import CharacterTextSplitter

    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    texts = text_splitter.split_documents(docs)

    username = "caswvu"  # replace with your username from app.activeloop.ai
    db = DeepLake(
        dataset_path=f"hub://caswvu/baremaps",
        embedding_function=embeddings,
    )
    db.add_documents(texts)

    db = DeepLake(
        dataset_path="hub://caswvu/baremaps",
        read_only=True,
        embedding_function=embeddings,
    )

    retriever = db.as_retriever()
    retriever.search_kwargs["distance_metric"] = "cos"
    retriever.search_kwargs["fetch_k"] = 100
    retriever.search_kwargs["maximal_marginal_relevance"] = True
    retriever.search_kwargs["k"] = 10

    from langchain.chat_models import ChatOpenAI
    from langchain.chains import ConversationalRetrievalChain

    model = ChatOpenAI(model_name="gpt-3.5-turbo")  # switch to 'gpt-4'
    qa = ConversationalRetrievalChain.from_llm(model, retriever=retriever)

    questions = [
        "What does this code do?",
    ]
    chat_history = []

    for question in questions:
        result = qa({"question": question, "chat_history": chat_history})
        chat_history.append((question, result["answer"]))
        print(f"-> Question: {question} \n")
        print(f"Answer: {result['answer']} \n")

Idea or request for content:

Can you please help me understand how to fix the code to address the error message? Also, if applicable, please update the documentation so that others can avoid this error as well. Thank you!

casWVU avatar Jul 09 '23 15:07 casWVU

@casWVU What version of LangChain are you on?

devstein avatar Jul 09 '23 16:07 devstein

Hi. I'm on

Name: langchain

Version: 0.0.223

Thanks!

casWVU avatar Jul 09 '23 23:07 casWVU

Hey @casWVU! What DeepLake version are you using? This problem is related to the documents stored in the folder. Could you please filter out the files that you don't need? When a file of an unsupported format is passed to the OpenAI embedding function, it returns an empty list, and appending this empty list causes the error. In newer versions of deeplake, the exception should give you more details, but overall that's the issue.
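To see which chunks are affected, one option is to compare the lengths of the returned embedding vectors before adding anything to the dataset. The following is only a minimal sketch: `find_odd_embeddings` is a hypothetical helper (not part of LangChain or Deep Lake), and the toy vectors stand in for real embeddings.

```python
from collections import Counter
from typing import List, Sequence


def find_odd_embeddings(vectors: Sequence[Sequence[float]]) -> List[int]:
    """Return indices of vectors that are empty or differ from the most common length.

    Hypothetical helper for illustration; not a LangChain or Deep Lake API.
    """
    if not vectors:
        return []
    # Assume the most common length is the expected embedding dimension.
    expected_dim, _ = Counter(len(v) for v in vectors).most_common(1)[0]
    return [
        i for i, v in enumerate(vectors)
        if len(v) == 0 or len(v) != expected_dim
    ]


# Toy data standing in for real embeddings: the second entry is empty.
vectors = [[0.1, 0.2, 0.3], [], [0.4, 0.5, 0.6]]
print(find_odd_embeddings(vectors))  # prints [1]
```

In the script from this issue, the vectors could presumably be obtained with `embeddings.embed_documents([t.page_content for t in texts])`, and the flagged indices would then point at the chunks (and source files) to exclude. Note that this makes the same API calls as the ingestion itself.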

adolkhan avatar Jul 11 '23 08:07 adolkhan

Hi! I'm on the latest version:

Name: deeplake

Version: 3.6.8

Thanks so much for the insight. Do you know which file types aren't supported by OpenAI embeddings? I'm reading OpenAI documentation and searching the web but not finding anything.

casWVU avatar Jul 11 '23 23:07 casWVU

I'm not sure which files are not supported, but I have faced similar issues before. I would suggest excluding files that don't give the model any useful context, such as .lock or .DS_Store files.
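A rough sketch of that kind of filtering, applied while walking the repository, might look like the following. The extension allowlist and skip lists are assumptions chosen for illustration, and `list_source_files` is a hypothetical helper name; adjust both to the code base being indexed.

```python
import os
from typing import List

# Hypothetical filter values; adjust them for the repository being indexed.
ALLOWED_EXTENSIONS = {".java", ".py", ".md", ".txt", ".xml", ".json"}
SKIPPED_NAMES = {".DS_Store"}
SKIPPED_EXTENSIONS = {".lock"}


def list_source_files(root_dir: str) -> List[str]:
    """Return sorted paths under root_dir that pass the filters above."""
    kept = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            ext = os.path.splitext(name)[1].lower()
            # Drop files known to carry no useful context.
            if name in SKIPPED_NAMES or ext in SKIPPED_EXTENSIONS:
                continue
            # Keep only extensions that are expected to contain text.
            if ext in ALLOWED_EXTENSIONS:
                kept.append(os.path.join(dirpath, name))
    return sorted(kept)


if __name__ == "__main__":
    # Demo on the current directory; replace "." with the repository root.
    print(len(list_source_files(".")), "files kept")
```

The resulting paths can then be passed to `TextLoader` one by one, in place of the unfiltered `os.walk` loop in the original script. An allowlist is used here because it fails closed: an unexpected binary file is skipped rather than sent to the embedding function.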

adolkhan avatar Jul 12 '23 06:07 adolkhan

Btw, I saw that LangChain has updated the OpenAI-related code, so now it should raise this kind of exception:

    raise openai.error.APIError("OpenAI API returned an empty embedding")
    openai.error.APIError: OpenAI API returned an empty embedding

Just update langchain in your repo to the latest version.

adolkhan avatar Jul 12 '23 07:07 adolkhan

Thanks, I'll check it out.

casWVU avatar Jul 13 '23 02:07 casWVU

Hi, @casWVU! I'm Dosu, and I'm helping the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

Based on my understanding, you were experiencing an error when running the code provided in the documentation. It seems that the error message indicated an issue with the shape of the embeddings returned by the embedding function. You received assistance from "devstein" and "adolkhan" who asked for the versions of "LangChain" and "DeepLake" being used. "adolkhan" suggested filtering out unsupported file types and updating "LangChain" to the latest version, which now raises a more informative exception.

Before we close this issue, we wanted to check if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on this issue. Otherwise, feel free to close the issue yourself or it will be automatically closed in 7 days.

Thank you for your understanding and contribution to the LangChain project!

dosubot[bot] avatar Oct 12 '23 16:10 dosubot[bot]