IndexError: list index out of range when using Chroma.from_documents

Open fraywang opened this issue 1 year ago • 4 comments

System Info

LangChain 0.0.186, macOS Ventura, Python 3.10

Who can help?

No response

Information

  • [ ] The official example notebooks/scripts
  • [X] My own modified scripts

Related Components

  • [ ] LLMs/Chat Models
  • [ ] Embedding Models
  • [ ] Prompts / Prompt Templates / Prompt Selectors
  • [ ] Output Parsers
  • [ ] Document Loaders
  • [X] Vector Stores / Retrievers
  • [ ] Memory
  • [ ] Agents / Agent Executors
  • [ ] Tools / Toolkits
  • [ ] Chains
  • [ ] Callbacks/Tracing
  • [ ] Async

Reproduction

Why do I get IndexError: list index out of range when using Chroma.from_documents?

import os

from langchain.document_loaders import BiliBiliLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter

os.environ["OPENAI_API_KEY"] = "***"

loader = BiliBiliLoader(["https://www.bilibili.com/video/BV18o4y137n1/"])

documents = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)

documents = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings()

db = Chroma.from_documents(documents, embeddings, persist_directory="./db")
db.persist()

Traceback (most recent call last):
  File "/bilibili/bilibili_embeddings.py", line 28, in <module>
    db = Chroma.from_documents(documents, embeddings, persist_directory="./db")
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/langchain/vectorstores/chroma.py", line 422, in from_documents
    return cls.from_texts(
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/langchain/vectorstores/chroma.py", line 390, in from_texts
    chroma_collection.add_texts(texts=texts, metadatas=metadatas, ids=ids)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/langchain/vectorstores/chroma.py", line 160, in add_texts
    self._collection.add(
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/chromadb/api/models/Collection.py", line 103, in add
    ids, embeddings, metadatas, documents = self._validate_embedding_set(
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/chromadb/api/models/Collection.py", line 354, in _validate_embedding_set
    ids = validate_ids(maybe_cast_one_to_many(ids))
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/chromadb/api/types.py", line 82, in maybe_cast_one_to_many
    if isinstance(target[0], (int, float)):
IndexError: list index out of range
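
The last frame of the traceback indexes target[0], which fails whenever the list reaching it is empty. Below is a minimal sketch of that mechanism. It uses a simplified stand-in function (a hypothetical name, not chromadb's actual code) and assumes the ids list arrives empty because no documents were produced upstream:

```python
from typing import Any, List


def cast_one_to_many_sketch(target: List[Any]) -> List[Any]:
    """Simplified, hypothetical stand-in for the check on the last traceback line."""
    # Indexing element 0 assumes the list holds at least one item.
    if isinstance(target[0], (int, float)):
        return [target]
    return target


def describe(ids: List[Any]) -> str:
    """Report what happens when the stand-in receives the given ids."""
    try:
        cast_one_to_many_sketch(ids)
        return "ok"
    except IndexError as exc:
        return f"IndexError: {exc}"


print(describe(["id-1", "id-2"]))  # ok
print(describe([]))                # IndexError: list index out of range
```

If this is what happens here, the thing to inspect is whatever produced the empty list, which may be the loader or the splitter rather than Chroma itself.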

Expected behavior

The index is generated successfully in the persist_directory.

fraywang, May 31 '23 02:05

Same error with loader = YoutubeLoader.from_youtube_url('https://www.youtube.com/watch?v=6qB1pYwIAlw')

fraywang, May 31 '23 03:05

I'm having issues with the BiliBiliLoader when calling loader.load()

RuntimeError: This event loop is already running

hanifaudah, Jun 01 '23 04:06

Having the same issue

iha2, Jun 02 '23 02:06

I had the same issue and I noticed that I had not named my source directory consistently. I don't see where you specify the source directory, but that might be the issue.

inputcoffee, Jun 02 '23 16:06

Same problem here, even though I set the path and everything.

Wamy-Dev, Jun 10 '23 03:06

I got this error when the length of the documents was 0

Try checking the contents of documents before loading into Chroma
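
One way to apply this check is a small guard placed before the Chroma call. This is only a sketch: non_empty_documents is a hypothetical helper name, and it assumes each document exposes a page_content attribute, as LangChain Document objects do.

```python
from typing import Any, List, Sequence


def non_empty_documents(documents: Sequence[Any], source: str = "loader") -> List[Any]:
    """Return the documents with non-blank page_content, or raise if none remain."""
    # Treat a missing or None page_content the same as an empty string.
    kept = [d for d in documents if (getattr(d, "page_content", "") or "").strip()]
    if not kept:
        raise ValueError(f"{source} produced no non-empty documents; nothing to index")
    return kept


# Intended use (commented out because it needs langchain and network access):
# documents = non_empty_documents(text_splitter.split_documents(loader.load()), "BiliBiliLoader")
# db = Chroma.from_documents(documents, embeddings, persist_directory="./db")
```

With a guard like this, an empty loader result surfaces as an explicit ValueError naming the source instead of an IndexError deep inside chromadb.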

acmoles, Jun 30 '23 10:06

"I got this error when the length of the documents was 0. Try checking the contents of documents before loading into Chroma."

I get this error even though my documents list is not empty. Is it mandatory to have metadata for each document? I do not currently need document metadata for my use case, so I just leave it out.

mateiAvram, Jul 17 '23 08:07

Try using embedding instead of embeddings (notice the s at the end). Example:

Chroma.from_documents(documents=texts, embedding=embedding_function, persist_directory=persist_directory)

Ahmad-Bunni, Sep 07 '23 16:09

Hi, @fraywang,

I'm helping the LangChain team manage their backlog and am marking this issue as stale. It looks like you encountered an "IndexError: list index out of range" when using Chroma.from_documents in the LangChain library. Other users offered several suggestions and code snippets to troubleshoot the issue, but the problem appears to remain unresolved.

Could you please confirm if this issue is still relevant to the latest version of the LangChain repository? If it is, please let the LangChain team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your understanding and cooperation. If you have any further questions or updates, feel free to reach out.

dosubot[bot], Dec 07 '23 16:12

I am also getting the same error.

Govindhkiruthi, Jul 27 '24 17:07