ChromaDB error when using HuggingFace Embeddings

Open juanps90 opened this issue 2 years ago • 5 comments

The following error appears at the end of the script:

TypeError: 'NoneType' object is not callable
Exception ignored in: <function PersistentDuckDB.__del__ at 0x7f53e574d4c0>
Traceback (most recent call last):
  File ".../.local/lib/python3.9/site-packages/chromadb/db/duckdb.py", line 445, in __del__
AttributeError: 'NoneType' object has no attribute 'info'

... and comes up when doing:

embedding = HuggingFaceEmbeddings(model_name="hiiamsid/sentence_similarity_spanish_es")
docsearch = Chroma.from_documents(texts, embedding, persist_directory=persist_directory)

but doesn't happen with:

embedding = LlamaCppEmbeddings(model_path=path)

juanps90 avatar Apr 06 '23 19:04 juanps90

I suspect that we have encountered a bug, but fortunately, we have found a workaround to mitigate potential errors with ChromaDB.

https://github.com/hwchase17/langchain/issues/2491#issuecomment-1499274206

sergerdn avatar Apr 06 '23 19:04 sergerdn

Worked beautifully.

juanps90 avatar Apr 06 '23 20:04 juanps90

The source of the bug is that the __del__ method (https://github.com/chroma-core/chroma/blob/main/chromadb/db/duckdb.py#L444) is getting called after other resources, such as the logger and os, have already been deleted. You can call chroma.persist() before exiting and your data will still be saved, but I don't see any easy way to fix the bug itself.
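
A minimal sketch of that workaround, using a hypothetical stand-in class rather than Chroma's real API: the store is flushed explicitly while the interpreter is still fully alive, so nothing depends on __del__ running during shutdown. TinyStore, its JSON file format, and the atexit fallback are assumptions made for illustration only.

```python
import atexit
import json
import os
import tempfile


class TinyStore:
    """Hypothetical stand-in for a persistent store; not Chroma's real class."""

    def __init__(self, path):
        self.path = path
        self.rows = {}
        self.dirty = False

    def add(self, key, value):
        self.rows[key] = value
        self.dirty = True

    def persist(self):
        # Explicit flush: safe because it runs while all modules are still loaded.
        with open(self.path, "w", encoding="utf-8") as fh:
            json.dump(self.rows, fh)
        self.dirty = False

    def persist_if_dirty(self):
        # Only write when there are unsaved changes.
        if self.dirty:
            self.persist()


path = os.path.join(tempfile.mkdtemp(), "store.json")
store = TinyStore(path)

# Fallback: atexit callbacks run at normal interpreter exit,
# earlier than object finalizers triggered by module teardown.
atexit.register(store.persist_if_dirty)

store.add("doc-1", "hello")
store.persist()  # save explicitly instead of relying on __del__

with open(path, encoding="utf-8") as fh:
    saved = json.load(fh)
```

The same idea applies to the real vector store: call its persist method yourself before the script ends, rather than leaving the write to a finalizer.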

nagolinc avatar May 02 '23 22:05 nagolinc

Hello community, has this issue been resolved? or what's the workaround?

zhenghax avatar May 20 '23 04:05 zhenghax

@zhenghax

Hello community, has this issue been resolved? or what's the workaround?

I believe this has been fixed: https://github.com/chroma-core/chroma/issues/364

nagolinc avatar May 23 '23 18:05 nagolinc

I used the "BAAI/bge-base-en" embedding and successfully created a Chroma database.


# Supplying a persist_directory will store the embeddings on disk
persist_directory = '/content/drive/MyDrive/db'

# Here is the new embedding model being used
embedding = model_norm  # "BAAI/bge-base-en"

# Load a vector database from the persist directory; pay attention to the embedding_function parameter

vectordb = Chroma(persist_directory=persist_directory, embedding_function=embedding)

I tried to use the "collection" class:

collection = client.get_collection(name='langchain', embedding_function=embedding)

collection.count() # 467

But I was not able to add a record to the collection using this code:

document="""

About the author

Arthur C. Brooks is an American social scientist, the William Henry
Bloomberg Professor of the Practice of Public Leadership at the
Harvard Kennedy School, and Professor of Management Practice at
the Harvard Business School. Prior, he was the president of the
American Enterprise Institute for ten years, where he held the Beth
and Ravenel Curry Chair in Free Enterprise. He has authored eleven
books, including the bestsellers Love Your Enemies and The
Conservative Heart, and writes the popular How to Build a Life
column at The Atlantic. He is also the host of the podcast The Art of
Happiness with Arthur Brooks.
"""

collection.add(
  documents=[document],
  metadatas=[{"page": 1, "source": "/content/drive/MyDrive/book/about_the_author.pdf"}],
  ids=["467"]
)

and I got an error:

TypeError Traceback (most recent call last) in <cell line: 17>()
     15 """
     16
---> 17 collection.add(
     18   documents=[document],
     19   metadatas=[{"page": 1, "source": "/content/drive/MyDrive/book/about_the_author.pdf"}],

1 frames
/usr/local/lib/python3.10/dist-packages/chromadb/api/models/Collection.py in _validate_embedding_set(self, ids, embeddings, metadatas, documents, require_embeddings_or_documents)
    380 "You must provide embeddings or a function to compute them"
    381 )
--> 382 embeddings = self._embedding_function(documents)
    383
    384 # if embeddings is None:

TypeError: 'HuggingFaceBgeEmbeddings' object is not callable
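
The traceback shows Chroma calling self._embedding_function(documents), so the native collection expects a plain callable, whereas LangChain embedding classes expose embed_documents() and embed_query() methods instead. One possible fix is a thin adapter. The sketch below uses a hypothetical FakeEmbeddings class in place of HuggingFaceBgeEmbeddings so that it runs without any model. The adapter name CallableEmbeddings and the parameter name `input` are assumptions; some chromadb versions check the __call__ signature, so the parameter name may need adjusting for the installed release.

```python
from typing import List


class FakeEmbeddings:
    """Hypothetical stand-in exposing a LangChain-style embed_documents method."""

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        # Toy vectors: [character count, word count].
        # A real model would return dense float vectors instead.
        return [[float(len(t)), float(len(t.split()))] for t in texts]


class CallableEmbeddings:
    """Adapter that makes an embed_documents-style object callable."""

    def __init__(self, embeddings):
        self._embeddings = embeddings

    def __call__(self, input: List[str]) -> List[List[float]]:
        # Delegate to the wrapped object's batch embedding method.
        return self._embeddings.embed_documents(list(input))


embed_fn = CallableEmbeddings(FakeEmbeddings())
vectors = embed_fn(["hello world", "chroma"])
```

With the real objects, the adapter would presumably be passed where the collection is obtained, for example as embedding_function=CallableEmbeddings(embedding) in client.get_collection(...). This has not been checked against any particular chromadb release.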

dtthanh1971 avatar Aug 18 '23 02:08 dtthanh1971

Hi, @juanps90! I'm Dosu, and I'm helping the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

Based on my understanding, the original issue was about a TypeError occurring when using HuggingFace Embeddings with ChromaDB. It seems that a workaround has been found to mitigate potential errors with ChromaDB, and a fix has been implemented. However, a new issue has been reported where a TypeError occurs when trying to add a record to a collection using the HuggingFaceBgeEmbeddings object.

Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your contribution to the LangChain repository!

dosubot[bot] avatar Nov 17 '23 16:11 dosubot[bot]

Is there a solution for this? From reading their documentation, it seems you need an API key to use HuggingFaceEmbeddings with Chroma, but not when using LangChain's version of Chroma.

Ideally, I'd like to use open-source embedding models from HuggingFace without paying.
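
One route that avoids any hosted API, sketched under assumptions: compute the vectors locally with the LangChain embeddings object and hand them to collection.add() through its embeddings argument, so Chroma never has to call an embedding function itself. The message in the traceback, "You must provide embeddings or a function to compute them", suggests this option exists. FakeEmbeddings and FakeCollection below are hypothetical stand-ins so the example runs without chromadb or a model, and add_with_local_embeddings is an illustrative helper, not part of either library.

```python
class FakeEmbeddings:
    """Hypothetical stand-in for a locally run LangChain embeddings object."""

    def embed_documents(self, texts):
        # Toy one-dimensional vectors based on text length.
        return [[float(len(t))] for t in texts]


class FakeCollection:
    """Hypothetical stand-in that records what add() receives."""

    def __init__(self):
        self.records = {}

    def add(self, ids, embeddings=None, metadatas=None, documents=None):
        if embeddings is None:
            raise ValueError("You must provide embeddings or a function to compute them")
        for i, record_id in enumerate(ids):
            self.records[record_id] = {
                "embedding": embeddings[i],
                "metadata": metadatas[i],
                "document": documents[i],
            }

    def count(self):
        return len(self.records)


def add_with_local_embeddings(collection, embedder, documents, metadatas, ids):
    # Compute the vectors up front, then pass them explicitly.
    vectors = embedder.embed_documents(documents)
    collection.add(ids=ids, embeddings=vectors, metadatas=metadatas, documents=documents)


collection = FakeCollection()
add_with_local_embeddings(
    collection,
    FakeEmbeddings(),
    documents=["About the author"],
    metadatas=[{"page": 1}],
    ids=["467"],
)
```

Queries against such a collection would need the same treatment: embed the query text locally and search by vector, since the collection has no embedding function of its own.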

ccmilne avatar Dec 01 '23 17:12 ccmilne

Did you find a good solution? Chroma.py only accepts HuggingFace embeddings, but I would rather use open-source embeddings as well.

ibozkurt79 avatar Dec 22 '23 04:12 ibozkurt79