Chroma (duckdb+parquet) DB isn't saving the parquet files for persisted DB until application is killed
System Info
Chroma v0.2.36, python 3.9
Who can help?
No response
Information
- [ ] The official example notebooks/scripts
- [ ] My own modified scripts
Related Components
- [ ] LLMs/Chat Models
- [ ] Embedding Models
- [ ] Prompts / Prompt Templates / Prompt Selectors
- [ ] Output Parsers
- [ ] Document Loaders
- [X] Vector Stores / Retrievers
- [ ] Memory
- [ ] Agents / Agent Executors
- [ ] Tools / Toolkits
- [ ] Chains
- [ ] Callbacks/Tracing
- [ ] Async
Reproduction
`
settings = Settings(
    chroma_db_impl='duckdb+parquet',
    persist_directory="db",
    anonymized_telemetry=False
)
pages = self._load_single_document(file_path=file_path)
docs = text_splitter.split_documents(pages)
db = Chroma.from_documents(docs, embedding_function, client_settings=settings)
`
Expected behavior
All files for the database should be created in the db directory when Chroma.from_documents() is called. However, this is not happening: the parquet files are not created at that point, although all the files in the index directory are.
Later, when the application is killed (it is a Flask application, so between requests the DB is torn down and thus should be persisted), the parquet files show up.
Answer generated by a 🤖
Answer
The issue seems to be related to the persistence of the database. In the provided code, the persist() method is called when the object is destroyed. However, in the context of a Flask application, the object might not be destroyed until the application is killed, which is why the parquet files are only appearing at that time.
To solve this issue, you could explicitly call the persist() method after creating the database. This should force the database to write the parquet files immediately.
Here is how you can modify your code:
settings = Settings(
    chroma_db_impl='duckdb+parquet',
    persist_directory="db",
    anonymized_telemetry=False
)
pages = self._load_single_document(file_path=file_path)
docs = text_splitter.split_documents(pages)
db = Chroma.from_documents(docs, embedding_function, client_settings=settings)
# Explicitly persist the database
db.persist()
Please note that this solution assumes that the db object in your code is an instance of a class that has the persist() method, as shown in the provided context. If this is not the case, you might need to adjust the code accordingly.
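The deferred write described in this answer can be illustrated without Chroma at all. The sketch below uses a hypothetical `ToyStore` class (not part of chromadb or LangChain) that, like the behaviour described above, only writes to disk when the object is finalized. It assumes CPython, where dropping the last reference runs `__del__` immediately; in a long-lived Flask process that last reference may simply never be dropped.

```python
import os
import tempfile


class ToyStore:
    """Hypothetical stand-in for a store that flushes to disk in __del__."""

    def __init__(self, path):
        self.path = path
        self.rows = []

    def add(self, row):
        # Data is held in memory only; nothing touches the disk here.
        self.rows.append(row)

    def persist(self):
        # Explicit flush: write everything collected so far.
        with open(self.path, "w") as f:
            f.write("\n".join(self.rows))

    def __del__(self):
        # Implicit flush: runs only when the object is finalized.
        self.persist()


tmp = tempfile.mkdtemp()

# Case 1: rely on finalization. The file is missing while the object lives.
lazy_path = os.path.join(tmp, "lazy.txt")
lazy = ToyStore(lazy_path)
lazy.add("doc-1")
written_while_alive = os.path.exists(lazy_path)
del lazy  # last reference dropped, so CPython runs __del__ here
written_after_del = os.path.exists(lazy_path)

# Case 2: call persist() explicitly. The file exists right away.
eager_path = os.path.join(tmp, "eager.txt")
eager = ToyStore(eager_path)
eager.add("doc-1")
eager.persist()
written_after_persist = os.path.exists(eager_path)
del eager
```

The first case mirrors what the reporter sees (files appear only at teardown), and the second mirrors the suggested fix of calling persist() right after building the store.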
It is good practice to persist explicitly:
# Explicitly persist
db.persist()
Thank you. I added an explicit persist. Here's what I have debugging through the code:
db = Chroma.from_documents(...) # After this line, I see the index folder created, but no parquet files.
db.persist() gives the following error: ValueError: You must specify a persist_directory oncreation to persist the collection. But everything is being added to my persist directory, 'db'. When the application is killed, the parquet files show up in my specified persist directory.
That seems like a bug, definitely not expected behaviour
Okay, it seems I need to pass a persist directory explicitly, regardless of whether it is already in the client Settings object. The following does NOT produce the error, and works as expected:
settings = Settings(chroma_db_impl='duckdb+parquet', persist_directory='db', anonymized_telemetry=False)
db = Chroma(embedding_function=embedding_function, client_settings=settings, persist_directory='db')
Sounds like a bug with Chroma, I'll post over on that project. Closing this
On second look, it appears the issue is in the Chroma wrapper. The following code in langchain/vectorstores/chroma.py seems to be the problem:
def __init__(
    self,
    collection_name: str = _LANGCHAIN_DEFAULT_COLLECTION_NAME,
    embedding_function: Optional[Embeddings] = None,
    persist_directory: Optional[str] = None,
    client_settings: Optional[chromadb.config.Settings] = None,
    collection_metadata: Optional[Dict] = None,
    client: Optional[chromadb.Client] = None,
) -> None:
    """Initialize with Chroma client."""
    try:
        import chromadb
        import chromadb.config
    except ImportError:
        raise ValueError(
            "Could not import chromadb python package. "
            "Please install it with `pip install chromadb`."
        )
    if client is not None:
        self._client = client
    else:
        if client_settings:
            self._client_settings = client_settings
        else:
            self._client_settings = chromadb.config.Settings()
            if persist_directory is not None:
                self._client_settings = chromadb.config.Settings(
                    chroma_db_impl="duckdb+parquet",
                    persist_directory=persist_directory,
                )
        self._client = chromadb.Client(self._client_settings)
    self._embedding_function = embedding_function
    **self._persist_directory = persist_directory**
    self._collection = self._client.get_or_create_collection(
        name=collection_name,
        embedding_function=self._embedding_function.embed_documents
        if self._embedding_function is not None
        else None,
        metadata=collection_metadata,
    )
The bolded line indicates that the persist directory is only set if it is passed in as an argument, whereas if it is set in the client Settings object, I'd expect it to get unpacked and used. Is this expected behaviour?
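A minimal sketch of why the settings-only call fails, assuming that persist() checks `self._persist_directory` and raises when it is None (inferred from the error message reported earlier in this thread, not copied from the library). `ToySettings` and `ToyWrapper` are hypothetical stand-ins, not real chromadb or LangChain classes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToySettings:
    """Hypothetical stand-in for chromadb.config.Settings."""
    persist_directory: Optional[str] = None


class ToyWrapper:
    """Mimics only the persist-directory handling quoted above."""

    def __init__(self, persist_directory=None, client_settings=None):
        self._client_settings = client_settings or ToySettings(persist_directory)
        # Same pattern as the bolded line: only the keyword argument is
        # stored; the value inside client_settings is never consulted.
        self._persist_directory = persist_directory

    def persist(self):
        # Assumed check, inferred from the reported ValueError.
        if self._persist_directory is None:
            raise ValueError(
                "You must specify a persist_directory oncreation "
                "to persist the collection."
            )
        return self._persist_directory


# Directory given only through the settings object: persist() raises.
settings_only = ToyWrapper(client_settings=ToySettings("db"))
try:
    settings_only.persist()
    settings_only_error = None
except ValueError as exc:
    settings_only_error = str(exc)

# Directory given in both places (the workaround above): persist() succeeds.
both = ToyWrapper(persist_directory="db", client_settings=ToySettings("db"))
both_result = both.persist()
```

This reproduces the symptom in isolation: the client still knows where to write, which is why files eventually land in 'db', while the wrapper's own check refuses to persist.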
@JackLeick I did some investigation around this and feel your issue is valid. Following is the summary. Do let me know if it addresses your problem statement.
Issue: persist_directory is not being set when it's passed in the client_settings object. This is likely because persist_directory is only set when it's passed as a separate argument to the Chroma constructor, and not when it's included in the client_settings object.
Code Fix
def __init__(
    self,
    collection_name: str = _LANGCHAIN_DEFAULT_COLLECTION_NAME,
    embedding_function: Optional[Embeddings] = None,
    persist_directory: Optional[str] = None,
    client_settings: Optional[chromadb.config.Settings] = None,
    collection_metadata: Optional[Dict] = None,
    client: Optional[chromadb.Client] = None,
) -> None:
    """Initialize with Chroma client."""
    try:
        import chromadb
        import chromadb.config
    except ImportError:
        raise ValueError(
            "Could not import chromadb python package. "
            "Please install it with `pip install chromadb`."
        )
    if client is not None:
        self._client = client
    else:
        if client_settings:
            self._client_settings = client_settings
            # Updated: take _persist_directory from
            # client_settings.persist_directory, falling back to the
            # persist_directory argument. persist() still requires a
            # persist_directory, as in the original design.
            self._persist_directory = (
                client_settings.persist_directory
                if client_settings.persist_directory
                else persist_directory
            )
        else:
            self._client_settings = chromadb.config.Settings()
            if persist_directory is not None:
                self._client_settings = chromadb.config.Settings(
                    chroma_db_impl="duckdb+parquet",
                    persist_directory=persist_directory,
                )
            self._persist_directory = persist_directory  # it should be here
        self._client = chromadb.Client(self._client_settings)
    self._embedding_function = embedding_function
    # self._persist_directory = persist_directory  # it should not be here
    self._collection = self._client.get_or_create_collection(
        name=collection_name,
        embedding_function=self._embedding_function.embed_documents
        if self._embedding_function is not None
        else None,
        metadata=collection_metadata,
    )
Change Summary: The __init__ method of the Chroma class has been updated to handle the persist_directory parameter correctly. Previously, persist_directory was only set if it was passed as a separate argument to the Chroma constructor, and not when it was included in the client_settings object. The updated code sets it in each of the scenarios below.
Addressed Scenarios:
- persist_directory is provided in client_settings: If client_settings is provided and it includes persist_directory, then self._persist_directory is set to client_settings.persist_directory.
- persist_directory is not provided in client_settings but is passed as an argument: If client_settings is provided but it does not include persist_directory, and persist_directory is passed as a separate argument, then self._persist_directory is set to the persist_directory argument.
- client_settings is not provided, but persist_directory is passed as an argument: If client_settings is not provided and persist_directory is passed as a separate argument, then self._persist_directory is set to the persist_directory argument.
- Neither client_settings nor persist_directory is provided: If neither client_settings nor persist_directory is provided, then self._persist_directory remains unset.
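The four scenarios above boil down to one precedence rule, which can be sketched as a standalone function. `resolve_persist_directory` and `ToySettings` are hypothetical names introduced here for illustration; `ToySettings` defaults its directory to None, which may not match the defaults of the real chromadb Settings class.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToySettings:
    """Hypothetical stand-in for chromadb.config.Settings."""
    persist_directory: Optional[str] = None


def resolve_persist_directory(
    client_settings: Optional[ToySettings],
    persist_directory: Optional[str],
) -> Optional[str]:
    """Prefer the directory in client_settings, else the explicit argument."""
    if client_settings is not None and client_settings.persist_directory:
        return client_settings.persist_directory
    return persist_directory


# One call per scenario from the list above.
from_settings = resolve_persist_directory(ToySettings("db"), None)
from_arg_with_settings = resolve_persist_directory(ToySettings(), "db_arg")
from_arg_only = resolve_persist_directory(None, "db_arg")
neither = resolve_persist_directory(None, None)
```

Note that the quoted fix only assigns self._persist_directory when no ready-made client is passed in, so a caller supplying `client` would fall outside these four scenarios.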
Thank you! This looks like a great fix. That will address my issue.
The issue has been addressed via commit: https://github.com/hwchase17/langchain/commit/a2830e3056e4e616160b150bf5ea212a97df2dc4
Hi, @JackLeick. I'm Dosu, and I'm helping the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.
From what I understand, the issue you reported was related to the Chroma wrapper not saving parquet files when using the duckdb+parquet database implementation. The problem was resolved by explicitly calling the persist() method after creating the database. This issue was identified as a bug in the Chroma wrapper and has been fixed in a commit.
Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on this issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.
Thank you for your contribution to the LangChain repository, and please let us know if you have any further questions or concerns.