[Feature Request]: Langchain plugin for Chroma always tries to create the collection even if the collection already exists.

Open harshal-cuminai opened this issue 1 year ago • 12 comments

Describe the problem

Use Case: Only allow querying a collection hosted on a remote chroma server for similarity search. The assumption is that the triple (tenant, db, collection) always exists and that the client always passes values that already exist in the db. If not, we error out.

Problem: We are trying to integrate the Chroma db server into an application. We use chroma's langchain plugin for client side testing and wish to support client side integration with Langchain with limited access to chroma server.

chroma_client = chromadb.HttpClient(host='<chroma server host>', port=443, tenant="<tenant>", database="<db>", ssl=True)

db = Chroma(
    client=chroma_client,
    collection_name="demo",
    embedding_function=embedding_function,
)

retriever = db.as_retriever(search_kwargs={"k": 3})

The problem is that we don't want to expose all the API endpoints of the chroma server; our app ingress rules expose only the following:

  1. Get Tenant by Name
  2. Get Database by Name
  3. Get Collection by Name
  4. Query Collection (Note: We are not exposing Create Collection endpoint)

This works great with plain chromadb, as shown below, assuming the collection "demo" has already been created. The code uses only the 4 API calls listed above.

import chromadb
from chromadb.utils import embedding_functions
from chromadb import Settings

embedding_function = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")
chroma_client = chromadb.HttpClient(host='<remote host name>', port=443, tenant="<tenant>", database="<db>", ssl=True)
demo_collection = chroma_client.get_collection(name="demo", embedding_function=embedding_function)

results = demo_collection.query(
    query_texts=["<query>"],
    n_results=2
)

However, the langchain plugin defaults get_or_create to true and so always tries to create the collection, which errors out because we are not exposing the Create Collection API.

Describe the proposed solution

We should have an option to set get_or_create to false.

db = Chroma(
    client=chroma_client,
    collection_name="demo",
    embedding_function=embedding_function,
    get_or_create=False,
)
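A minimal sketch of the behaviour requested here, not the plugin's actual code: `resolve_collection` is a hypothetical helper, and the flag name follows this proposal (the released package may use a different name). It relies only on the chromadb client methods `get_collection` and `get_or_create_collection`. With the flag off, only the read-only lookup is made, so a missing collection surfaces as the client's own error.

```python
from typing import Any


def resolve_collection(client: Any, name: str, get_or_create: bool = True, **kwargs: Any) -> Any:
    """Hypothetical helper: choose the client call based on the flag."""
    if get_or_create:
        # Current default: may trigger the Create Collection call.
        return client.get_or_create_collection(name=name, **kwargs)
    # Proposed opt-out: read-only lookup. The client raises if the
    # collection does not exist, so nothing is ever created.
    return client.get_collection(name=name, **kwargs)
```

With `get_or_create=False`, a deployment that exposes only the four endpoints listed earlier would never see a create request from this code path.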

Alternatives considered

No response

Importance

i cannot use Chroma without it

Additional Information

No response

harshal-cuminai • May 08 '24 07:05

@jeffchuber @tazarov need help with this.

harshal-cuminai • May 08 '24 07:05

@harshal-cuminai, thanks for the elaborate and deep exploration of the issue. Separating your ingestion and query/get flows makes sense for more than security reasons.

Just off the top of my head, I see two options here:

  • Small change in Langchain🦜🔗 as per the suggested approach or similar to it
  • Add auth to your API, thus rejecting anonymous (write) requests

Is auth something you can work with? If yes, then I can give you some configs to try out. It might be worth it until we figure out a more flexible solution.

tazarov • May 08 '24 09:05

Hi @tazarov, sure, we are open to any temporary solution until some variant of the proposed solution becomes a first-class integration in Langchain.

Currently we do have auth set up as a subprocess for the nginx proxy sitting in front of the chromadb service. But our use case requires rejecting collection creation altogether (even for authenticated clients), which is not possible with the current langchain integration. So I am thinking we will probably have to redirect the POST call for collection creation (triggered by langchain), after authentication (based on client role), as follows:

Original: POST /collections
Modified Rewrite: GET /collections/<collection name>

since both have the same response schema and output when get_or_create is set to true, as it is in the current case.

What approach are you suggesting?

harshal-cuminai • May 08 '24 10:05

Rewriting sounds like a sensible approach. However, you'll have to read the POST payload to get the name attribute and then pass that to the GET. I think for NGINX, that translates to a bit of Lua scripting.
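As a rough illustration of that rewrite logic (in Python rather than Lua, and not tied to any particular proxy), the sketch below parses the POST body, pulls out `name`, and builds the GET target. `rewrite_create_to_get` is a hypothetical function, and the `/collections` paths are taken from the comment above rather than from the actual Chroma route layout.

```python
import json
from typing import Tuple
from urllib.parse import quote


def rewrite_create_to_get(method: str, path: str, body: bytes) -> Tuple[str, str]:
    """Map 'POST /collections' to 'GET /collections/<name>'; pass anything else through."""
    if method.upper() != "POST" or path.rstrip("/") != "/collections":
        return method, path
    payload = json.loads(body or b"{}")
    name = payload.get("name") if isinstance(payload, dict) else None
    if not isinstance(name, str) or not name:
        raise ValueError("create-collection payload has no 'name' attribute")
    # Percent-encode the name so it is safe as a single path segment.
    return "GET", f"/collections/{quote(name, safe='')}"
```

A proxy applying this would also need to drop the request body before forwarding the GET, and decide what to return when the payload carries no name.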

tazarov • May 08 '24 10:05

Yes, correct. Is there any cheaper alternative you can suggest?

On a side note, it would be best to have this as a first-class feature in langchain-chroma. wdyt?

harshal-cuminai • May 08 '24 10:05

I've already written up the Langchain🦜🔗 PR, just adding tests, and off it goes. However, it might take a few days to merge and release it. Your problem is not uncommon, or at least shouldn't be, for publicly facing products where you'd want a modicum of control over who can write to the DB.

tazarov • May 08 '24 10:05

@harshal-cuminai, PR in Langchain🦜🔗 created.

tazarov • May 08 '24 11:05

@harshal-cuminai The PR should be in the next release.

tazarov • May 09 '24 15:05

@tazarov is the package auto published on release? https://pypi.org/project/langchain-chroma/#history

harshal-cuminai • May 11 '24 09:05

@harshal-cuminai, I think they do separate releases for partner libs. But you can always do the following:

With pip:

pip install git+https://github.com/langchain-ai/langchain.git@master#subdirectory=libs/partners/chroma

In requirements.txt:

git+https://github.com/langchain-ai/langchain.git@master#subdirectory=libs/partners/chroma

In pyproject.toml:

[tool.poetry.dependencies]
langchain-chroma = { git = "https://github.com/langchain-ai/langchain.git", branch = "master", subdirectory = "libs/partners/chroma" }

tazarov • May 11 '24 12:05

Perfect, this works. Thanks a ton, @tazarov. Closing this thread now.

harshal-cuminai • May 11 '24 12:05

@tazarov now that we have tested it locally, we are kind of blocked from releasing our package until this change is rolled out in the langchain-chroma package (we can't roll out packages with direct repo-based dependencies). I have left a comment on your langchain PR, but is there a way you folks can expedite the release?

harshal-cuminai • May 12 '24 12:05

Closing, as 0.1.1 is released.

harshal-cuminai • May 16 '24 06:05