
[Feature Request]: autogenerate id if not supplied

Open swyxio opened this issue 2 years ago • 3 comments

Describe the problem

i often don't care about the id of my documents, i just want to throw as much into chroma as possible and only query it by vector search.

right now chroma forces me to supply an id (so then i have to go remember how to generate an id; it's easy in python fortunately, but less so in js) and sometimes i screw this up:

  # can you spot the bug?
  import uuid
  for result in results["organic_results"]:
    print("scraping", result["link"])
    content = bs4_scrape(result["link"])
    collection.add(ids=[uuid.uuid4()],
                documents=[content],
                metadatas=[{
                  "snippet": result["snippet"],
                  "link": result["link"],
                  "title": result["title"],
                }])

can you spot the bug? i didn't do str() on my uuid.

ValueError(f"Expected ID to be a str, got {id}")
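For reference, a minimal sketch of the fix using only the standard library: `uuid.uuid4()` returns a `uuid.UUID` object, so it has to go through `str()` before being passed in `ids`. The `validate_id` helper here is hypothetical; it only mimics the check implied by the error message above and is not chroma's actual code.

```python
import uuid


def validate_id(id):
    # Hypothetical stand-in for the check behind the error message above.
    if not isinstance(id, str):
        raise ValueError(f"Expected ID to be a str, got {id}")
    return id


raw = uuid.uuid4()          # a uuid.UUID object, not a str
fixed = str(uuid.uuid4())   # the string form of a UUID

validate_id(fixed)          # accepted
try:
    validate_id(raw)        # rejected, as in the scraping loop above
except ValueError as err:
    print(err)
```

In the loop above, the equivalent change would be `ids=[str(uuid.uuid4())]`.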

all of this has nothing to do with what i wanted in the first place.

Describe the proposed solution

make id optional, autogenerate uuid if not supplied
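Until something like this exists in the library, the same behavior can be approximated on the caller's side. The sketch below is an assumption about how that might look, not chroma's API: `add_with_auto_ids` is a hypothetical helper that wraps the `collection.add(ids=..., documents=..., metadatas=...)` call used earlier in this issue, and `RecordingCollection` is a made-up stand-in so the example runs without chroma installed.

```python
import uuid
from typing import Any, Dict, List, Optional, Sequence


def add_with_auto_ids(
    collection: Any,
    documents: Sequence[str],
    metadatas: Optional[Sequence[Dict[str, Any]]] = None,
    ids: Optional[Sequence[str]] = None,
) -> List[str]:
    """Call collection.add, generating string UUIDs when ids is omitted."""
    if ids is None:
        # One random UUID per document, already converted to str.
        ids = [str(uuid.uuid4()) for _ in documents]
    if len(ids) != len(documents):
        raise ValueError("ids and documents must have the same length")

    kwargs: Dict[str, Any] = {"ids": list(ids), "documents": list(documents)}
    if metadatas is not None:
        kwargs["metadatas"] = list(metadatas)
    collection.add(**kwargs)
    # Return the ids so the caller can still keep track of them if needed.
    return list(ids)


class RecordingCollection:
    """Hypothetical stand-in that only records what add() receives."""

    def __init__(self) -> None:
        self.calls: List[Dict[str, Any]] = []

    def add(self, **kwargs: Any) -> None:
        self.calls.append(kwargs)


demo = RecordingCollection()
generated = add_with_auto_ids(demo, ["first page", "second page"])
explicit = add_with_auto_ids(demo, ["third page"], ids=["my-id"])
```

Returning the generated ids keeps the option of bookkeeping open without forcing it on callers who only query by vector search.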

Alternatives considered

No response

Importance

nice to have

Additional Information

No response

swyxio avatar Apr 26 '23 21:04 swyxio

we probably use id for indexing (?) but would be interesting to deprecate the id field completely and just treat it like metadata. shrinks the api surface area.

swyxio avatar Apr 26 '23 21:04 swyxio

I think it would be great to have the possibility for an auto increment ID. This way the user doesn't have to worry about that.

Msa360 avatar Apr 28 '23 00:04 Msa360

Sorry if this is an obvious or silly question, but is it that much of a performance hit to simply get the total count in a collection, increase it by one, and use that as the ID any time you are storing?

Am I missing something here?

Electrofried avatar May 01 '23 14:05 Electrofried

usual reason why databases avoid autoincrement id (which is not my suggestion) is that it doesn't scale when distributed (need global lock, or too closely colocates data on disk, etc).
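A toy illustration of the locking point, assuming a plain dict in place of a real collection and a hypothetical `next_id` helper that implements the "count plus one" idea from the question above. The two writers are simulated sequentially rather than with real concurrency; without a lock around read-then-write, both can observe the same count.

```python
# A dict stands in for the collection; keys are ids, values are documents.
store = {"1": "doc one", "2": "doc two"}


def next_id(store):
    # "Count plus one" id scheme (hypothetical helper).
    return str(len(store) + 1)


# Two writers read the count before either one has written.
id_a = next_id(store)
id_b = next_id(store)

store[id_a] = "doc from writer A"
store[id_b] = "doc from writer B"  # same key, so writer A's doc is replaced
```

Random UUIDs sidestep this because each writer can generate an id without coordinating with the others.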

swyxio avatar May 08 '23 01:05 swyxio

Would be very neat, I just stumbled over that :)

catbears avatar Jun 28 '23 06:06 catbears

@catbears sorry about that! our theory with ids is that it helps the developer think about bookkeeping... and avoid the sharp edges of not thinking about it and wishing one had! But tools like langchain just generate a uuid.uuid1()... and it's definitely a strange property of a database to force it.

jeffchuber avatar Jun 28 '23 06:06 jeffchuber

perhaps pipelines can solve this? autorandom?

jeffchuber avatar Sep 13 '23 21:09 jeffchuber

Discussed offline; we believe that forcing users to bookkeep ids leads to fewer second-order issues.

HammadB avatar Nov 07 '23 18:11 HammadB