chroma
[Feature Request]: autogenerate id if not supplied
Describe the problem
i often don't care about the ids of my documents, i just want to throw as much into chroma as possible and only query it by vector search.
right now chroma forces me to supply an id (so then i have to go remember how to generate one; it's easy in python fortunately, but less so in js), and sometimes i screw this up:
# can you spot the bug?
import uuid
for result in results["organic_results"]:
    print("scraping", result["link"])
    content = bs4_scrape(result["link"])
    collection.add(ids=[uuid.uuid4()],
                   documents=[content],
                   metadatas=[{
                       "snippet": result["snippet"],
                       "link": result["link"],
                       "title": result["title"],
                   }])
can you spot the bug? i didn't do str() on my uuid.
ValueError(f"Expected ID to be a str, got {id}")
all of this has nothing to do with what i wanted in the first place.
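for anyone hitting the same error, here is a minimal sketch of what goes wrong, using only the standard library. the variable names are made up for illustration; the point is just that uuid.uuid4() returns a UUID object, and str() is what turns it into the string the check expects.

```python
import uuid

raw_id = uuid.uuid4()      # a uuid.UUID object, not a str
string_id = str(raw_id)    # canonical hyphenated string form

# The type check that fails in the traceback is an isinstance-style test like this one.
print(isinstance(raw_id, str))     # the raw UUID object does not pass
print(isinstance(string_id, str))  # the converted value does
```

so the one-character-class fix in the loop is ids=[str(uuid.uuid4())].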
Describe the proposed solution
make id optional, autogenerate uuid if not supplied
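a rough sketch of what this could look like as a user-side helper today. StandInCollection and add_with_auto_ids are hypothetical names invented for this example, not part of chroma; the stand-in only mimics the add(ids=..., documents=..., metadatas=...) call shape and the str check from the error above, so the snippet runs without chroma installed.

```python
import uuid


class StandInCollection:
    """Hypothetical stand-in for a chroma collection, used only in this sketch."""

    def __init__(self):
        self.rows = {}

    def add(self, ids, documents, metadatas=None):
        metadatas = metadatas or [None] * len(ids)
        for id_, doc, meta in zip(ids, documents, metadatas):
            # Mirrors the validation error quoted in the issue.
            if not isinstance(id_, str):
                raise ValueError(f"Expected ID to be a str, got {id_}")
            self.rows[id_] = (doc, meta)


def add_with_auto_ids(collection, documents, metadatas=None, ids=None):
    """Generate one random UUID string per document when ids are not supplied."""
    if ids is None:
        ids = [str(uuid.uuid4()) for _ in documents]
    collection.add(ids=ids, documents=documents, metadatas=metadatas)
    return ids  # returned so the caller can still bookkeep them if needed


collection = StandInCollection()
new_ids = add_with_auto_ids(
    collection,
    documents=["first scraped page", "second scraped page"],
    metadatas=[{"link": "a"}, {"link": "b"}],
)
```

returning the generated ids keeps the door open for callers who do want to update or delete specific documents later.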
Alternatives considered
No response
Importance
nice to have
Additional Information
No response
we probably use id for indexing (?) but would be interesting to deprecate the id field completely and just treat it like metadata. shrinks the api surface area.
I think it would be great to have the possibility for an auto increment ID. This way the user doesn't have to worry about that.
Sorry if this is an obvious or silly question, but is it that much of a performance hit to simply get the total count of a collection, increase it by one, and use that as the ID any time you are storing?
Am I missing something here?
usual reason why databases avoid autoincrement ids (which is not my suggestion) is that they don't scale when distributed (you need a global lock, or the data ends up too closely colocated on disk, etc)
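to make the lock problem concrete, here is a toy simulation of the count-plus-one idea with two writers and no coordination. the dict and the helper name are assumptions for illustration only; nothing here reflects how chroma stores data.

```python
store = {}  # toy stand-in for a collection: id -> document


def next_id_from_count(store):
    """Derive the next id from the current number of stored items."""
    return str(len(store) + 1)


# Two writers both read the count before either one has written.
id_a = next_id_from_count(store)
id_b = next_id_from_count(store)

store[id_a] = "document from writer A"
store[id_b] = "document from writer B"  # same id, so A's document is replaced

print(id_a == id_b, len(store))
```

avoiding this requires serializing the read-then-write step, which is the global lock mentioned above. random uuids sidestep it because each writer can generate an id on its own.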
Would be very neat, I just stumbled over that :)
@catbears sorry about that! our theory with ids is that it helps the developer think about bookkeeping... and avoid the sharp edges of not thinking about it and later wishing one had! But tools like langchain just generate a uuid.uuid1()... and it's definitely a strange property for a database to force it.
perhaps pipelines can solve this? autorandom?
Discussed offline: we believe that forcing users to bookkeep ids leads to fewer second-order issues.