Does Chroma support multi-replica deployment options at scale? And what about index types?

Open agandra30 opened this issue 1 year ago • 0 comments

We are planning to try out and test the Chroma vector DB on a large corpus that can go above 100M vectors, up to 500M and eventually 1B vectors. At this scale I am looking at options for a production-level, distributed, cluster-style deployment of Chroma. In the official documentation I could only find a Docker image that we could install and host on our servers [it is in alpha state as per the official documentation]: https://cookbook.chromadb.dev/running/deployment-patterns/. Some of the links there are not working. Can we run replicas of the pods, and will that work?

What about load balancing and things like that? Is there a binary or any other format we could look at, such as Helm charts, that we can try?

Also, my assumption is that Chroma is mostly an embeddable DB, and I wanted to know its capabilities with multimodal datasets. Does it store the actual image itself in an encoded format?

We are focused on using two index types: HNSW and DiskANN. Does Chroma support DiskANN? Which index types does it support, and do we gain anything by hosting it on GPU-based machines for faster index creation and bulk inserts?

I need some understanding of how to manage storage for Chroma. Where can we set the parameters on a self-hosted instance to use a specific (block) volume or an NFS share, or can we point it at S3 object storage?

Are there any batch limitations, i.e. a maximum number of records we can push in one go?
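To make the batching question concrete, here is a minimal sketch of how we would split inserts on the client side if such a cap exists. It does not call Chroma at all: `max_batch` and the `add_fn` callback are placeholders I made up for illustration, not Chroma APIs or documented limits.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence, max_batch: int) -> Iterator[Sequence]:
    """Yield consecutive slices of `items`, each at most `max_batch` long."""
    if max_batch <= 0:
        raise ValueError("max_batch must be positive")
    for start in range(0, len(items), max_batch):
        yield items[start:start + max_batch]


def insert_in_batches(
    ids: Sequence[str],
    vectors: Sequence[List[float]],
    add_fn: Callable[[Sequence[str], Sequence[List[float]]], None],
    max_batch: int,
) -> int:
    """Call `add_fn` once per batch and return the number of batches sent.

    `add_fn` is a hypothetical stand-in for whatever insert call the
    database client exposes; `max_batch` is an assumed per-request cap.
    """
    if len(ids) != len(vectors):
        raise ValueError("ids and vectors must have the same length")
    batches = 0
    # Both sequences are sliced with the same boundaries, so each id
    # stays paired with its own vector.
    for id_chunk, vec_chunk in zip(chunked(ids, max_batch), chunked(vectors, max_batch)):
        add_fn(id_chunk, vec_chunk)
        batches += 1
    return batches
```

If there is a documented limit, we would plug it in as `max_batch` and pass the real insert call as `add_fn`.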

Any input is highly appreciated.

Sep 27 '24 06:09 agandra30