cellxgene-census

Census cell similarity search: pipeline to build TileDB Vector Indexes of cell embeddings

Open · mlin opened this issue 2 months ago · 0 comments

Develop a productionizable pipeline that builds TileDB-Vector-Search indexes from the stored Census embeddings (starting with scVI, but also UCE, Geneformer, etc.). The pipeline consists of Python code that reads the sparse embedding arrays and builds the indexes (which are themselves TileDB arrays), packaged up for cloud deployment. Each set of embeddings is expected to take a few hours to index, and the different sets can be processed in parallel.
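
To make the shape of that pipeline concrete, here is a rough, standard-library-only sketch of the two steps described above: turning sparse (cell, dimension, value) entries into one dense vector per cell, and running one build per embedding set in parallel. Everything named here is hypothetical (`densify`, `build_all`, and the `build_index` callable); it assumes each embedding set can be read as coordinate triples keyed by a cell ID, and that `build_index` would wrap whatever TileDB-Vector-Search ingestion call the real pipeline ends up using. At Census scale the densify step would need a chunked, array-based implementation rather than Python lists, and the parallelism would more likely be separate WDL tasks than threads.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, Iterable, List, Tuple

# One sparse entry: (cell ID, embedding dimension, value). Assumed layout.
CooTriple = Tuple[int, int, float]


def densify(triples: Iterable[CooTriple], n_dims: int) -> Tuple[List[int], List[List[float]]]:
    """Group sparse triples into one dense vector per cell.

    Entries missing from the sparse input stay at 0.0. Returns the cell IDs in
    ascending order together with their vectors, in the same order.
    """
    rows: Dict[int, List[float]] = {}
    for cell_id, dim, value in triples:
        rows.setdefault(cell_id, [0.0] * n_dims)[dim] = value
    cell_ids = sorted(rows)
    return cell_ids, [rows[c] for c in cell_ids]


def build_all(
    sources: Dict[str, Iterable[CooTriple]],
    n_dims: Dict[str, int],
    build_index: Callable[[str, List[int], List[List[float]]], Any],
    max_workers: int = 4,
) -> Dict[str, Any]:
    """Build one index per embedding set, running the sets concurrently.

    `build_index(name, cell_ids, vectors)` is a placeholder for the real
    index-building step; its return value is collected per embedding name.
    """

    def build_one(name: str) -> Any:
        cell_ids, vectors = densify(sources[name], n_dims[name])
        return build_index(name, cell_ids, vectors)

    names = list(sources)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with names.
        return dict(zip(names, pool.map(build_one, names)))
```

Passing the cell IDs alongside the vectors is deliberate: a similarity query needs to map index hits back to Census cells, so the IDs would presumably be stored as the index's external IDs.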

Unless suggested otherwise, I'll package this as a dockerized WDL pipeline since that's most familiar to me (@mlin).

mlin · Apr 25 '24 00:04