Persistent storage of matrices that enables quick indexed lookup

Open dhimmel opened this issue 7 years ago • 4 comments

Currently, we're storing our datasets (which are matrices) as compressed TSVs, which are great for long-term, interoperable storage. However, we'd love a way to look up specific rows and columns without having to read the entire dataset. We began discussing options at https://github.com/cognoma/cognoma/issues/17#issuecomment-233149110. We want a persistent storage format (i.e. a file) that allows reading only specified rows and columns into a numpy array/matrix or a pandas dataframe.
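To make the requirement concrete, here is a minimal sketch of one possible approach, not a format the project has settled on. It assumes the values are stored as an uncompressed `.npy` file that numpy memory-maps, with row and column labels in a JSON sidecar. The helper names `save_matrix` and `load_subset`, the file layout, and the sample/gene labels are all hypothetical.

```python
import json
import os
import tempfile

import numpy as np
import pandas as pd


def save_matrix(df, prefix):
    """Write values to <prefix>.npy and labels to <prefix>.json (hypothetical layout)."""
    np.save(prefix + ".npy", df.to_numpy(dtype="float64"))
    labels = {
        "rows": [str(label) for label in df.index],
        "columns": [str(label) for label in df.columns],
    }
    with open(prefix + ".json", "w") as handle:
        json.dump(labels, handle)


def load_subset(prefix, rows, columns):
    """Return only the requested rows and columns as a DataFrame."""
    with open(prefix + ".json") as handle:
        labels = json.load(handle)
    # Map each label to its integer position in the stored matrix.
    row_pos = {label: i for i, label in enumerate(labels["rows"])}
    col_pos = {label: i for i, label in enumerate(labels["columns"])}
    row_idx = [row_pos[label] for label in rows]
    col_idx = [col_pos[label] for label in columns]
    # mmap_mode="r" maps the file instead of loading it; indexing then
    # copies only the requested cells into memory.
    values = np.load(prefix + ".npy", mmap_mode="r")
    subset = np.asarray(values[np.ix_(row_idx, col_idx)])
    return pd.DataFrame(subset, index=rows, columns=columns)


# Tiny made-up matrix: 3 samples by 4 genes.
demo = pd.DataFrame(
    np.arange(12.0).reshape(3, 4),
    index=["s1", "s2", "s3"],
    columns=["g1", "g2", "g3", "g4"],
)

with tempfile.TemporaryDirectory() as tmp:
    prefix = os.path.join(tmp, "demo")
    save_matrix(demo, prefix)
    subset = load_subset(prefix, rows=["s3", "s1"], columns=["g2", "g4"])

print(subset)
```

The trade-off in this sketch is that the `.npy` file is uncompressed, so it gives up the space savings of the compressed TSV in exchange for positional access.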

A primary benchmark for judging implementations is how much time they save over reading the entire bzipped TSV into Python via pandas, across a variety of setups.
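A rough sketch of how such a comparison could be timed follows. The baseline is the full pandas read of a bzipped TSV described above. The candidate (a memory-mapped `.npy` file) is only an illustrative stand-in, and the matrix size, file names, labels, and the `best_time` helper are assumptions made for the example.

```python
import os
import tempfile
import timeit

import numpy as np
import pandas as pd


def best_time(func, repeats=3):
    """Return the fastest wall-clock time in seconds over `repeats` single calls."""
    return min(timeit.repeat(func, number=1, repeat=repeats))


# Small random matrix standing in for a real dataset (sizes are arbitrary).
rng = np.random.default_rng(0)
matrix = pd.DataFrame(
    rng.random((200, 50)),
    index=[f"sample_{i}" for i in range(200)],
    columns=[f"gene_{j}" for j in range(50)],
)
rows = ["sample_3", "sample_150"]
columns = ["gene_7", "gene_42"]

with tempfile.TemporaryDirectory() as tmp:
    tsv_path = os.path.join(tmp, "matrix.tsv.bz2")
    npy_path = os.path.join(tmp, "matrix.npy")
    matrix.to_csv(tsv_path, sep="\t")  # bz2 compression inferred from the extension
    np.save(npy_path, matrix.to_numpy())

    def read_full_tsv():
        # Baseline: decompress and parse everything, then select.
        full = pd.read_csv(tsv_path, sep="\t", index_col=0)
        return full.loc[rows, columns]

    # Integer positions of the requested labels, computed once up front.
    row_idx = [matrix.index.get_loc(label) for label in rows]
    col_idx = [matrix.columns.get_loc(label) for label in columns]

    def read_memmap_subset():
        # Candidate: map the file and copy out only the requested cells.
        values = np.load(npy_path, mmap_mode="r")
        return np.asarray(values[np.ix_(row_idx, col_idx)])

    timings = {
        "bz2 TSV, full read": best_time(read_full_tsv),
        "npy memmap, subset": best_time(read_memmap_subset),
    }
    baseline_subset = read_full_tsv().to_numpy()
    candidate_subset = read_memmap_subset()

for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s")
```

Timings from a matrix this small say little about real datasets; the same harness would need to be rerun with realistic dimensions and different numbers of requested rows and columns to cover the variety of setups mentioned above.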

dhimmel, Jul 25 '16 21:07