cancer-data
Persistent storage of matrices that enables quick indexed lookup
Currently, we're storing our datasets (which are matrices) as compressed TSVs, which are great for long-term interoperable storage. However, we'd love a way to look up specific rows and columns without having to read the entire dataset. We began discussing options at https://github.com/cognoma/cognoma/issues/17#issuecomment-233149110. We want a persistent storage format (i.e. a file) that allows reading only specified rows and columns into a numpy array/matrix or a pandas dataframe.
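To make the requirement concrete, here is a minimal sketch of one possible approach, not a settled choice: store the values in a numpy `.npy` file, keep the row and column labels in a JSON sidecar, and memory-map the file so that only the requested entries are read. The helper names `save_matrix` and `load_subset`, the file layout, and the assumption that labels are strings and values are floats are all hypothetical.

```python
import json
import os
import tempfile

import numpy as np
import pandas as pd


def save_matrix(df, prefix):
    """Write values to <prefix>.npy and labels to <prefix>.json (hypothetical layout)."""
    np.save(prefix + ".npy", df.to_numpy(dtype="float64"))
    with open(prefix + ".json", "w") as handle:
        json.dump({"rows": list(df.index), "columns": list(df.columns)}, handle)


def load_subset(prefix, rows, columns):
    """Return a DataFrame holding only the requested rows and columns."""
    with open(prefix + ".json") as handle:
        labels = json.load(handle)
    # Translate labels into integer positions within the stored matrix.
    row_pos = {label: i for i, label in enumerate(labels["rows"])}
    col_pos = {label: i for i, label in enumerate(labels["columns"])}
    row_idx = [row_pos[r] for r in rows]
    col_idx = [col_pos[c] for c in columns]
    # mmap_mode="r" maps the file rather than loading the whole array into memory.
    matrix = np.load(prefix + ".npy", mmap_mode="r")
    values = np.asarray(matrix[np.ix_(row_idx, col_idx)])
    return pd.DataFrame(values, index=rows, columns=columns)


# Tiny made-up matrix, used only to exercise the two helpers.
demo = pd.DataFrame(
    np.arange(12.0).reshape(3, 4),
    index=["s1", "s2", "s3"],
    columns=["g1", "g2", "g3", "g4"],
)
prefix = os.path.join(tempfile.mkdtemp(), "expression")
save_matrix(demo, prefix)
subset = load_subset(prefix, rows=["s3", "s1"], columns=["g2", "g4"])
print(subset)
```

This layout is uncompressed, so it trades disk space for lookup speed. Formats with built-in chunking and compression would be other candidates to evaluate.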
A primary benchmark for judging implementations is how much time they save, across a variety of setups, compared with reading the entire bzipped TSV into Python via pandas.
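The baseline side of that benchmark could be timed roughly as follows. This is a sketch under assumptions: the function name `time_full_read`, the file name, and the small synthetic matrix are hypothetical stand-ins for the real datasets, and a single timing run is shown where a real comparison would repeat the measurement.

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd


def time_full_read(path, rows, columns):
    """Read the whole bzipped TSV, then select the subset; return (subset, seconds)."""
    start = time.perf_counter()
    full = pd.read_csv(path, sep="\t", index_col=0, compression="bz2")
    subset = full.loc[rows, columns]
    elapsed = time.perf_counter() - start
    return subset, elapsed


# Synthetic stand-in for a real dataset: 200 samples by 50 genes.
matrix = pd.DataFrame(
    np.arange(200 * 50).reshape(200, 50),
    index=["sample_%d" % i for i in range(200)],
    columns=["gene_%d" % j for j in range(50)],
)
path = os.path.join(tempfile.mkdtemp(), "matrix.tsv.bz2")
matrix.to_csv(path, sep="\t", compression="bz2")

baseline_subset, baseline_seconds = time_full_read(
    path, rows=["sample_5", "sample_150"], columns=["gene_3", "gene_40"]
)
print("full read + subset took %.4f s" % baseline_seconds)
```

A candidate format would be timed on the same row and column requests, and the saving reported as the difference from (or ratio to) `baseline_seconds`.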