
Consider using pandas as the in-memory data store

Open: jamescasbon opened this issue 12 years ago • 10 comments

Please see: http://nbviewer.ipython.org/4678244/

Advantages are the ability to do many fast numpy-based operations and to output to many different formats (HDF5, CSV, Excel).

This requires that we rewrite the parser quite extensively, ideally so that it outputs numpy structures.
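For illustration, a minimal sketch of the kind of workflow the linked notebook demonstrates, using the current PyVCF reader; the file path and the chosen fields are placeholders:

```python
import vcf
import pandas as pd

# Flatten PyVCF records into a pandas DataFrame.
reader = vcf.Reader(filename='example.vcf')
rows = [{'CHROM': rec.CHROM, 'POS': rec.POS, 'REF': rec.REF,
         'ALT': ','.join(str(a) for a in rec.ALT), 'QUAL': rec.QUAL}
        for rec in reader]
df = pd.DataFrame(rows)

# numpy-backed operations and multi-format output then come for free:
print(df.groupby('CHROM')['QUAL'].mean())
df.to_csv('variants.csv', index=False)   # also df.to_hdf, df.to_excel
```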

I'm very tempted.

jamescasbon avatar Jan 30 '13 23:01 jamescasbon

Any sense of how the memory footprint would scale with pandas?

arq5x avatar Jan 30 '13 23:01 arq5x

I really like this idea! However, Aaron raises a valid concern.

I think pandas really needs everything in memory (I don't know of a way to use numpy.memmap as a backend, but I also think it wouldn't really help us here). This is of course not an option for very large files (e.g. 1000 Genomes files can easily be > 25G), and if you just want to parse such files to store the data elsewhere, you certainly don't gain anything by loading it all into one data structure first.

To make things worse, I don't know of a good way (aside from the very inefficient pandas.concat) to construct a DataFrame iteratively. James' example reads everything into memory first and I see no way around that (but I'm no pandas expert). See e.g. pydata/pandas#2305.
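To make the concat cost concrete, here is a small sketch on toy data: growing a DataFrame by repeated pandas.concat recopies everything accumulated so far on each step, whereas building the pieces first and concatenating once avoids that.

```python
import pandas as pd

# Toy stand-in for batches of parsed VCF rows.
chunks = [[{'POS': i, 'QUAL': float(i)} for i in range(j, j + 1000)]
          for j in range(0, 10000, 1000)]

# Anti-pattern: quadratic copying, since each concat copies all prior data.
df = pd.DataFrame()
for chunk in chunks:
    df = pd.concat([df, pd.DataFrame(chunk)])

# Better: build the pieces, then concatenate once at the end.
df = pd.concat((pd.DataFrame(c) for c in chunks), ignore_index=True)
```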

Perhaps another way of going about this is to make it easy to get your data into pandas, without using it as our backend.

martijnvermaat avatar Jan 31 '13 09:01 martijnvermaat

This feels like a layer on top of raw parsing? The notebook shows some nice examples of the power and simplicity of the pandas framework, though.

The lack of a fast way to build a DataFrame iteratively is a limitation shared with R. In Bioconductor, we use chunked iterators, but I personally find the paradigm a bit clunky at times.

seandavi avatar Jan 31 '13 12:01 seandavi

Here is my current thinking. We don't support fetching less than a single line of VCF, and I don't think we need to. So perhaps when we parse a line we can return a numpy array where element ij is the data for sample i, format j, which may itself be an array. This is probably the lowest possible memory use for representing that data. Existing method access to the data can be a view on the array. We would then offer an API that can stitch these blocks together into a pandas DataFrame (a sketch follows the list of issues below).

Issues:

1. The array is actually ragged, in that a VCF entry need not define all the keys in its FORMAT field. This can be handled with masked arrays, but that is more complex and less natural.
2. What dtype for the GT entry?
3. Variable-length arrays in the FORMAT data. Maybe the object dtype can be used here.
4. You can't pass an array that has nested arrays straight into pandas. It would need to be reprocessed into several blocks.
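A hypothetical sketch of the per-record block described above. The FORMAT keys, sample names, and values are placeholders, and a plain object dtype stands in for the masked-array handling of raggedness:

```python
import numpy as np

# Assumed inputs: FORMAT keys from one record, sample names from the header.
formats = ['GT', 'DP', 'GQ']
samples = ['NA00001', 'NA00002']

# One block per VCF line: element ij holds the value for sample i, format j.
block = np.empty((len(samples), len(formats)), dtype=object)
block[0] = ['0|0', 14, 48]
block[1] = ['1|0', 11, 43]

# Named access can be a thin lookup (a view) on the array:
dp = block[:, formats.index('DP')]
```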

jamescasbon avatar Feb 14 '13 08:02 jamescasbon

Right now we move everything from the VCF into pandas in a similar way to what @jamescasbon proposes above. The memory usage goes up quite quickly; however, we didn't try to pick a dtype for the objects, leaving pandas to guess whatever was closest.

I was wondering if there is any kind of news here?

mattions avatar Dec 16 '14 10:12 mattions

If there is any continued interest in this, it is possible to create a "lazy" DataFrame by using chunking: http://pandas.pydata.org/pandas-docs/dev/io.html#iterating-through-files-chunk-by-chunk

An example applied to VCF files: http://nbviewer.ipython.org/github/erscott/pandasVCF/blob/master/ipynb/single_sample_ex.ipynb
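A minimal sketch of that chunked approach, assuming a tab-delimited single-sample file. Column names are supplied by hand because comment='#' also discards the '#CHROM' header line:

```python
import pandas as pd

cols = ['CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL',
        'FILTER', 'INFO', 'FORMAT', 'SAMPLE']

# Each iteration yields a DataFrame of at most `chunksize` rows,
# so the whole file never has to fit in memory at once.
reader = pd.read_csv('example.vcf', sep='\t', comment='#',
                     header=None, names=cols, chunksize=10000)

for chunk in reader:
    high_qual = chunk[chunk['QUAL'] > 30]
    # ...aggregate or write out per chunk...
```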

averagehat avatar May 04 '15 15:05 averagehat

I like the VCF chunk idea; I think these approaches are worth exploring.

mattions avatar May 27 '15 10:05 mattions

+1 for pandas or dask. There are partial implementations of these in scikit-allel, for reference.

Here's another approach that assumes the 1 or 2 samples of interest are known up front, then builds a pandas DataFrame from the limited fields of interest using PyVCF.
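A hedged sketch of that "known samples, limited fields" idea via PyVCF's per-sample access; the sample names, fields, and path below are placeholders:

```python
import vcf
import pandas as pd

reader = vcf.Reader(filename='example.vcf')
wanted = ['NA00001', 'NA00002']          # samples known up front

rows = []
for rec in reader:
    row = {'CHROM': rec.CHROM, 'POS': rec.POS}
    for name in wanted:
        call = rec.genotype(name)        # PyVCF per-sample call
        row[name + '_GT'] = call['GT']
        row[name + '_DP'] = getattr(call.data, 'DP', None)  # DP may be absent
    rows.append(row)

df = pd.DataFrame(rows)
```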

etal avatar Nov 18 '16 00:11 etal

Just passing by... it looks like one can put a memmapped array behind pandas' handy functionality. I'll just leave this here: http://hilpisch.com/TPQ_Out_of_Memory_Analytics.html#/5
Goodbye and good luck!

RGD2 avatar Dec 13 '16 23:12 RGD2

Load and save pandas data with memmapping: https://gist.github.com/luispedro/7887214
Comprehensive guide to "out of memory analytics" with pandas using a memory-mapped, file-backed array: http://hilpisch.com/TPQ_Out_of_Memory_Analytics.html#/5
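As a rough sketch of the idea in those links: back a single-dtype DataFrame with a file-mapped numpy array so the OS pages data in on demand. The file name and shape are placeholders, and whether pandas truly avoids the copy depends on the pandas version:

```python
import numpy as np
import pandas as pd

# Create (or open with mode='r+') a file-backed array.
mm = np.memmap('scores.dat', dtype='float64', mode='w+',
               shape=(1_000_000, 4))

# copy=False asks pandas to wrap the array rather than copy it.
df = pd.DataFrame(mm, columns=['A', 'B', 'C', 'D'], copy=False)
total = df['A'].sum()    # computed over memory-mapped data
```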

RGD2 avatar Dec 13 '16 23:12 RGD2