
Consider using pandas as the in-memory data store

Open: jamescasbon opened this issue 12 years ago • 10 comments

Please see: http://nbviewer.ipython.org/4678244/

Advantages are the ability to do many fast numpy-based operations and to output to many different formats (HDF5, CSV, Excel).

This requires that we rewrite the parser quite extensively, ideally so that it outputs numpy structures.
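For illustration, a minimal sketch of the kind of workflow the linked notebook demonstrates, using the current PyVCF reader; the file path and the chosen fields are placeholders:

```python
import vcf
import pandas as pd

# Flatten PyVCF records into a pandas DataFrame.
reader = vcf.Reader(filename='example.vcf')
rows = [{'CHROM': rec.CHROM, 'POS': rec.POS, 'REF': rec.REF,
         'ALT': ','.join(str(a) for a in rec.ALT), 'QUAL': rec.QUAL}
        for rec in reader]
df = pd.DataFrame(rows)

# numpy-backed operations and multi-format output then come for free:
print(df.groupby('CHROM')['QUAL'].mean())
df.to_csv('variants.csv', index=False)   # also df.to_hdf, df.to_excel
```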

I'm very tempted.

jamescasbon avatar Jan 30 '13 23:01 jamescasbon

Any sense of how the memory footprint would scale with pandas?

arq5x avatar Jan 30 '13 23:01 arq5x

I really like this idea! However, Aaron raises a valid concern.

I think pandas really needs everything in memory (I don't know of a way to use numpy.memmap as a backend, but I also think it wouldn't really help us here). This is of course not an option for very large files (e.g. 1000 Genomes files can easily be > 25G), and if you just want to parse such files to store the data elsewhere, you certainly don't gain anything by loading it all into one data structure first.

To make things worse, I don't know of a good way (aside from the very inefficient pandas.concat) to construct a DataFrame iteratively. James' example reads everything into memory first and I see no way around that (but I'm no pandas expert). See e.g. pydata/pandas#2305.
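To make the concat cost concrete, here is a small sketch on toy data: growing a DataFrame by repeated pandas.concat recopies everything accumulated so far on each step, whereas building the pieces first and concatenating once avoids that.

```python
import pandas as pd

# Toy stand-in for batches of parsed VCF rows.
chunks = [[{'POS': i, 'QUAL': float(i)} for i in range(j, j + 1000)]
          for j in range(0, 10000, 1000)]

# Anti-pattern: quadratic copying, since each concat copies all prior data.
df = pd.DataFrame()
for chunk in chunks:
    df = pd.concat([df, pd.DataFrame(chunk)])

# Better: build the pieces, then concatenate once at the end.
df = pd.concat((pd.DataFrame(c) for c in chunks), ignore_index=True)
```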

Perhaps another way of going about this is to make it easy to get your data into pandas, without using it as our backend.

martijnvermaat avatar Jan 31 '13 09:01 martijnvermaat

This feels like a layer on top of raw parsing? The notebook shows some nice examples of the power and simplicity of the pandas framework, though.

The lack of a fast way to build a DataFrame iteratively is a limitation shared with R. In Bioconductor, we use chunked iterators, but I personally find the paradigm a bit clunky at times.

seandavi avatar Jan 31 '13 12:01 seandavi

Here is my current thinking. We don't support fetching less than a single line of VCF, and I don't think we need to. So perhaps when we parse a line we can return a numpy array where element ij is the data for sample i, format j, which may itself be an array. This is probably the lowest possible memory use for representing that data. Existing method access to the data can be a view on the array. We would then offer an API that can stitch these blocks together into a pandas DataFrame (a sketch follows the list of issues below).

Issues:

1. The array is actually ragged, in that a VCF entry need not define all the keys in its FORMAT field. This can be handled with masked arrays, but that is more complex and less natural.
2. What dtype for the GT entry?
3. Variable-length arrays in the FORMAT data. Maybe the object dtype can be used here.
4. You can't pass an array that has nested arrays straight into pandas. It would need to be reprocessed into several blocks.
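A hypothetical sketch of the per-record block described above. The FORMAT keys, sample names, and values are placeholders, and a plain object dtype stands in for the masked-array handling of raggedness:

```python
import numpy as np

# Assumed inputs: FORMAT keys from one record, sample names from the header.
formats = ['GT', 'DP', 'GQ']
samples = ['NA00001', 'NA00002']

# One block per VCF line: element ij holds the value for sample i, format j.
block = np.empty((len(samples), len(formats)), dtype=object)
block[0] = ['0|0', 14, 48]
block[1] = ['1|0', 11, 43]

# Named access can be a thin lookup (a view) on the array:
dp = block[:, formats.index('DP')]
```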

jamescasbon avatar Feb 14 '13 08:02 jamescasbon

Right now we move everything from the VCF into pandas in a similar way to what @jamescasbon proposes above. The memory usage goes up quite quickly; however, we didn't try to pick a dtype for the objects, leaving pandas to guess whatever was closest.

I was wondering if there is any kind of news here?

mattions avatar Dec 16 '14 10:12 mattions

If there is any continued interest in this, it is possible to create a "lazy" DataFrame by using chunking: http://pandas.pydata.org/pandas-docs/dev/io.html#iterating-through-files-chunk-by-chunk

An example applied to VCF files: http://nbviewer.ipython.org/github/erscott/pandasVCF/blob/master/ipynb/single_sample_ex.ipynb
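A minimal sketch of that chunked approach, assuming a tab-delimited single-sample file. Column names are supplied by hand because comment='#' also discards the '#CHROM' header line:

```python
import pandas as pd

cols = ['CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL',
        'FILTER', 'INFO', 'FORMAT', 'SAMPLE']

# Each iteration yields a DataFrame of at most `chunksize` rows,
# so the whole file never has to fit in memory at once.
reader = pd.read_csv('example.vcf', sep='\t', comment='#',
                     header=None, names=cols, chunksize=10000)

for chunk in reader:
    high_qual = chunk[chunk['QUAL'] > 30]
    # ...aggregate or write out per chunk...
```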

averagehat avatar May 04 '15 15:05 averagehat

I like the VCF chunk idea; I think these approaches are worth exploring.

mattions avatar May 27 '15 10:05 mattions

+1 for pandas or dask. There are partial implementations of these in scikit-allel, for reference.

Here's another approach that assumes the 1 or 2 samples of interest are known up front, then builds a pandas DataFrame from the limited fields of interest using PyVCF.
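A hedged sketch of that "known samples, limited fields" idea via PyVCF's per-sample access; the sample names, fields, and path below are placeholders:

```python
import vcf
import pandas as pd

reader = vcf.Reader(filename='example.vcf')
wanted = ['NA00001', 'NA00002']          # samples known up front

rows = []
for rec in reader:
    row = {'CHROM': rec.CHROM, 'POS': rec.POS}
    for name in wanted:
        call = rec.genotype(name)        # PyVCF per-sample call
        row[name + '_GT'] = call['GT']
        row[name + '_DP'] = getattr(call.data, 'DP', None)  # DP may be absent
    rows.append(row)

df = pd.DataFrame(rows)
```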

etal avatar Nov 18 '16 00:11 etal

Just passing by... it looks like one can put a memmapped array behind pandas' handy functionality. I'll just leave this here: http://hilpisch.com/TPQ_Out_of_Memory_Analytics.html#/5
Goodbye and good luck!

RGD2 avatar Dec 13 '16 23:12 RGD2

Load and save pandas data with memmapping: https://gist.github.com/luispedro/7887214
Comprehensive guide to "out of memory analytics" with pandas using a memory-mapped, file-backed array: http://hilpisch.com/TPQ_Out_of_Memory_Analytics.html#/5
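As a rough sketch of the idea in those links: back a single-dtype DataFrame with a file-mapped numpy array so the OS pages data in on demand. The file name and shape are placeholders, and whether pandas truly avoids the copy depends on the pandas version:

```python
import numpy as np
import pandas as pd

# Create (or open with mode='r+') a file-backed array.
mm = np.memmap('scores.dat', dtype='float64', mode='w+',
               shape=(1_000_000, 4))

# copy=False asks pandas to wrap the array rather than copy it.
df = pd.DataFrame(mm, columns=['A', 'B', 'C', 'D'], copy=False)
total = df['A'].sum()    # computed over memory-mapped data
```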

RGD2 avatar Dec 13 '16 23:12 RGD2