
Opening up "libpandas" for use by other projects

Open ibab opened this issue 8 years ago • 3 comments

I really like the proposal so far 👍 Are there any plans to provide a semi-stable C++/Cython API that could be used by other projects for things beyond simple pandas integration? (As in being able to write your own alternative to DataFrame)

Specifically, I'm thinking of writing a "pandas for structured data" that allows you to perform in-memory queries on records with optional and repeating fields, using the trick from the Dremel paper to store the data as contiguous arrays. It would be cool if I could make use of some of the interfaces in this proposal to simplify and speed up interop with pandas.
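To make the "contiguous arrays" idea concrete, here is a minimal sketch of Dremel-style shredding for a single repeated integer field. It is only an illustration: the function names (`shred_repeated`, `assemble`), the `clicks` field, and the sample records are hypothetical, and the sketch handles just one level of repetition rather than arbitrary nesting.

```python
from array import array


def shred_repeated(records, field):
    """Shred one repeated integer field into three contiguous columns."""
    values = array("q")      # the non-null values, stored back to back
    rep_levels = array("b")  # 0 = entry starts a new record, 1 = continues its list
    def_levels = array("b")  # 0 = field absent or empty, 1 = a value is present
    for record in records:
        items = record.get(field) or []
        if not items:
            # A record with no values still gets one placeholder entry.
            rep_levels.append(0)
            def_levels.append(0)
            continue
        for i, item in enumerate(items):
            rep_levels.append(0 if i == 0 else 1)
            def_levels.append(1)
            values.append(item)
    return values, rep_levels, def_levels


def assemble(values, rep_levels, def_levels):
    """Rebuild the per-record lists from the shredded columns."""
    records = []
    remaining = iter(values)
    for rep, dfn in zip(rep_levels, def_levels):
        if rep == 0:
            records.append([])
        if dfn == 1:
            records[-1].append(next(remaining))
    return records


# Hypothetical sample data: "clicks" is optional and repeating.
records = [{"id": 1, "clicks": [10, 20]}, {"id": 2}, {"id": 3, "clicks": [30]}]
values, rep_levels, def_levels = shred_repeated(records, "clicks")
roundtrip = assemble(values, rep_levels, def_levels)
```

The values end up in one contiguous buffer, while the two small level arrays carry enough information to reconstruct which record each value belongs to.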

Another example is xarray, which is inspired by pandas, but exposes an alternative data structure for working with N-dimensional datasets.

ibab, Aug 25 '16 18:08

In theory, yes, though I would love to see that "alternative dataframe" (more of a real "database table"-like object) built inside pandas. C++ API stability for third-party libraries may take a little time, but before that happens it would be useful to understand the kinds of requirements you might have for using libpandas as a C++ library dependency.

wesm, Aug 25 '16 19:08

The idea is to keep it as similar to the pandas API as possible, for example by having the `.query` method intelligently prune the records, similar to what's possible with the SQL-like API described in the Dremel paper. "Flattening" the structure of the stored records should simply produce a pandas DataFrame.
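As a rough sketch of what "flattening" could mean here, the snippet below turns nested records into one flat row per repeated value. The `flatten` helper, the `clicks` field, and the choice to keep records with no values as a single row holding `None` are all assumptions made for illustration, not part of any proposed API. Passing the resulting list of dicts to `pandas.DataFrame(rows)` would give an ordinary DataFrame.

```python
def flatten(records, repeated_field):
    """Return one flat row per value of `repeated_field`."""
    rows = []
    for record in records:
        # Copy the non-repeated fields onto every output row.
        parent = {k: v for k, v in record.items() if k != repeated_field}
        # Assumption: a record without values yields one row with None.
        items = record.get(repeated_field) or [None]
        for item in items:
            rows.append({**parent, repeated_field: item})
    return rows


# Hypothetical sample data: "clicks" is optional and repeating.
records = [{"id": 1, "clicks": [10, 20]}, {"id": 2}, {"id": 3, "clicks": [30]}]
rows = flatten(records, "clicks")
```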

Alternatively, I'm thinking of basing the library on Arrow, but being able to manipulate pandas objects directly could have a lot of benefits.

ibab, Aug 25 '16 19:08

I'm interested in evaluating the Arrow memory layout for nested data structures. If you don't require mutation, then the "fully shredded" columnar layout is fairly ideal for Dremel-style analytics: any values that you scan will generally be contiguous in memory, and repeated-value flattening can be done without copying memory.
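A small sketch of why flattening can avoid copies in this kind of layout, using only the Python standard library. It assumes a list column stored as a contiguous value buffer plus an offsets array, where record `i` owns `values[offsets[i]:offsets[i + 1]]`. The names `record_slice` and `flat` and the sample numbers are hypothetical, and this is not the Arrow API itself.

```python
from array import array

# Contiguous values for a repeated field, plus per-record offsets.
values = array("q", [10, 20, 30])
offsets = array("q", [0, 2, 2, 3])  # three records: [10, 20], [], [30]

# A memoryview exposes the buffer without copying it.
view = memoryview(values)


def record_slice(i):
    """Zero-copy view of the values that belong to record i."""
    return view[offsets[i]:offsets[i + 1]]


# Flattening the repeated field is just the value buffer itself.
flat = view

first = record_slice(0)
# Written here only to show that the slice shares memory with `values`.
values[0] = 99
```

After the write, `first[0]` reads `99`, which shows that the per-record slice and the flattened view both refer to the same underlying memory.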

wesm, Aug 26 '16 00:08