pandas2
Opening up "libpandas" for use by other projects
I really like the proposal so far 👍
Are there any plans to provide a semi-stable C++/Cython API that could be used by other projects for things beyond simple pandas integration? (As in being able to write your own alternative to DataFrame.)
Specifically, I'm thinking of writing a "pandas for structured data" that allows you to perform in-memory queries on records with optional and repeating fields, using the trick from the Dremel paper to store the data as contiguous arrays. It would be cool if I could make use of some of the interfaces in this proposal to simplify and speed up interop with pandas.
Another example is xarray, which is inspired by pandas, but exposes an alternative data structure for working with N-dimensional datasets.
In theory, yes, though I would love to see that "alternative dataframe" (more of a real "database table"-like object) built inside pandas. C++ API stability for third-party libraries may take a little time, but before that happens it would be useful to understand the kinds of requirements you might have for using libpandas as a C++ library dependency.
The idea is to keep it as similar to the pandas API as possible, for example by having the .query method intelligently prune the records, similar to what's possible with the SQL-like API described in the Dremel paper.
"Flattening" the structure of the stored records should simply produce a pandas DataFrame.
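To make the flattening idea concrete, here is a minimal sketch, assuming a simplified offsets-based layout rather than Dremel's actual repetition/definition levels. The record shape and the helper names `shred` and `flatten` are hypothetical and not part of pandas or any proposed libpandas API.

```python
import numpy as np
import pandas as pd

# Hypothetical records: "tags" is a repeating field, "score" is optional.
records = [
    {"id": 1, "score": 0.5, "tags": ["a", "b"]},
    {"id": 2, "tags": []},
    {"id": 3, "score": 1.5, "tags": ["c"]},
]

def shred(records):
    """Split records into contiguous per-field arrays (illustrative only)."""
    ids = np.array([r["id"] for r in records], dtype=np.int64)
    # A missing optional field is stored as NaN.
    scores = np.array([r.get("score", np.nan) for r in records], dtype=np.float64)
    # offsets[i]:offsets[i + 1] delimits the tags that belong to record i.
    lengths = np.array([len(r["tags"]) for r in records], dtype=np.int64)
    offsets = np.concatenate(([0], np.cumsum(lengths)))
    tag_values = np.array([t for r in records for t in r["tags"]], dtype=object)
    return ids, scores, offsets, tag_values

def flatten(ids, scores, offsets, tag_values):
    """Produce one DataFrame row per repeated value."""
    # Index of the parent record for every stored tag.
    parent = np.repeat(np.arange(len(ids)), np.diff(offsets))
    return pd.DataFrame(
        {"id": ids[parent], "score": scores[parent], "tag": tag_values}
    )

flat = flatten(*shred(records))
```

In this sketch a record with no tags (id 2) contributes no rows to the flattened frame; keeping such records as a single row with a missing tag would be an equally reasonable design choice.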
Alternatively, I'm thinking of basing the library on Arrow, but being able to manipulate pandas objects directly could have a lot of benefits.
I'm interested in evaluating the Arrow memory layout for nested data structures. If you don't require mutation, then the "fully shredded" columnar layout is close to ideal for Dremel-style analytics: any values that you scan will generally be contiguous in memory, and repeated-value flattening can be done without copying memory.
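The contiguity and no-copy claims can be illustrated with plain NumPy. This is only a sketch that imitates an offsets-plus-values list column; it does not use the Arrow libraries, and the names `values`, `offsets`, and `record_values` are assumptions made for the example.

```python
import numpy as np

# One repeated integer field for three records, stored as a single
# contiguous values array plus an offsets array (record 1 is empty).
values = np.array([10, 20, 30, 40, 50], dtype=np.int64)
offsets = np.array([0, 2, 2, 5], dtype=np.int64)

def record_values(i):
    """Values of record i, returned as a view into the shared buffer."""
    return values[offsets[i]:offsets[i + 1]]

# Flattening the repeated field is just a slice of the values buffer,
# so no memory is copied.
flat = values[offsets[0]:offsets[-1]]
shares_buffer = np.shares_memory(flat, values)

# A scan such as a per-record sum reads the contiguous buffer once.
running = np.concatenate(([0], np.cumsum(values)))
sums = running[offsets[1:]] - running[offsets[:-1]]
```

Because basic slicing returns views, both the per-record access and the flattened column refer to the same underlying memory as `values`.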