Potentially relevant usage patterns / targets for a developer-focused API
In other issues we find some detailed analyses of how the pandas API is used today, e.g. gh-3 (on Kaggle notebooks) and in https://github.com/data-apis/python-record-api/tree/master/data/api (for a set of well-known packages). That data is either not relevant for a developer-focused API though, or is so detailed that it's hard to get a good feel for what's important. So I thought it'd be useful to revisit the topic. I used https://libraries.io/pypi/pandas and looked at some of the top repos that declare a dependency on pandas.
Top 10 listed:
Seaborn
Perhaps the most interesting pandas usage. It's a hard dependency and is used a fair amount, for more than just data access; however, it all still seems fairly standard and common, so it may be a reasonable target to make work with multiple libraries. Uses a lot of isinstance checks (on pd.DataFrame, pd.Series).
- seaborn/_core.py: Series, to_numeric
- seaborn/matrix.py: DataFrame, isnull, .index.equals, .columns.equals
- seaborn/utils.py: DataFrame, Categorical, notnull
- seaborn/regression.py: only pd.notnull
- seaborn/distributions.py: .values, .copy, .iloc, .loc, .reset_index, .index, set_index, MultiIndex.from_arrays, Index, Series, concat, merge
- seaborn/relational.py: DataFrame, merge, .rename
- seaborn/categorical.py: DataFrame, iteritems, Series, notnull, option_context, isnull, groupby, get_group
- seaborn/_statistics.py: only Series
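The isinstance-check pattern mentioned above can be sketched roughly like this (a hypothetical helper in seaborn's style, not seaborn's actual code):

```python
import pandas as pd

def as_series(data, name=None):
    # Hypothetical normalization helper: pass pandas objects through
    # unchanged, wrap anything else (dict, list, ndarray) in a Series.
    if isinstance(data, pd.Series):
        return data
    return pd.Series(data, name=name)

s = as_series([1, 2, 3], name="x")
```

It is exactly this kind of hard `isinstance(..., pd.Series)` branching that ties the code to pandas specifically.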
Folium
Just a single non-test usage, in pd.py:
```python
def validate_location(location):  # noqa: C901
    """..."""
    if isinstance(location, np.ndarray) \
            or (pd is not None and isinstance(location, pd.DataFrame)):
        location = np.squeeze(location).tolist()


def if_pandas_df_convert_to_numpy(obj):
    """Return a Numpy array from a Pandas dataframe.

    Iterating over a DataFrame has weird side effects, such as the first
    row being the column names. Converting to Numpy is more safe.
    """
    if pd is not None and isinstance(obj, pd.DataFrame):
        return obj.values
    else:
        return obj
```
PyJanitor
Interesting/unusual common pattern, which extends pd.DataFrame through pandas_flavor with either accessors or methods. E.g. from [janitor/biology.py](https://github.com/pyjanitor-devs/pyjanitor/blob/a6832d47d2cc86b0aef101bfbdf03404bba01f3e/janitor/biology.py):
```python
import pandas as pd
import pandas_flavor as pf


@pf.register_dataframe_method
def join_fasta(
    df: pd.DataFrame, filename: str, id_col: str, column_name: str
) -> pd.DataFrame:
    """
    Convenience method to join in a FASTA file as a column.
    """
    ...
    return df
```
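For illustration, what register_dataframe_method does can be approximated in a few lines (a deliberately simplified sketch; the real pandas_flavor implementation goes through pandas' accessor machinery and preserves metadata):

```python
import pandas as pd

def register_dataframe_method(func):
    # Simplified sketch of pandas_flavor's decorator: attach the
    # function to pd.DataFrame so it becomes callable as a method.
    setattr(pd.DataFrame, func.__name__, func)
    return func

@register_dataframe_method
def first_row(df):
    # Toy example method: return the first row as a Series.
    return df.iloc[0]

row = pd.DataFrame({"a": [1, 2], "b": [3, 4]}).first_row()
```

This makes the monkeypatching nature of the pattern explicit: the method is injected into another library's class at import time.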
Statsmodels
A huge amount of usage, using a large API surface in a messy way - not easy to do anything with or draw conclusions from.
NetworkX
Mostly just conversions to support pandas dataframes as input/output values. E.g., from convert.py and convert_matrix.py:
```python
def to_networkx_graph(data, create_using=None, multigraph_input=False):
    """Make a NetworkX graph from a known data structure."""
    # Pandas DataFrame
    try:
        import pandas as pd

        if isinstance(data, pd.DataFrame):
            if data.shape[0] == data.shape[1]:
                try:
                    return nx.from_pandas_adjacency(data, create_using=create_using)
                except Exception as err:
                    msg = "Input is not a correct Pandas DataFrame adjacency matrix."
                    raise nx.NetworkXError(msg) from err
            else:
                try:
                    return nx.from_pandas_edgelist(
                        data, edge_attr=True, create_using=create_using
                    )
                except Exception as err:
                    msg = "Input is not a correct Pandas DataFrame edge-list."
                    raise nx.NetworkXError(msg) from err
    except ImportError:
        warnings.warn("pandas not found, skipping conversion test.", ImportWarning)


def from_pandas_adjacency(df, create_using=None):
    try:
        df = df[df.index]
    except Exception as err:
        missing = list(set(df.index).difference(set(df.columns)))
        msg = f"{missing} not in columns"
        raise nx.NetworkXError("Columns must match Indices.", msg) from err

    A = df.values
    G = from_numpy_array(A, create_using=create_using)
    nx.relabel.relabel_nodes(G, dict(enumerate(df.columns)), copy=False)
    return G
```
And using the .drop method in group.py:
```python
def prominent_group(
    G, k, weight=None, C=None, endpoints=False, normalized=True, greedy=False
):
    import pandas as pd

    ...
    betweenness = pd.DataFrame.from_dict(PB)
    if C is not None:
        for node in C:
            # remove from the betweenness all the nodes not part of the group
            betweenness.drop(index=node, inplace=True)
            betweenness.drop(columns=node, inplace=True)
    CL = [node for _, node in sorted(zip(np.diag(betweenness), nodes), reverse=True)]
```
Perspective
A multi-language (streaming) viz and analytics library. The Python version uses pandas in core/pd.py. It uses a small but nontrivial amount of the API, including MultiIndex, CategoricalDtype, and time series functionality.
Scikit-learn
TODO: the usage of Pandas in scikit-learn is very much in flux, and more support for "dataframe in, dataframe out" is being added. So it did not seem to make much sense to just look at the code; rather, it makes sense to have a chat with the people doing the work there.
Matplotlib
Added because it comes up a lot. Matplotlib uses just a "dictionary of array-likes" approach, no dependence on pandas directly. So it will work today with other dataframe libraries as well, as long as their columns can convert to a numpy array.
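The lookup behind that approach can be sketched as follows (a simplified stand-in for matplotlib's internal handling, not its actual code): each string argument is resolved via `__getitem__` on the data object and then converted to a NumPy array, so any mapping of array-likes works.

```python
import numpy as np

def resolve_data_kwarg(data, *keys):
    # Simplified sketch of the "dictionary of array-likes" lookup:
    # no pandas dependency, just __getitem__ plus conversion to ndarray.
    return [np.asarray(data[k]) for k in keys]

table = {"x": [0, 1, 2], "y": [0.0, 1.0, 4.0]}  # a plain dict works
x, y = resolve_data_kwarg(table, "x", "y")
```

A pandas DataFrame satisfies the same contract, since `df["x"]` returns a column that `np.asarray` can consume, and so does any other dataframe library whose columns convert to NumPy arrays.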
Other libraries that were suggested as candidates to look into: Xarray, cuDF (utilities), PyJanitor (cleaning functionality, not the pandas_flavor domain-specific parts), https://github.com/sfu-db/dataprep
PyJanitor (non pandas_flavor code)
Not repeating DataFrame, Series and .columns, those are used a lot.
- utils.py: .iloc, RangeIndex, MultiIndex, .empty, Index
- functions/add_columns.py: .copy, .add_column
- functions/case_when.py: .assign, .mask, .index, Index, .nlevels, .ndim, .size, __len__
- functions/clean_names.py: .rename, .__dict__
- functions/coalesce.py: .filter, .bfill, .ffill, .assign
- functions/complete.py: .copy, .merge, .groupby, .apply, .droplevel, .loc, Index, MultiIndex
- functions/conditional_join.py: .loc, .index, .empty, .copy, RangeIndex, MultiIndex, index, append, .to_numpy, .dtypes, .items, .join
- functions/convert_date.py: to_datetime, .astype, .apply
- functions/count_cumulative_unique.py: .drop_duplicates, .assign, .cumsum, .index, .reindex, .ffill, .astype
- functions/currency_column_to_numeric.py: to_numeric, .loc, .assign, .apply
There's a ton more - it uses a fairly large part of the pandas API surface. Even in utils, a lot of the code is in functions that then get tacked onto pd.DataFrame with @pandas_flavor.register_dataframe_method. It does not seem like a great target for initial support via a developer-focused API. Detailed usage data is available at https://github.com/data-apis/python-record-api/blob/master/data/api/pyjanitor.json
Xarray
Detailed usage data is also available at https://github.com/data-apis/python-record-api/blob/master/data/api/xarray.json; that data and a cursory search through the Xarray code base for "import pandas" show that it uses an even larger API surface. A decent amount of that usage is in tests, which isn't actually relevant. This is one of the downsides of the automated analysis tooling: if one traces pandas API usage by running the Xarray test suite, it's hard to tell whether the public pandas API usage comes from the test files or the files under test. Pandas is still used in a lot of places though:
Note that Index is most commonly used, followed by Series and DataFrame; the listing below leaves them out of the results for some files.
- testing.py: Index
- conventions.py: MultiIndex, isnull, .any, __not__
- convert.py: isnull
- coding/times.py: Timestamp, to_timedelta, to_datetime, __version__, notnull, isnull, DatetimeIndex
- coding/frequencies.py: Series, DatetimeIndex, TimedeltaIndex, infer_freq
- coding/cftimeindex.py: Index, TimedeltaIndex
- coding/variables.py: isnull
- core/common.py: Index, Grouper
- core/nputils.py: isnull
- core/merge.py: Series, DataFrame, Panel, Index
- core/dataarray.py: Series, DataFrame, MultiIndex, Timedelta, isnull
- core/concat.py: unique
- core/resample_cftime.py: Series, .duplicated
- core/pdcompat.py: Panel
- core/accessor_dt.py: .dt
- core/duck_array_ops.py: Timedelta, to_timedelta, .astype
- core/utils.py: .factorize, MultiIndex, isnull
- core/variable.py: Timestamp, MultiIndex.names, MultiIndex.set_names
- core/indexing.py: MultiIndex + methods .nlevels/.get_loc/.get_loc_level, CategoricalIndex, PeriodIndex, NaT, Timestamp
- core/indexes.py: MultiIndex + method from_arrays, CategoricalIndex + method remove_unused_categories
- core/dataset.py: MultiIndex, Categorical + .codes/.categories
- core/groupby.py: factorize, DateOffset + .loffset, DatetimeIndex, cut, MultiIndex
- core/alignment.py: Index + .union, .intersection
- core/missing.py: isnull, MultiIndex, Timedelta, DatetimeIndex
- core/coordinates.py: MultiIndex.from_product
- core/formatting.py: isnull, Timestamp, Timedelta, .astype
- plot/dataset_plot.py: Interval
- plot/plot.py: notnull
There is a ton of isinstance usage (e.g. with the various index objects), because Xarray supports both its own container/index classes and pandas ones. Usage seems to be quite different from typical/idiomatic Pandas usage, because Xarray has pretty specific needs.
dataprep
https://github.com/sfu-db/dataprep doesn't seem suitable for analysis - it contains 212 files with pandas imports, a lot of them quite niche (example: a separate file for Albanian VAT number cleaning/validation).
Hey, very cool initiative — it would be great to be more agnostic to dataframe libraries.
I wanted to flag that seaborn is in the midst of a very extensive internal refactor, which means that the survey of pandas usage in the library is likely to be out of date after future releases.
But there's an upside: it's a perfect time to be revisiting how the pandas API is used in seaborn and to proactively think about working with a more general dataframe interface. I could see the ongoing work evolving in parallel with this project (hopefully in a way that's mutually beneficial).
Let me know if I can be helpful here!
Scikit-learn mostly treats a DataFrame as a "2D ndarray with column names". Only the OrdinalEncoder and OneHotEncoder treat the dataframe as "a collection of 1D arrays".
When scikit-learn's models start returning DataFrames, it will depend on the fact that there is a zero-copy round-trip from numpy: https://github.com/pandas-dev/pandas/issues/27211. In detail:
- The first model does its computation with ndarrays, and the result is converted to a DataFrame when returned.
- The DataFrame is passed into a second model, which internally converts the DataFrame into an ndarray for computation.

Scikit-learn requires that the 2d ndarray -> DataFrame -> 2d ndarray round-trip not make any copies, so no additional memory is allocated.
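Whether such a round-trip is actually zero-copy can be checked with numpy.shares_memory. A small sketch (illustrative rather than guaranteed: the exact behavior depends on the pandas version, the dtypes involved, and copy-on-write settings):

```python
import numpy as np
import pandas as pd

arr = np.arange(6, dtype=np.float64).reshape(3, 2)

# ndarray -> DataFrame: explicitly request no copy
df = pd.DataFrame(arr, copy=False)

# DataFrame -> 2d ndarray: for a homogeneous (single-dtype) frame
# this can be a view onto the original data
back = df.to_numpy()

# True when the round-trip avoided copies
zero_copy = np.shares_memory(arr, back)
```

Note that this only has a chance of working for a single-dtype frame; mixed dtypes force a copy on the way back out.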
Interesting, thanks for sharing @thomasjpfan.
> When scikit-learn's models start returning DataFrames, it will depend on the fact that there is a zero-copy round-trip from numpy: pandas-dev/pandas#27211.

The answers from the Pandas devs there are along the lines of what I'd expect: this isn't necessarily guaranteed in the future. That's more a "labeled array" use case, which is Xarray-like. Did anything change after that 2019 discussion @thomasjpfan, or is it more a "fingers crossed that Pandas doesn't change this"?
> Scikit-learn requires that 2d ndarray -> DataFrame -> 2d ndarray not make any copies so no additional memory is allocated
I think pragmatically there is likely to always be a way for Pandas to do this; scikit-learn is probably important enough that it could even get its own method for this if needed. Conceptually it's not a nice fit for a standardized dataframe behavior though; it only works for a subset of supported dtypes, and it's going to need support for a constructor which accepts 2-D arrays to begin with.
> is it more a "fingers crossed that Pandas doesn't change this"?
It's fingers crossed. I've seen a proposal for a 2D extension array, but I think there is a lot more momentum for 1d extension arrays & a columnar store.
I want to add: there are certain models, such as StandardScaler, that could treat the dataframe as "a collection of 1D arrays" but are not implemented that way yet. Other models, such as PCA, will always need to concat the 1D arrays into a 2D array to work.
> PyJanitor (cleaning functionality, not the pandas_flavor domain-specific parts)
Looks like they only really use rename here, which could easily be standardised
https://github.com/pyjanitor-devs/pyjanitor/blob/7ad98e3564f86534094e4eb425d85ff9a25a3679/janitor/functions/clean_names.py#L84-L106
The trickier part is this decorator, which also uses pandas_flavor:
https://github.com/pyjanitor-devs/pyjanitor/blob/7ad98e3564f86534094e4eb425d85ff9a25a3679/janitor/functions/clean_names.py#L11-L12
pyjanitor adds an extra clean_names method to the pandas DataFrame. How would they make use of the Standard - would they add such a method to all DataFrame objects that have some implementation of the standard?
Would the Standard need to require some decorator that can be used to register custom methods?
Would it actually be possible for pyjanitor to then register clean_names as a method for all libraries, without having to list them all explicitly? Asking because I don't know - although it strikes me as unlikely.
It looks to me like there are two separate things in PyJanitor:
1. Functionality implemented through code that calls pandas APIs (dataframe methods and attributes mostly, not just rename)
2. An unusual way of exposing its own PyJanitor API, namely injecting methods into the dataframe of another library, rather than providing standalone functions.
(2) looks motivated only by UX reasons (I could well be wrong here, not being an active user) - dataframe users tend to like methods over functions. It seems unhealthy to me, because one library monkeypatching another library is a big no-no in library design. Any df.new_meth(...) could have been new_func(df, ...) instead I think.
It's actually an interesting question whether (2) should be allowed through a registration mechanism, or whether it should be discouraged. I'd lean towards the latter, but then again I'm coming from a domain where a functional programming style is preferred over an object-oriented one. If dataframe library authors prefer the former, then a well-defined extension mechanism seems useful, even for PyJanitor + Pandas only.
OK true, their methods do work as functions too:
```python
In [2]: from janitor.functions.clean_names import clean_names

In [3]: df = pd.DataFrame({'A ': [1, 2, 3]})

In [4]: df
Out[4]:
   A
0  1
1  2
2  3

In [5]: clean_names(df)
Out[5]:
   a_
0  1
1  2
2  3
```
So, perhaps that's the part which the standard can target. It might be worthwhile to try taking a handful of functions from them, say:

- clean_names
- drop_constant_columns
- min_max_scale
Then try implementing the Standard for each DataFrame library, seeing if it's sufficient, and whether this would let pyjanitor "just work" on all of them if it was rewritten to use the standard API.
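As a concrete starting point, the rename-based core of clean_names only needs a tiny API surface. A hypothetical, library-agnostic sketch (pandas is used here only to demonstrate; the column transformation is simplified from what pyjanitor actually does):

```python
import pandas as pd

def clean_names(df):
    # Sketch: only needs `.columns` and `.rename` -- a small API surface
    # that any standard-compliant dataframe could provide.
    mapping = {c: str(c).lower().replace(" ", "_") for c in df.columns}
    return df.rename(columns=mapping)

out = clean_names(pd.DataFrame({"A ": [1, 2, 3]}))  # column becomes "a_"
```

Any dataframe library exposing `.columns` and `.rename` with these semantics could run this unchanged.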
> If dataframe library authors prefer the former, then a well-defined extension mechanism seems useful. Even for PyJanitor + Pandas only.
FWIW, for pandas itself this already exists (https://pandas.pydata.org/docs/dev/development/extending.html#registering-custom-accessors), and this is also what pyjanitor / pandas_flavor use under the hood (pandas_flavor adds some convenience layer on top of it).
Whether this would also be useful for a DataFrame standard is of course a different question. I think if our goal is to provide a developer-oriented standard API, this is much less needed.
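For reference, the official pandas mechanism looks like this (a minimal sketch using pandas' documented register_dataframe_accessor; the accessor name and method are made up for illustration):

```python
import pandas as pd

@pd.api.extensions.register_dataframe_accessor("geo")
class GeoAccessor:
    # Hypothetical accessor: pandas instantiates this lazily with the
    # DataFrame on first attribute access, i.e. df.geo.
    def __init__(self, pandas_obj):
        self._obj = pandas_obj

    def center(self):
        # Toy method: mean latitude/longitude of the frame.
        return (self._obj["lat"].mean(), self._obj["lon"].mean())

df = pd.DataFrame({"lat": [0.0, 10.0], "lon": [0.0, 20.0]})
center = df.geo.center()  # (5.0, 10.0)
```

Unlike raw monkeypatching, this namespaces the added functionality under a single attribute, which is what pandas_flavor builds on.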
Other tools which have been mentioned as potential targets:
- featuretools
- pandera
This one would be a good candidate, namely because they already support both pandas and polars: https://github.com/Kanaries/pygwalker
Well this is encouraging:
> Now, all pandas-specific logic is isolated to specific modules, where support for additional non-pandas-compliant schema specifications and their associated backends can be implemented either as 1st-party-maintained libraries (see issues for supporting https://github.com/unionai-oss/pandera/issues/1064 and https://github.com/unionai-oss/pandera/issues/1105) or 3rd party libraries.
https://github.com/unionai-oss/pandera/releases/tag/v0.14.0
Altair have added support for polars by using the interchange protocol: https://github.com/altair-viz/altair
pyarrow is required as a dependency for this to work though - with the standard, they could potentially support polars (and many others) without requiring extra deps? One to look into.
EDIT: I don't think altair is a good candidate, see #133
Dropping Dask for now, as they've said this wouldn't solve an actual pain-point of theirs
Anyway, https://github.com/feature-engine/feature_engine looks like a good candidate, and exactly the kind of library where this might be useful!
Here's a really good one
https://github.com/Nixtla/statsforecast/blob/c732a6101ce0c9daec886928e0f68371772fcccc/statsforecast/core.py#L540-L633
They literally have:

```python
if isinstance(self.dataframe, pl.DataFrame):
    # polars-specific logic
    ...
elif isinstance(self.dataframe, pd.DataFrame):
    # pandas-specific logic
    ...
else:
    raise
```
So yeah, really solid candidate here
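Until a standard exists, one way to at least centralize such branches is functools.singledispatch. A sketch with stand-in classes (PandasFrame / PolarsFrame are hypothetical placeholders, not the real pd.DataFrame / pl.DataFrame types):

```python
from functools import singledispatch

class PandasFrame:  # stand-in for pd.DataFrame
    def __init__(self, n):
        self.n = n

class PolarsFrame:  # stand-in for pl.DataFrame
    def __init__(self, n):
        self.n = n

@singledispatch
def n_rows(df):
    # Fallback for unsupported types, mirroring the bare `raise` above.
    raise TypeError(f"unsupported dataframe type: {type(df).__name__}")

@n_rows.register
def _(df: PandasFrame):
    return df.n  # pandas-specific logic would live here

@n_rows.register
def _(df: PolarsFrame):
    return df.n  # polars-specific logic would live here
```

This keeps the per-backend branching in one place per operation, but it still scales linearly with the number of backends - which is exactly what a standard API would remove.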
Another one, where they've already said that their objective is to support multiple dataframe backends: https://github.com/skrub-data/skrub
Others:
- scikit-lego
- tsfresh
- pandas-ta
hi all! pandera author here 👋, just wanted to drop a note here to say we're going to start investing resources in pandera-polars support: https://github.com/unionai-oss/pandera/issues/1064.
Not sure how far along this project is but would love to get some tips on how to design the polars validation backend as described in this mini-roadmap: https://github.com/unionai-oss/pandera/issues/1064#issuecomment-1584655803.
Was planning on forging ahead with polars-specific implementations for various things that pandera does during the validation pipeline (see anywhere there's a check_obj variable in the pandas backend as an example). If there's anything we should keep in mind as we build it out please add comments to that issue ^^, we'd really appreciate it!