pandas
pandas copied to clipboard
API: Add DataFrameGroupBy and SeriesGroupBy to the top level API
Currently one cannot run from pandas import DataFrameGroupBy, SeriesGroupBy
. We should add these classes to the top level as they are particularly important for type-hinting in user code. Users shouldn't need to know they are defined in pandas.core.groupby.generic
.
Somewhat related: #6944, #19302
makes sense for typing purposes, but does it make sense for users to use these directly?
I would say no - they should only be invoked via DataFrame.groupby
or Series.groupby
. We could introduce code that raises when third party code calls their __init__
by inspecting the stack, but I would prefer adding the classes to the API docs with the warning that they should not be instantiated directly.
I agree this is useful for typing purposes but otherwise not, so I'm -1 on putting them is the main namespace. I suggest they go into pd.api.types
namespace, either directly or in a is_groupby
function.
Thanks @topper-123. I don't believe is_groupby
is sufficient for users when it comes to type-hinting, e.g. def foo(gb: DataFrameGroupBy) -> pd.DataFrame: ...
. The only downside I see to putting them in pd.api.types
is it seems harder for users to discover, but perhaps that will change overtime. I also don't see the benefit of putting them here as opposed to top-level, can you comment on that @topper-123?
Hi @rhshadrach. I'm positive to adding the groupby classes to the public namespace and my only hesitance is whether to put them in the main namespace or not.
My feeling is that the main namespace is very crowded already and so should be added to only sparingly, and these two classes will not be instantiated directly from the main namespace, so it may be confusing having them there. pd.api.types
seems like that kind where we keep specialized typing functionality, so could fit there.
I think the idea of having the main namespace be "objects / functions users should create / use" is a meaningful one, so +1 on putting this somewhere else. Looking through the current entries, the only potential class I see that a user should likely not be constructing themselves is Flags
.
Currently, pd.api.types
contains classes and functions for working with dtypes. This would be an expansion of that focus which makes me hesitant here. I am wondering if it would be more appropriate to introduce pd.api.typing
for type-hinting related objects, namely classes that should not be directly instantiated and aliases.
+1 on somewhere in pd.api, I don't have a very strong preference on pd.api.types
or pd.api.typing
. Maybe just make update the PR with your preferred solution (pd.api.typing
?), remove the draft marker and see the response?
There are probably a few other classes that could be exposed publicly, e..g. Resampler
is one I can think of, but there are probably others also.
We should be careful here with respect to what we want to do in the pandas code and encouraging people to use pandas-stubs
.
The following code works and type checks just fine if you have pandas-stubs
installed:
import pandas as pd
from pandas.core.groupby.generic import DataFrameGroupBy
def myfun(dfgb: DataFrameGroupBy):
return dfgb.sum()
df = pd.DataFrame(
{
"Animal": ["Falcon", "Falcon", "Parrot", "Parrot"],
"Max Speed": [380.0, 370.0, 24.0, 26.0],
}
)
dfg: DataFrameGroupBy = df.groupby("Animal")
print(myfun(dfg))
So isn't this just an issue of making sure that types like DataFrameGroupby
and SeriesGroupby
that show up in the docs are properly documented and then people know how to do the imports?
It's also worth mentioning that in the example above, if you were to do a reveal_type(dfg)
with pandas-stubs
, you would get _DataFrameGroupByScalar
, because we do things in the stubs to return different "classes" based on the arguments passed to groupby()
.
I don't think this is a typing issue. I think this is more about that we have docs that return classes that are not fully documented, and if you wanted to use those classes in your own type declarations, then you have to know where to import them from.
So isn't this just an issue of making sure that types like
DataFrameGroupby
andSeriesGroupby
that show up in the docs are properly documented and then people know how to do the imports?
While this works, I do not think the current state is a good solution.
- It'd be better if users didn't have to go through the depths of the pandas code to find the objects they want to type-hint. Having a single place to pull these classes from for type-hinting is more user friendly.
- By having users import from various places in the pandas code, the locations become part of the API and therefore must be deprecated prior to moving. Having all users import from
pandas.api.typing
gives stability. - It is not clear to users which objects in e.g.
pandas.core
are public. In particular, if we ever want deprecatepandas.core
in favor ofpandas._core
, the current location of these classes is not viable.
It's also worth mentioning that in the example above, if you were to do a
reveal_type(dfg)
withpandas-stubs
, you would get_DataFrameGroupByScalar
, because we do things in the stubs to return different "classes" based on the arguments passed togroupby()
.
Indeed, thanks, I do find that interesting. But I don't see the connection to this issue - are there additional problems incurred if the user was to import DataFrameGroupBy
from pandas.api.typing
rather than pandas.core.groupby.generic
?
While this works, I do not think the current state is a good solution.
- It'd be better if users didn't have to go through the depths of the pandas code to find the objects they want to type-hint. Having a single place to pull these classes from for type-hinting is more user friendly.
- By having users import from various places in the pandas code, the locations become part of the API and therefore must be deprecated prior to moving. Having all users import from
pandas.api.typing
gives stability.- It is not clear to users which objects in e.g.
pandas.core
are public. In particular, if we ever want deprecatepandas.core
in favor ofpandas._core
, the current location of these classes is not viable.
I agree with you here, except I'm not convinced that this is only needed for typing, so I'm not sure that calling it pandas.api.typing
is correct. Maybe pandas.api.auxclasses
(or something like that) to indicate that these are auxiliary classes meant to hold some kind of intermediate results.
It's also worth mentioning that in the example above, if you were to do a
reveal_type(dfg)
withpandas-stubs
, you would get_DataFrameGroupByScalar
, because we do things in the stubs to return different "classes" based on the arguments passed togroupby()
.Indeed, thanks, I do find that interesting. But I don't see the connection to this issue - are there additional problems incurred if the user was to import
DataFrameGroupBy
frompandas.api.typing
rather thanpandas.core.groupby.generic
?
I don't think so, but if you put these classes under pandas.api.typing
, there might be an expectation that other classes in the stubs should go there as well. I think it's fine that we have classes in the stubs that are not in the pandas source. I just don't think you should be putting these in a typing
category, because it might cause confusion.
Thanks @Dr-Irv; I agree with your comments. I've updated the linked PR with aux
rather than your suggested auxclasses
as it seemed more concise. How does that sound?
Thanks @Dr-Irv; I agree with your comments. I've updated the linked PR with
aux
rather than your suggestedauxclasses
as it seemed more concise. How does that sound?
Works for me.
I wrote a script to take the classes in pandas that occur in the documentation but aren't in the public API. There are likely some missing classes as well as false positives here, I've crossed off the ones that I can identify and put the ones I'm not familiar enough with in bold.
In addition to the top-level and pandas.api I'm considering the following as already being public: pandas.arrays, pandas.tseries, pandas.tseries.offesets.
Code
import pandas as pd
import glob
import os
import importlib
print(pd.__file__)
public_objects = (
set(dir(pd))
| set(dir(pd.api.extensions))
| set(dir(pd.api.indexers))
| set(dir(pd.api.interchange))
| set(dir(pd.api.types))
| set(dir(pd.arrays))
| set(dir(pd.tseries))
| set(dir(pd.tseries.offsets))
| set(dir(pd.errors))
)
directory = os.path.dirname(pd.__file__)
files = glob.glob(os.path.join(directory, '**/*.py'), recursive=True)
files = [e[len(directory):] for e in files if not e.startswith('/home/richard/dev/pandas/pandas/tests/')]
files = ['pandas' + e.replace('/', '.') for e in files]
files = [e[:-len('.py')] for e in files]
def get_classes(path):
module = importlib.import_module(path)
names = [e for e in dir(module) if not e.startswith('_') and not e.startswith('ABC')]
result = []
for name in names:
obj = getattr(module, name)
if getattr(obj, '__module__', None) is None:
continue
if not obj.__module__.startswith('pandas'):
continue
if not isinstance(obj, type):
continue
result.append(name)
return result
objects = set()
for file in files:
objects |= set(get_classes(file))
nonpublic = sorted(objects - public_objects)
directory = os.path.abspath(os.path.dirname(os.path.join(pd.__file__, '../../')))
doc_files = glob.glob(os.path.join(directory, 'doc/**/*.rst'), recursive=True)
docs = ''
for filename in doc_files:
if '/whatsnew/' in filename:
continue
docs += '\n'.join(open(filename, 'r').readlines())
for obj in nonpublic:
if f' {str(obj)} ' in docs or f' {str(obj)}.' in docs:
print('-', str(obj))
- ~AbstractHolidayCalendar~; this is in
pandas.tseries.holiday
- ~Apply~
- ~BlockManager~
- ~Column~
- DataFrameGroupBy
- Expanding
- ExponentialMovingWindow
- ~GroupBy~
- JsonReader
- ~Parser~ This doesn't appear to be returned by any methods; in particular it doesn't appear in pandas-stubs
- ~Properties~ Same as line above
- Resampler
- Rolling
- SeriesGroupBy
- StataReader
- StataWriter
- Styler
- TimeGrouper
- ~Version~
- Window
- ~XlsxWriter~
- ~disallow~
ParserError ParserWarning
These are documented in doc/source/reference/testing.rst
and exposed in pandas/errors/__init__.py
Column
This is an object from the dataframe interchange protocol that I think should be an implementation detail.
You can also consider pandas.errors
as public, which should remove ParserError/Warning.
For XlsxWriter, pd.ExcelWriter
base class is already public, is that sufficient?
And pandas.io
is maybe also semi public (we have some usages of that in the docs)
Thanks @mroeschke / @jorisvandenbossche - I've updated the script to include pandas.errors
and remove the ParserError/Warning.
For XlsxWriter,
pd.ExcelWriter
base class is already public, is that sufficient?
If we agree that subclasses of ExcelWriter should not be allowed to add any public methods, then it would be sufficient. That looks to currently be the case, and I think that is a reasonable restriction on the subclasses. I've crossed out XlsxWriter and opened #49602.
And
pandas.io
is maybe also semi public (we have some usages of that in the docs)
The original motivation of this issue is to start making more clear to users what is / isn't public, so unless pandas.io
is fully public, I think it would be helpful to keep those items on the list above. This doesn't necessarily mean they get added to pandas.api.aux
or pandas.api.typing
, perhaps there is a different way to handle them.
I've edited the list above, crossing off a few classes that are purely internal.
From the November dev call, I recall there was support for pandas.api.typing
with the main purpose of exposing these classes for users typing their code as well as the similarity to numpy.typing
. There was also concern that this may cause users confusion with pandas-stubs and the typing classes there (as in https://github.com/pandas-dev/pandas/issues/48577#issuecomment-1262847817).
My take: while I agree there is potential for some confusion, I don't believe it to be very high, and would prefer pandas.api.typing
. However I'm also okay with pandas.api.aux
.
From the November dev call, I recall there was support for
pandas.api.typing
with the main purpose of exposing these classes for users typing their code as well as the similarity tonumpy.typing
. There was also concern that this may cause users confusion with pandas-stubs and the typing classes there (as in #48577 (comment)).
unfortunately I missed the November dev call. I disagree with the comparison to numpy.typing
because all typing for numpy is included in the package. For pandas, we are providing typing in a separate package.
My take: while I agree there is potential for some confusion, I don't believe it to be very high, and would prefer
pandas.api.typing
. However I'm also okay withpandas.api.aux
.
I’m concerned about setting a precedent if we use pandas.api.typing
.
If the purpose is really only for typing, than it seems most natural to have it at either pd.typing
or pd.api.typing
. This intended use doesn't really fit the definition of auxiliary, which sounds like a place for some helper routines.
A thought: I never really liked that the type checking functions are located in pd.api.types
. Could an idea be to:
- move all the type checking functions (
is_*
/infer_dtype
) to a newpandas.api.type_checkers
- Add the proposes types in this issue to
pandas.api.types
and use that for actual types
The functions is pandas.api.types
would of course go through a deprecation process, but we could hide them in a __dir__
function in the pandas.api.types
module, to make having deprecated functions there less annoying.
I disagree with the comparison to
numpy.typing
because all typing for numpy is included in the package. For pandas, we are providing typing in a separate package.
I would differentiate the two main (at least, in my opinion) purposes of type-hinting: (a) is increasing readability and (b) static type checking. You can accomplish (a) without e.g. pandas-stubs. In this sense, both numpy.typing and pandas.api.typing would contain classes for the user to add type hints to their code, regardless of if that user is going to use static type-checking.
In this sense, both numpy.typing and pandas.api.typing would contain classes for the user to add type hints to their code, regardless of if that user is going to use static type-checking.
I see your point. As long as we document it that way, I won't stand in the way of calling it pandas.api.typing
.
@TomAugspurger commented https://github.com/pandas-dev/pandas/issues/27522#issuecomment-514164921 that if we're disciplined
about making private modules private with a leading underscore,
there's no need for a .api
namespace.
Now if https://github.com/pandas-dev/pandas/pull/48578 is merged we have effectively removed some barriers to making pandas.core
private. #27522 and I think in general there is support to do that.