pandas API: Add DataFrameGroupBy and SeriesGroupBy to the top level API

API: Add DataFrameGroupBy and SeriesGroupBy to the top level API

Open rhshadrach opened this issue 1 year ago • 5 comments

Currently one cannot run from pandas import DataFrameGroupBy, SeriesGroupBy. We should add these classes to the top level as they are particularly important for type-hinting in user code. Users shouldn't need to know they are defined in pandas.core.groupby.generic.

Somewhat related: #6944, #19302

Sep 16 '22 01:09 rhshadrach

makes sense for typing purposes, but does it make sense for users to use these directly?

Sep 16 '22 16:09 jbrockmendel

I would say no - they should only be invoked via DataFrame.groupby or Series.groupby. We could introduce code that raises when third party code calls their __init__ by inspecting the stack, but I would prefer adding the classes to the API docs with the warning that they should not be instantiated directly.

Sep 17 '22 01:09 rhshadrach

I agree this is useful for typing purposes but otherwise not, so I'm -1 on putting them is the main namespace. I suggest they go into pd.api.types namespace, either directly or in a is_groupby function.

Sep 18 '22 05:09 topper-123

Thanks @topper-123. I don't believe is_groupby is sufficient for users when it comes to type-hinting, e.g. def foo(gb: DataFrameGroupBy) -> pd.DataFrame: .... The only downside I see to putting them in pd.api.types is it seems harder for users to discover, but perhaps that will change overtime. I also don't see the benefit of putting them here as opposed to top-level, can you comment on that @topper-123?

Sep 18 '22 21:09 rhshadrach

Hi @rhshadrach. I'm positive to adding the groupby classes to the public namespace and my only hesitance is whether to put them in the main namespace or not.

My feeling is that the main namespace is very crowded already and so should be added to only sparingly, and these two classes will not be instantiated directly from the main namespace, so it may be confusing having them there. pd.api.types seems like that kind where we keep specialized typing functionality, so could fit there.

Sep 20 '22 19:09 topper-123

I think the idea of having the main namespace be "objects / functions users should create / use" is a meaningful one, so +1 on putting this somewhere else. Looking through the current entries, the only potential class I see that a user should likely not be constructing themselves is Flags.

Currently, pd.api.types contains classes and functions for working with dtypes. This would be an expansion of that focus which makes me hesitant here. I am wondering if it would be more appropriate to introduce pd.api.typing for type-hinting related objects, namely classes that should not be directly instantiated and aliases.

Sep 24 '22 14:09 rhshadrach

+1 on somewhere in pd.api, I don't have a very strong preference on pd.api.types or pd.api.typing. Maybe just make update the PR with your preferred solution (pd.api.typing?), remove the draft marker and see the response?

There are probably a few other classes that could be exposed publicly, e..g. Resampler is one I can think of, but there are probably others also.

Sep 25 '22 18:09 topper-123

We should be careful here with respect to what we want to do in the pandas code and encouraging people to use pandas-stubs.

The following code works and type checks just fine if you have pandas-stubs installed:

import pandas as pd
from pandas.core.groupby.generic import DataFrameGroupBy


def myfun(dfgb: DataFrameGroupBy):
    return dfgb.sum()


df = pd.DataFrame(
    {
        "Animal": ["Falcon", "Falcon", "Parrot", "Parrot"],
        "Max Speed": [380.0, 370.0, 24.0, 26.0],
    }
)
dfg: DataFrameGroupBy = df.groupby("Animal")
print(myfun(dfg))

So isn't this just an issue of making sure that types like DataFrameGroupby and SeriesGroupby that show up in the docs are properly documented and then people know how to do the imports?

It's also worth mentioning that in the example above, if you were to do a reveal_type(dfg) with pandas-stubs, you would get _DataFrameGroupByScalar, because we do things in the stubs to return different "classes" based on the arguments passed to groupby().

I don't think this is a typing issue. I think this is more about that we have docs that return classes that are not fully documented, and if you wanted to use those classes in your own type declarations, then you have to know where to import them from.

Sep 29 '22 18:09 Dr-Irv

So isn't this just an issue of making sure that types like DataFrameGroupby and SeriesGroupby that show up in the docs are properly documented and then people know how to do the imports?

While this works, I do not think the current state is a good solution.

It'd be better if users didn't have to go through the depths of the pandas code to find the objects they want to type-hint. Having a single place to pull these classes from for type-hinting is more user friendly.
By having users import from various places in the pandas code, the locations become part of the API and therefore must be deprecated prior to moving. Having all users import from pandas.api.typing gives stability.
It is not clear to users which objects in e.g. pandas.core are public. In particular, if we ever want deprecate pandas.core in favor of pandas._core, the current location of these classes is not viable.

Sep 29 '22 21:09 rhshadrach

It's also worth mentioning that in the example above, if you were to do a reveal_type(dfg) with pandas-stubs, you would get _DataFrameGroupByScalar, because we do things in the stubs to return different "classes" based on the arguments passed to groupby().

Indeed, thanks, I do find that interesting. But I don't see the connection to this issue - are there additional problems incurred if the user was to import DataFrameGroupBy from pandas.api.typing rather than pandas.core.groupby.generic?

Sep 29 '22 21:09 rhshadrach

While this works, I do not think the current state is a good solution.

It'd be better if users didn't have to go through the depths of the pandas code to find the objects they want to type-hint. Having a single place to pull these classes from for type-hinting is more user friendly.

By having users import from various places in the pandas code, the locations become part of the API and therefore must be deprecated prior to moving. Having all users import from pandas.api.typing gives stability.

It is not clear to users which objects in e.g. pandas.core are public. In particular, if we ever want deprecate pandas.core in favor of pandas._core, the current location of these classes is not viable.

I agree with you here, except I'm not convinced that this is only needed for typing, so I'm not sure that calling it pandas.api.typing is correct. Maybe pandas.api.auxclasses (or something like that) to indicate that these are auxiliary classes meant to hold some kind of intermediate results.

Sep 29 '22 21:09 Dr-Irv

It's also worth mentioning that in the example above, if you were to do a reveal_type(dfg) with pandas-stubs, you would get _DataFrameGroupByScalar, because we do things in the stubs to return different "classes" based on the arguments passed to groupby().

Indeed, thanks, I do find that interesting. But I don't see the connection to this issue - are there additional problems incurred if the user was to import DataFrameGroupBy from pandas.api.typing rather than pandas.core.groupby.generic?

I don't think so, but if you put these classes under pandas.api.typing, there might be an expectation that other classes in the stubs should go there as well. I think it's fine that we have classes in the stubs that are not in the pandas source. I just don't think you should be putting these in a typing category, because it might cause confusion.

Sep 29 '22 21:09 Dr-Irv

Thanks @Dr-Irv; I agree with your comments. I've updated the linked PR with aux rather than your suggested auxclasses as it seemed more concise. How does that sound?

Oct 23 '22 14:10 rhshadrach

Thanks @Dr-Irv; I agree with your comments. I've updated the linked PR with aux rather than your suggested auxclasses as it seemed more concise. How does that sound?

Works for me.

Oct 28 '22 19:10 Dr-Irv

I wrote a script to take the classes in pandas that occur in the documentation but aren't in the public API. There are likely some missing classes as well as false positives here, I've crossed off the ones that I can identify and put the ones I'm not familiar enough with in bold.

In addition to the top-level and pandas.api I'm considering the following as already being public: pandas.arrays, pandas.tseries, pandas.tseries.offesets.

Code

import pandas as pd
import glob
import os
import importlib

print(pd.__file__)

public_objects = (
    set(dir(pd))
    | set(dir(pd.api.extensions))
    | set(dir(pd.api.indexers))
    | set(dir(pd.api.interchange))
    | set(dir(pd.api.types))
    | set(dir(pd.arrays))
    | set(dir(pd.tseries))
    | set(dir(pd.tseries.offsets))
    | set(dir(pd.errors))
)

directory = os.path.dirname(pd.__file__)
files = glob.glob(os.path.join(directory, '**/*.py'), recursive=True)
files = [e[len(directory):] for e in files if not e.startswith('/home/richard/dev/pandas/pandas/tests/')]
files = ['pandas' + e.replace('/', '.') for e in files]
files = [e[:-len('.py')] for e in files]

def get_classes(path):
    module = importlib.import_module(path)
    names = [e for e in dir(module) if not e.startswith('_') and not e.startswith('ABC')]
    
    result = []
    for name in names:
        obj = getattr(module, name)
        if getattr(obj, '__module__', None) is None:
            continue
        if not obj.__module__.startswith('pandas'):
            continue
        if not isinstance(obj, type):
            continue
        result.append(name)
    return result

objects = set()
for file in files:
    objects |= set(get_classes(file))

nonpublic = sorted(objects - public_objects)

directory = os.path.abspath(os.path.dirname(os.path.join(pd.__file__, '../../')))
doc_files = glob.glob(os.path.join(directory, 'doc/**/*.rst'), recursive=True)
docs = ''
for filename in doc_files:
    if '/whatsnew/' in filename:
        continue
    docs += '\n'.join(open(filename, 'r').readlines())

for obj in nonpublic:
    if f' {str(obj)} ' in docs or f' {str(obj)}.' in docs:
        print('-', str(obj))

~AbstractHolidayCalendar~; this is in pandas.tseries.holiday
~Apply~
~BlockManager~
~Column~
DataFrameGroupBy
Expanding
ExponentialMovingWindow
~GroupBy~
JsonReader
~Parser~ This doesn't appear to be returned by any methods; in particular it doesn't appear in pandas-stubs
~Properties~ Same as line above
Resampler
Rolling
SeriesGroupBy
StataReader
StataWriter
Styler
TimeGrouper
~Version~
Window
~XlsxWriter~
~disallow~

Nov 09 '22 19:11 rhshadrach

ParserError ParserWarning

These are documented in doc/source/reference/testing.rst and exposed in pandas/errors/__init__.py

Column

This is an object from the dataframe interchange protocol that I think should be an implementation detail.

Nov 09 '22 19:11 mroeschke

You can also consider pandas.errors as public, which should remove ParserError/Warning.

For XlsxWriter, pd.ExcelWriter base class is already public, is that sufficient?

And pandas.io is maybe also semi public (we have some usages of that in the docs)

Nov 09 '22 19:11 jorisvandenbossche

Thanks @mroeschke / @jorisvandenbossche - I've updated the script to include pandas.errors and remove the ParserError/Warning.

For XlsxWriter, pd.ExcelWriter base class is already public, is that sufficient?

If we agree that subclasses of ExcelWriter should not be allowed to add any public methods, then it would be sufficient. That looks to currently be the case, and I think that is a reasonable restriction on the subclasses. I've crossed out XlsxWriter and opened #49602.

And pandas.io is maybe also semi public (we have some usages of that in the docs)

The original motivation of this issue is to start making more clear to users what is / isn't public, so unless pandas.io is fully public, I think it would be helpful to keep those items on the list above. This doesn't necessarily mean they get added to pandas.api.aux or pandas.api.typing, perhaps there is a different way to handle them.

Nov 09 '22 20:11 rhshadrach

I've edited the list above, crossing off a few classes that are purely internal.

From the November dev call, I recall there was support for pandas.api.typing with the main purpose of exposing these classes for users typing their code as well as the similarity to numpy.typing. There was also concern that this may cause users confusion with pandas-stubs and the typing classes there (as in https://github.com/pandas-dev/pandas/issues/48577#issuecomment-1262847817).

My take: while I agree there is potential for some confusion, I don't believe it to be very high, and would prefer pandas.api.typing. However I'm also okay with pandas.api.aux.

Dec 15 '22 02:12 rhshadrach

From the November dev call, I recall there was support for pandas.api.typing with the main purpose of exposing these classes for users typing their code as well as the similarity to numpy.typing. There was also concern that this may cause users confusion with pandas-stubs and the typing classes there (as in #48577 (comment)).

unfortunately I missed the November dev call. I disagree with the comparison to numpy.typing because all typing for numpy is included in the package. For pandas, we are providing typing in a separate package.

My take: while I agree there is potential for some confusion, I don't believe it to be very high, and would prefer pandas.api.typing. However I'm also okay with pandas.api.aux.

I’m concerned about setting a precedent if we use pandas.api.typing.

Dec 15 '22 14:12 Dr-Irv

If the purpose is really only for typing, than it seems most natural to have it at either pd.typing or pd.api.typing. This intended use doesn't really fit the definition of auxiliary, which sounds like a place for some helper routines.

Dec 15 '22 17:12 bashtage

A thought: I never really liked that the type checking functions are located in pd.api.types. Could an idea be to:

move all the type checking functions (is_*/infer_dtype) to a new pandas.api.type_checkers
Add the proposes types in this issue to pandas.api.types and use that for actual types

The functions is pandas.api.types would of course go through a deprecation process, but we could hide them in a __dir__ function in the pandas.api.types module, to make having deprecated functions there less annoying.

Dec 15 '22 22:12 topper-123

I disagree with the comparison to numpy.typing because all typing for numpy is included in the package. For pandas, we are providing typing in a separate package.

I would differentiate the two main (at least, in my opinion) purposes of type-hinting: (a) is increasing readability and (b) static type checking. You can accomplish (a) without e.g. pandas-stubs. In this sense, both numpy.typing and pandas.api.typing would contain classes for the user to add type hints to their code, regardless of if that user is going to use static type-checking.

Dec 16 '22 03:12 rhshadrach

In this sense, both numpy.typing and pandas.api.typing would contain classes for the user to add type hints to their code, regardless of if that user is going to use static type-checking.

I see your point. As long as we document it that way, I won't stand in the way of calling it pandas.api.typing .

Dec 16 '22 19:12 Dr-Irv

@TomAugspurger commented https://github.com/pandas-dev/pandas/issues/27522#issuecomment-514164921 that if we're disciplined about making private modules private with a leading underscore, there's no need for a .api namespace.

Now if https://github.com/pandas-dev/pandas/pull/48578 is merged we have effectively removed some barriers to making pandas.core private. #27522 and I think in general there is support to do that.

Mar 08 '23 13:03 simonjayhawkins

pandas pandas copied to clipboard

API: Add DataFrameGroupBy and SeriesGroupBy to the top level API

pandas
pandas copied to clipboard