pandas-stubs Public pandas API and pandas-stubs

The public API is defined based on the reference page (https://pandas.pydata.org/docs/reference/index.html). Since there are many classes/functions that should be public but are not listed there, it becomes a bit ambiguous what is public and what is not. People often seem to assume that classes/functions that do not start with _ are public by common convention (see for example https://github.com/fphammerle/freesurfer-stats/pull/15; as far as I know, nothing in io/common is public).

Ideally, the docs are updated to make it less of a guessing game which parts are public and which are not. Personally, I think it would be great if pandas used the criteria of a py.typed library to determine what is public (use _, __all__, and redundant imports): that should make it unambiguous what is public and what is not, even if a user never looks at the docs. The best time to add deprecations would be before the 1.5 release!
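To make the three criteria concrete, here is a minimal sketch of how they could be applied to a module or stub source. It is a simplification and not how pandas or any type checker implements the rule: the function name `public_names` and the sample source are made up for illustration, and module-level variables and `from . import A` forms are ignored.

```python
import ast


def public_names(source: str) -> set[str]:
    """Return the names a py.typed-style reading would treat as public (simplified)."""
    tree = ast.parse(source)

    # 1. An explicit __all__ (assumed to be a plain list/tuple literal) wins.
    for node in tree.body:
        if isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name) and target.id == "__all__":
                    return set(ast.literal_eval(node.value))

    names: set[str] = set()
    for node in tree.body:
        # 2. Functions and classes are public unless they start with "_".
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            if not node.name.startswith("_"):
                names.add(node.name)
        # 3. Imports are private unless re-exported with a redundant alias,
        #    i.e. "import A as A" or "from X import A as A".
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                if alias.asname == alias.name and not alias.name.startswith("_"):
                    names.add(alias.name)
    return names


# Illustrative source only; the imports are never executed, just parsed.
example = '''
from pandas.core.frame import DataFrame as DataFrame
from pandas.core.internals import BlockManager
def read_thing(): ...
def _helper(): ...
'''
print(sorted(public_names(example)))  # ['DataFrame', 'read_thing']
```

Under this reading, `BlockManager` is imported but not re-exported, so it would not count as public even though its name has no leading underscore.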

In the meantime, it would be great to have a discussion of what should be in the stubs and what shouldn't be in here to avoid more confusion/ambiguity (don't want to "endorse" something as public that is not meant to be public). One rather strict approach could be to add only new classes/functions that are reported through "reasonable" issues: annotate what is being used. (edit: another option is to only add annotations for what is referenced in the various api.pys)

cc @Dr-Irv @bashtage @jreback @simonjayhawkins

Jul 22 '22 14:07 twoertwein

There are a few issues here that come to me right now about this:

1. What should pandas document (in the documentation, and in the code) about what is public and what is private? Should pandas code start making things more private? (Maybe that should be a new pandas issue)
2. Should we or can we (without causing CI failures) remove things from pandas-stubs that we know are not public? For example, there are classes and functions in pandas/core/internals that we know are not public, so should we just remove the stub?
3. Somewhat related to (2). The stubs were initially generated based on pandas 1.1 or 1.2, with stubgen. While the pandas public API is mostly stable, the internal API has evolved. The stubs include declarations for internal methods that have evolved. Should the stubs evolve as well to keep the internal methods properly documented?
4. The convention with the stubs that I've used so far is that if someone uncovers an issue with the public API where the stubs are incorrect, we fix that. It's a "whack-a-mole" approach. But we don't have a process yet for incorporating changes to the public API into the stubs, unless someone points it out.
5. We should decide whether the stubs should support deprecated methods. For example, DataFrame.append() is now deprecated. Should we remove it from the stubs to discourage use?

Jul 22 '22 14:07 Dr-Irv

1. I think it would be a good idea for pandas to explicitly privatize things. There are many objects that are clearly intended to be private but are not clearly marked as such.
2. I would vote yes since it will dramatically reduce the surface area of pandas-stubs and make it easier to test and maintain.
3. If we go for 2, then the only internal objects that would need to be tracked are those that are needed to type public classes and functions.
4. Syncing formally with the public API would be a good start. I was wondering if we could use some existing pandas tests to make sure that typing is correct.
5. I think stubs should have a pandas target. For example, today I would make it 1.4.x. Once 1.5 is out, I think targeting 1.5 would be good. Anything that works correctly in the targeted version, even if deprecated, should be included (subject to point 2).

Jul 22 '22 14:07 bashtage

5. We should decide whether the stubs should support deprecated methods. For example, DataFrame.append() is now deprecated. Should we remove it from the stubs to discourage use?

I would prefer removing annotations for deprecated methods: that would also be consistent with for example the keyword-only arguments in read_csv and many other functions (positional arguments are still accepted but the implementation emits a deprecation warning).
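A rough sketch of that keyword-only situation, to show why the annotation and the runtime can differ. The decorator `warn_on_positional` and the function `read_table_like` are hypothetical stand-ins written for this example; they are not the pandas implementation of read_csv.

```python
import functools
import warnings


def warn_on_positional(allowed: int):
    """Warn when more than `allowed` arguments are passed positionally."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if len(args) > allowed:
                warnings.warn(
                    f"Passing more than {allowed} positional argument(s) to "
                    f"{func.__name__} is deprecated; use keywords instead.",
                    FutureWarning,
                    stacklevel=2,
                )
            return func(*args, **kwargs)

        return wrapper

    return decorator


# Runtime: positional use still works, but emits a deprecation warning.
@warn_on_positional(allowed=1)
def read_table_like(path, sep=",", header=0):
    return (path, sep, header)


# What a stub that only describes the non-deprecated usage might declare:
#     def read_table_like(path: str, *, sep: str = ..., header: int = ...) -> ...
# A type checker using that stub would flag read_table_like("a.csv", ";"),
# even though the call below still succeeds at runtime.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    read_table_like("a.csv", ";")        # deprecated calling convention
    read_table_like("a.csv", sep=";")    # supported calling convention
print(len(caught))  # 1
```

Removing a deprecated method such as DataFrame.append() from the stubs would follow the same pattern: the call keeps working at runtime for now, while the type checker already steers users away from it.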

Jul 22 '22 14:07 twoertwein

4. Syncing formally with the public API would be a good start. I was wondering if we could use some existing pandas tests to make sure that typing is correct.

agree. This was one of the main reasons that I supported a separate pandas-stubs repo. https://mail.python.org/pipermail/pandas-dev/2022-April/001462.html

A) Let's not manage the public facing stubs as part of the pandas project, and have a separate pandas-stubs project that we manage, using the MS stubs as a starting point.

Originally it was decided that this would be a maintenance burden and might lead to inconsistencies. I think it is fine to revisit this in light of a couple of years of lessons learnt, and because there is now a public API typing test framework: we may be able to reduce (or eliminate) the inconsistencies if the same tests are run on the pandas code and the pandas stubs.

I think this is probably much better than the "whack-a-mole" approach, even though that method is commonly used and accepted for typing stubs.
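As a rough illustration of what "the same tests" could look like, a single test can be both executed at runtime and checked statically. The sketch below assumes a hypothetical helper named `check`, and uses a builtin instead of a pandas call so that it runs without pandas installed; it is not taken from either repository.

```python
from typing import Any, TypeVar

try:  # typing.assert_type is available from Python 3.11
    from typing import assert_type
except ImportError:  # older Pythons: mimic the runtime behaviour (return the value)
    def assert_type(val: Any, typ: Any) -> Any:
        return val

T = TypeVar("T")


def check(actual: T, klass: type) -> T:
    """Runtime half of the test: fail if the value is not an instance of klass."""
    if not isinstance(actual, klass):
        raise RuntimeError(f"Expected {klass!r}, got {type(actual)!r}")
    return actual


def test_len() -> None:
    # Static half: a type checker verifies that the annotation says `int`.
    # Runtime half: check() verifies that the actual value is an int.
    check(assert_type(len("abc"), int), int)


test_len()
```

If the annotation and the implementation disagree, either the type checker or the test run fails, which is the kind of inconsistency the shared tests are meant to catch.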

There are a few issues here that come to me right now about this:

1. What should pandas document (in the documentation, and in the code) about what is public and what is private? Should pandas code start making things more private? (Maybe that should be a new pandas issue)

yes. any issues on what should/should not be public/private should be discussed on the main repo.

2. Should we or can we (without causing CI failures) remove things from pandas-stubs that we know are not public? For example, there are classes and functions in pandas/core/internals that we know are not public, so should we just remove the stub?

yes and yes. agree with @bashtage https://github.com/pandas-dev/pandas-stubs/issues/161#issuecomment-1192639372. Typing helps identify objects that should be public from the return types, such as the iterators for the IO chunk reads. But then only a few methods of the object would probably be considered public.

3. Somewhat related to (2). The stubs were initially generated based on pandas 1.1 or 1.2, with stubgen. While the pandas public API is mostly stable, the internal API has evolved. The stubs include declarations for internal methods that have evolved. Should the stubs evolve as well to keep the internal methods properly documented?

n/a given 2.

4. The convention with the stubs that I've used so far is that if someone uncovers an issue with the public API where the stubs are incorrect, we fix that. It's a "whack-a-mole" approach. But we don't have a process yet for incorporating changes to the public API into the stubs, unless someone points it out.

see response above but ok for now.

5. We should decide whether the stubs should support deprecated methods. For example, DataFrame.append() is now deprecated. Should we remove it from the stubs to discourage use?

agree with @twoertwein https://github.com/pandas-dev/pandas-stubs/issues/161#issuecomment-1192659646

Jul 23 '22 10:07 simonjayhawkins
