netcdf4-python icon indicating copy to clipboard operation
netcdf4-python copied to clipboard

Add static typing support

Open headtr1ck opened this issue 1 year ago • 28 comments

Closes: #1298

Lots of this PR was stolen from https://github.com/Unidata/netcdf4-python/pull/1279

Notice: I intentionally reversed the import behavior of __init__.py and _netCDF4.pyx in the stubs. This should be more in line in how people are using netCDF4. And while doing so I also removed the "._netCDF4" part of the classname in the reprs.

The current typing is working with python 3.11. TODO:

  • [x] ~~Make typing pass with python 3.8 (aka. minimum supported version)~~ Won't do

headtr1ck avatar Dec 15 '23 21:12 headtr1ck

CLA assistant check
All committers have signed the CLA.

CLAassistant avatar Dec 15 '23 21:12 CLAassistant

Thanks! Looks good - but why did you add a new workflow to run stubtest, instead of adding the extra test to the existing workflows?

jswhit avatar Dec 29 '23 16:12 jswhit

why did you add a new workflow to run stubtest, instead of adding the extra test to the existing workflows?

Good question. I just took the changes from the previous PR #1279

After the holidays I can continue to work on this and use the existing workflows.

headtr1ck avatar Dec 29 '23 16:12 headtr1ck

Looks you like need to add git submodule update --init --recursive after the git clone in stubtest.yml.

jswhit avatar Dec 29 '23 17:12 jswhit

I was the author of #1279 , something went wrong with my pr and when I noticed this one was already made. Which is perfectly fine but now this pr has been dead for 5 months. The only problem I see now with this pr is that mypy was not installed in the the workflow. Could this just be added and move on? and merge latest version in this branch

Woefie avatar Jun 06 '24 08:06 Woefie

Hey, I was just very busy the last months. I can try to get it to work the next days.

And thanks for your initial work, it was very helpful to get started.

headtr1ck avatar Jun 06 '24 09:06 headtr1ck

I have now made Variable a generic class to solve the datatype problem. Unfortunately, at least for me, it does not seem to be able to resolve the non-np.dtype np.DTypeLikes. So if you as datatype e.g. netCDF4.VLType or np.int32 it works, but "i4" or "float32" will not resolve to np.dtype but to their Literals. I have not been able to solve this problem.

But I think it is already better than before and should also not block merging this PR. We can always fix this in the future as it is a minor problem.

One sad thing, is that we have to return Any for __getitem__ at Dataset/Group, because statically we cannot know if it returns a group or a variable. I don't think there is any solution for that.

headtr1ck avatar Jun 07 '24 21:06 headtr1ck

One sad thing, is that we have to return Any for __getitem__ at Dataset/Group, because statically we cannot know if it returns a group or a variable. I don't think there is any solution for that.

Wouldn't a Union work? def __getitem__(self, elem: str) -> Union[Group, Variable]: ...

Woefie avatar Jun 08 '24 17:06 Woefie

Wouldn't a Union work? def __getitem__(self, elem: str) -> Union[Group, Variable]: ...

Technically yes, but this will result in many false positive errors from mypy on user side. The recommended solution for that is to return Any and never return Unions.

headtr1ck avatar Jun 08 '24 17:06 headtr1ck

A sidenote:

  • I have removed the _netCDF4 module path from the reprs, I doubt that should affect anyone but one could consider this a breaking change.
  • I was adding always_mask to private_attrs, because I think it is correct after reading the C code (As well as to type it as read-only property).
  • Does anyone have objections on the matter that from a typing perspective the classes live in the __init__ and not in the _netCDF4 module?

Other than that, this is good to go :)

headtr1ck avatar Jun 19 '24 19:06 headtr1ck

A sidenote:

  • I have removed the _netCDF4 module path from the reprs, I doubt that should affect anyone but one could consider this a breaking change.
  • I was adding always_mask to private_attrs, because I think it is correct after reading the C code (As well as to type it as read-only property).
  • Does anyone have objections on the matter that from a typing perspective the classes live in the __init__ and not in the _netCDF4 module?

Other than that, this is good to go :)

@headtr1ck please see my questions in the review

jswhit avatar Jun 20 '24 15:06 jswhit

@headtr1ck please see my questions in the review

I don't see a review? Are you sure you published it?

headtr1ck avatar Jun 20 '24 16:06 headtr1ck

Don't know if anyone noticed, but dtype_is_complex seems to be wrong? It only works if you pass string literals like "c8" or "c16", but not if you pass a numpy dtype object like np.dtype("c8").

Edit: nvmd, actually np.dtype("c8") == "c8" returns True. Disregard my comment

headtr1ck avatar Jun 20 '24 20:06 headtr1ck

@headtr1ck please see my questions in the review

I don't see a review? Are you sure you published it?

oops, just submitted it

jswhit avatar Jun 21 '24 16:06 jswhit

@jswhit any reason, __has_parallel_support__ and __has_ncfilter__ were not imported at netCDF4 level (only available in netCDF4._netCDF4? I imported them now, but can undo that.

headtr1ck avatar Jun 21 '24 19:06 headtr1ck

Ugh, I wish I'd found this PR sooner. I've been working on the same thing but as a stubs package. Looks like I have some things to learn from your PR, like returning Any from __getitem__ rather than a Union.

One fun side-quest I took was writing a script with libcst to enable users to import the docstrings into the stub file.

RandallPittmanOrSt avatar Jun 21 '24 23:06 RandallPittmanOrSt

Ugh, I wish I'd found this PR sooner. I've been working on the same thing but as a stubs package. Looks like I have some things to learn from your PR, like returning Any from __getitem__ rather than a Union.

That's very unfortunate, bad timing, haha. After so many years to start this project at the same time. It seems that you also have some good ideas of better supporting numpy arrays with dtypes. Maybe we can merge these things somehow?

One fun side-quest I took was writing a script with libcst to enable users to import the docstrings into the stub file.

I don't really know what the advantages are of having the docstrings in the stub files. Do you have an example?

headtr1ck avatar Jun 23 '24 14:06 headtr1ck

@headtr1ck

Maybe we can merge these things somehow?

Yes. I have gone through our two stubs file side-by-side on paper and will make a few comments here today. I found some fairly minor errors/omissions, I have some thoughts on making Variable generic, and I have some thoughts on some of the TypeAliases.

I don't really know what the advantages are of having the docstrings in the stub files. Do you have an example?

Primarily pop-up help in IDEs, e.g. VSCode. Of course one can use help() in a REPL but in my mind having both the types and the docstring available in the IDE is nice, especially for obtuse methods you may not come across frequently. Perhaps "discoverability" is the buzzword. In any case, I decided that integrating the docstrings into the stubs should be done programmatically (keeping stubs up-to-date with code is enough of a chore without wading through all the docstrings). I tried making it a build step for my stubs package but it depends on which netCDF4 version the user has and that was too confusing. So I decided just to make it a script that came with the stubs so users could run it at-will if they wanted the docstrings.

If the netCDF4-python maintainers want that script they are welcome to it but I'm fine with just releasing it as a standalone tool (adapted for these integrated stubs).

RandallPittmanOrSt avatar Jun 24 '24 15:06 RandallPittmanOrSt

I think having the docstrings in two places sounds like a bad idea. This would require some CI checks to test if they are the same etc and would make future PRs complicated.

If the current docstrings are not shown in IDEs (I haven't checked that) then one could argue to simply move them over to the stub files all together, but probably in a future PR.

I guess it is up to the maintainers to decide such things.

headtr1ck avatar Jun 24 '24 15:06 headtr1ck

I think having the docstrings in two places sounds like a bad idea. This would require some CI checks to test if they are the same etc and would make future PRs complicated.

I agree completely that they should not be in two places in the netCDF4-python repo. That's why I settled on the idea that the script should be run by users post-install, and only if desired. Even that does present the risk that a user might not re-run the script on updating netCDF4 and have outdated docstrings in their IDE, but that seems fairly low-risk for the benefit (IMO).

If the current docstrings are not shown in IDEs (I haven't checked that) then one could argue to simply move them over to the stub files all together, but probably in a future PR.

I guess it is up to the maintainers to decide such things.

No, I don't think so, because then the docstrings would be lost from the runtime and, therefore, the REPL. As far as I can tell, the docstrings do belong in the source.

If, down the road, the API and Cython implementation were ever split, then all the doctrings and annotations could live in the API (.py file).

RandallPittmanOrSt avatar Jun 24 '24 15:06 RandallPittmanOrSt

I would say, we focus on typing first in this PR and then open a separate issue for the docstrings. Maybe there is a way to copy over the docstrings when building wheels, so at least there is only a single source?

headtr1ck avatar Jun 24 '24 19:06 headtr1ck

Variable type

I'd like to address the question of Variable types and making Variable generic in its own comment.

I think there are some questions we need to try to answer:

  1. What is the benefit of making Variable generic?
    • We probably don't want to use the type in __getitem__() because we don't know if the return type will be an array or a scalar value--it depends on masking configuration of the Dataset and dimensions of the array and the key.
    • We probably don't want to use the type in getValue() because it could return either a scalar or a masked array.
    • It doesn't make sense to use the type in __setitem__ because __setitem__ is very generous with what types it accepts.
    • If we know the type then we can (at least it some/most cases?) annotate the .datatype and .dtype properties... with the caveat that propertly annotating @overload on properties is a bit of an unknown (see here and here)
  2. What are allowable types for Variable (What bound for VarT = TypeVar("VarT", bound=???) used for class Variable(Generic[VarT]): ..., if anything)?
    • CompoundType, VLType, and EnumType are not valid because only an instance of any of those types can be the type of a Variable.
    • Does it make more sense to use str or np.str_? Or neither, since it's stored as a VLType in the file? str is the type of returned scalars and class the VLType uses.
    • To me, (if we are specifying bound=) it does not make sense to allow any type literals ("f4", "S1", etc.) as those are not types in and of themselves.

I did some research and got this table of results:

createVariable datatype parameter Variable.datatype Variable.dtype Notes
Literal for a scalar numeric ("i4") except complex types np.dtype of that type np.dtype of that type Cannot infer statically without @overload for every single literal of that type
Literal "c8" or "c16" CompoundType instance np.dtype('complex64') or np.dtype('complex128') Cannot infer statically without @overload for every single literal of that type
numpy scalar numeric type (np.int8) or scalar numeric dtype (np.dtype(np.uint16)) except complex types np.dtype of that type np.dtype of that type Can infer statically with a generic @overload
numpy complex scalar numeric type (np.complex64) or scalar numeric dtype (np.dtype(np.complex64)) CompoundType instance np.dtype('complex64') or np.dtype('complex128') Can infer statically with a generic @overload. datatype overload must be separate
int np.dtype(int64) np.dtype('int64') Could add an @overload for this. Is np.dtype(int) == np.dtype('int64') on all platforms?
float np.dtype(float64) np.dtype('float64') Could add an @overload for this. Is np.dtype(float) == np.dtype('float64') on all platforms?
str or np.str_ or dtype(str) or dtype(np.str_) VLType instance with .name is None and .dtype == str str Could be provided by a specific @overload. I think the Variable type should be str.
"S1" or "c" np.dtype('S1') np.dtype('S1') Could maybe use np.bytes_ for the type??
"S2" or any other number, "U" or "U1" or U with any number, or np.dtype with those literals VLType instance with .name is None and .dtype == str str Undocumented. Don't include in stubs except allowing for in an @overload with Any
CompoundType instance CompoundType instance structured array dtype I don't think this can be annotated
EnumType instance EnumType instance dtype of the instance I don't think this can be annotated
VLType instance VLType instance dtype of the instance I don't think this can be annotated

The following cannot be used as datatype parameters:

  • np.bytes_
  • Other np.float or np.complex precisions.
  • bool or np.bool_
  • Strutured np.dtype -- e.g. st1 = np.dtype([("x", np.uint8), ("y", np.float32)]). A CompoundType must be created first.

Based on all of the above, I propose the following truncated implementation: https://gist.github.com/RandallPittmanOrSt/ce695f0b7d7717493649909d1926314a

Edited to add: ~For better or worse, it turns out that the various Literals I proposed are of no use, unless we specifically annotate the @overload for each and every one.~ Nevermind, I think I can work around this.

RandallPittmanOrSt avatar Jun 24 '24 19:06 RandallPittmanOrSt

I will add, I didn't have any problems with my stub file working under Python 3.8, in fact I developed it under 3.8. I did have to get Buffer, Self, and TypeAlias from typing_extensions as you did, but otherwise followed the docs and had no issues: https://typing.readthedocs.io/en/latest/source/stubs.html. I think yours should work fine under 3.8.

RandallPittmanOrSt avatar Jun 24 '24 23:06 RandallPittmanOrSt

@jswhit any reason, __has_parallel_support__ and __has_ncfilter__ were not imported at netCDF4 level (only available in netCDF4._netCDF4? I imported them now, but can undo that.

No reason, just an oversight on my part. OK to fix.

jswhit avatar Jun 25 '24 18:06 jswhit

I would say I leave it like this. The Variable proposal with all the Descriptors looks quite hacky and is beyond my knowledge. I propose to merge as is, and @RandallPittmanOrSt can open a follow-up PR to improve things.

headtr1ck avatar Jun 26 '24 20:06 headtr1ck

stubtest is failing

jswhit avatar Jun 27 '24 01:06 jswhit

stubtest is failing

Can you try again please?

headtr1ck avatar Jun 27 '24 05:06 headtr1ck

stubtest is failing

Can you try again please?

still failing

jswhit avatar Jun 27 '24 19:06 jswhit

waiting for @RandallPittmanOrSt to complete his review

jswhit avatar Jul 06 '24 16:07 jswhit

@headtr1ck what is the procedure for updating this when a new function is added to the API (for example, PR #1348)?

jswhit avatar Jul 07 '24 14:07 jswhit