netcdf4-python
Add static typing support
Closes: #1298
Lots of this PR was stolen from https://github.com/Unidata/netcdf4-python/pull/1279
Notice: I intentionally reversed the import behavior of `__init__.py` and `_netCDF4.pyx` in the stubs. This should be more in line with how people are using netCDF4. While doing so, I also removed the "._netCDF4" part of the class name in the reprs.
The current typing works with Python 3.11. TODO:
- [x] ~~Make typing pass with Python 3.8 (aka. the minimum supported version)~~ Won't do
Thanks! Looks good - but why did you add a new workflow to run stubtest, instead of adding the extra test to the existing workflows?
> why did you add a new workflow to run stubtest, instead of adding the extra test to the existing workflows?

Good question. I just took the changes from the previous PR #1279.
After the holidays I can continue to work on this and use the existing workflows.
Looks like you need to add `git submodule update --init --recursive` after the `git clone` in `stubtest.yml`.
I was the author of #1279; something went wrong with my PR, and by the time I noticed, this one had already been made. Which is perfectly fine, but now this PR has been dead for 5 months. The only problem I see now with this PR is that mypy was not installed in the workflow. Could this just be added so we can move on? And merge the latest version into this branch.
Hey, I was just very busy these last few months. I can try to get it to work in the next few days.
And thanks for your initial work, it was very helpful for getting started.
I have now made `Variable` a generic class to solve the datatype problem.
Unfortunately, at least for me, it does not seem to be able to resolve the non-`np.dtype` `np.DTypeLike`s. So if you pass as datatype e.g. `netCDF4.VLType` or `np.int32`, it works, but `"i4"` or `"float32"` will not resolve to `np.dtype` but to their Literals. I have not been able to solve this problem.
But I think it is already better than before and should not block merging this PR. We can always fix this in the future, as it is a minor problem.
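A minimal sketch of the situation (illustrative names only, not the stub's actual definitions): a plain `TypeVar` captures whatever the caller passes, so string dtype codes are inferred as `str`/`Literal` rather than being resolved to `np.dtype`:

```python
from typing import Generic, TypeVar

import numpy as np

VarT = TypeVar("VarT")

class Variable(Generic[VarT]):
    def __init__(self, datatype: VarT) -> None:
        self.datatype = datatype

# A numpy scalar type is captured as-is by the TypeVar...
v1 = Variable(np.int32)  # checkers infer Variable[type[np.int32]]
# ...but a dtype code is just a string to the type checker; nothing
# maps it to np.dtype without an explicit @overload per literal.
v2 = Variable("i4")      # checkers infer Variable[str]
```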
One sad thing is that we have to return `Any` for `__getitem__` at `Dataset`/`Group`, because statically we cannot know if it returns a group or a variable. I don't think there is any solution for that.
> One sad thing is that we have to return `Any` for `__getitem__` at Dataset/Group, because statically we cannot know if it returns a group or a variable. I don't think there is any solution for that.
Wouldn't a Union work?
`def __getitem__(self, elem: str) -> Union[Group, Variable]: ...`
> Wouldn't a Union work?
> `def __getitem__(self, elem: str) -> Union[Group, Variable]: ...`
Technically yes, but this will result in many false-positive errors from mypy on the user side. The recommended solution is to return `Any` and never return Unions.
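To illustrate the false positives (hypothetical minimal classes, not the real netCDF4 ones): with a Union return, mypy rejects attribute access that is perfectly valid at runtime, forcing users to narrow on every lookup:

```python
from typing import Union

class Group: ...

class Variable:
    def getValue(self) -> int:
        return 42

class Dataset:
    def __getitem__(self, elem: str) -> Union[Group, Variable]:
        # Toy body for illustration; the real lookup depends on file contents.
        return Variable()

ds = Dataset()
item = ds["temp"]
# item.getValue() alone would be flagged by mypy:
#   Item "Group" of "Group | Variable" has no attribute "getValue"
# so every access needs isinstance() (or a cast) first:
assert isinstance(item, Variable)
print(item.getValue())
```

With `-> Any` instead, users can call `ds["temp"].getValue()` directly, at the cost of losing the type check.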
A sidenote:
- I have removed the `_netCDF4` module path from the reprs. I doubt that should affect anyone, but one could consider this a breaking change.
- I added `always_mask` to `private_attrs`, because I think it is correct after reading the C code (as well as typing it as a read-only property).
- Does anyone have objections to the fact that, from a typing perspective, the classes live in `__init__` and not in the `_netCDF4` module?

Other than that, this is good to go :)
@headtr1ck please see my questions in the review
I don't see a review? Are you sure you published it?
Don't know if anyone noticed, but `dtype_is_complex` seems to be wrong? It only works if you pass string literals like `"c8"` or `"c16"`, but not if you pass a numpy dtype object like `np.dtype("c8")`.
Edit: nvmd, actually `np.dtype("c8") == "c8"` returns `True`. Disregard my comment.
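Confirming the edit above: numpy's dtype equality coerces the other operand to a dtype before comparing, so a string-literal check also covers dtype objects:

```python
import numpy as np

# np.dtype.__eq__ converts its argument to a dtype before comparing,
# so "c8", np.complex64, and np.dtype("c8") all compare equal.
print(np.dtype("c8") == "c8")          # True
print(np.dtype("c8") == np.complex64)  # True
print(np.dtype("c8") == "c16")         # False (different precision)
```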
> > @headtr1ck please see my questions in the review
>
> I don't see a review? Are you sure you published it?
oops, just submitted it
@jswhit any reason `__has_parallel_support__` and `__has_ncfilter__` were not imported at `netCDF4` level (only available in `netCDF4._netCDF4`)?
I imported them now, but can undo that.
Ugh, I wish I'd found this PR sooner. I've been working on the same thing, but as a stubs package. Looks like I have some things to learn from your PR, like returning `Any` from `__getitem__` rather than a `Union`.
One fun side-quest I took was writing a script with libcst to enable users to import the docstrings into the stub file.
> Ugh, I wish I'd found this PR sooner. I've been working on the same thing but as a stubs package. Looks like I have some things to learn from your PR, like returning `Any` from `__getitem__` rather than a `Union`.
That's very unfortunate, bad timing, haha. After so many years, we both started this project at the same time. It seems that you also have some good ideas for better supporting numpy arrays with dtypes. Maybe we can merge these things somehow?
> One fun side-quest I took was writing a script with libcst to enable users to import the docstrings into the stub file.
I don't really know what the advantages are of having the docstrings in the stub files. Do you have an example?
@headtr1ck

> Maybe we can merge these things somehow?

Yes. I have gone through our two stub files side-by-side on paper and will make a few comments here today. I found some fairly minor errors/omissions, I have some thoughts on making `Variable` generic, and I have some thoughts on some of the `TypeAlias`es.
> I don't really know what the advantages are of having the docstrings in the stub files. Do you have an example?
Primarily pop-up help in IDEs, e.g. VSCode. Of course one can use `help()` in a REPL, but in my mind having both the types and the docstring available in the IDE is nice, especially for obtuse methods you may not come across frequently. Perhaps "discoverability" is the buzzword. In any case, I decided that integrating the docstrings into the stubs should be done programmatically (keeping stubs up-to-date with code is enough of a chore without wading through all the docstrings). I tried making it a build step for my stubs package, but it depends on which netCDF4 version the user has, and that was too confusing. So I decided just to make it a script that comes with the stubs so users can run it at will if they want the docstrings.
If the netCDF4-python maintainers want that script they are welcome to it, but I'm fine with just releasing it as a standalone tool (adapted for these integrated stubs).
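For illustration, a stub entry with the docstring merged in might look like this (a hypothetical `.pyi` fragment; the signature and wording are illustrative, not netCDF4's actual stubs):

```python
# Hypothetical stub fragment: with the docstring present in the .pyi,
# IDEs can show it in pop-up help alongside the annotations.
class Variable:
    def ncattrs(self) -> list[str]:
        """Return the netCDF attribute names for this Variable."""
        ...
```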
I think having the docstrings in two places sounds like a bad idea. This would require CI checks to test that they are the same, etc., and would make future PRs complicated.
If the current docstrings are not shown in IDEs (I haven't checked that) then one could argue to simply move them over to the stub files all together, but probably in a future PR.
I guess it is up to the maintainers to decide such things.
> I think having the docstrings in two places sounds like a bad idea. This would require some CI checks to test if they are the same etc and would make future PRs complicated.
I agree completely that they should not be in two places in the netCDF4-python repo. That's why I settled on the idea that the script should be run by users post-install, and only if desired. Even that does present the risk that a user might not re-run the script on updating netCDF4 and have outdated docstrings in their IDE, but that seems fairly low-risk for the benefit (IMO).
> If the current docstrings are not shown in IDEs (I haven't checked that) then one could argue to simply move them over to the stub files all together, but probably in a future PR.
> I guess it is up to the maintainers to decide such things.
No, I don't think so, because then the docstrings would be lost from the runtime and, therefore, the REPL. As far as I can tell, the docstrings do belong in the source.
If, down the road, the API and Cython implementation were ever split, then all the docstrings and annotations could live in the API (`.py` file).
I would say, we focus on typing first in this PR and then open a separate issue for the docstrings. Maybe there is a way to copy over the docstrings when building wheels, so at least there is only a single source?
**Variable type**

I'd like to address the question of `Variable` types and making `Variable` generic in its own comment.
I think there are some questions we need to try to answer:

- What is the benefit of making `Variable` generic?
  - We probably don't want to use the type in `__getitem__()` because we don't know if the return type will be an array or a scalar value--it depends on the masking configuration of the `Dataset` and the dimensions of the array and the key.
  - We probably don't want to use the type in `getValue()` because it could return either a scalar or a masked array.
  - It doesn't make sense to use the type in `__setitem__` because `__setitem__` is very generous with what types it accepts.
  - If we know the type then we can (at least in some/most cases?) annotate the `.datatype` and `.dtype` properties... with the caveat that properly annotating `@overload` on properties is a bit of an unknown (see here and here).
- What are allowable types for `Variable` (what `bound` for `VarT = TypeVar("VarT", bound=???)` used for `class Variable(Generic[VarT]): ...`, if anything)?
  - `CompoundType`, `VLType`, and `EnumType` are not valid because only an instance of any of those types can be the type of a `Variable`.
  - Does it make more sense to use `str` or `np.str_`? Or neither, since it's stored as a `VLType` in the file? `str` is the type of returned scalars and the class the `VLType` uses.
  - To me (if we are specifying `bound=`), it does not make sense to allow any type literals (`"f4"`, `"S1"`, etc.) as those are not types in and of themselves.
I did some research and got this table of results:

| `createVariable` datatype parameter | `Variable.datatype` | `Variable.dtype` | Notes |
|---|---|---|---|
| Literal for a scalar numeric (`"i4"`) except complex types | `np.dtype` of that type | `np.dtype` of that type | Cannot infer statically without `@overload` for every single literal of that type |
| Literal `"c8"` or `"c16"` | `CompoundType` instance | `np.dtype('complex64')` or `np.dtype('complex128')` | Cannot infer statically without `@overload` for every single literal of that type |
| numpy scalar numeric type (`np.int8`) or scalar numeric dtype (`np.dtype(np.uint16)`) except complex types | `np.dtype` of that type | `np.dtype` of that type | Can infer statically with a generic `@overload` |
| numpy complex scalar numeric type (`np.complex64`) or scalar numeric dtype (`np.dtype(np.complex64)`) | `CompoundType` instance | `np.dtype('complex64')` or `np.dtype('complex128')` | Can infer statically with a generic `@overload`. `datatype` overload must be separate |
| `int` | `np.dtype('int64')` | `np.dtype('int64')` | Could add an `@overload` for this. Is `np.dtype(int) == np.dtype('int64')` on all platforms? |
| `float` | `np.dtype('float64')` | `np.dtype('float64')` | Could add an `@overload` for this. Is `np.dtype(float) == np.dtype('float64')` on all platforms? |
| `str` or `np.str_` or `dtype(str)` or `dtype(np.str_)` | `VLType` instance with `.name is None` and `.dtype == str` | `str` | Could be provided by a specific `@overload`. I think the `Variable` type should be `str`. |
| `"S1"` or `"c"` | `np.dtype('S1')` | `np.dtype('S1')` | Could maybe use `np.bytes_` for the type?? |
| `"S2"` or any other number, `"U"` or `"U1"` or `U` with any number, or `np.dtype` with those literals | `VLType` instance with `.name is None` and `.dtype == str` | `str` | Undocumented. Don't include in stubs except allowing for in an `@overload` with `Any` |
| `CompoundType` instance | `CompoundType` instance | structured array dtype | I don't think this can be annotated |
| `EnumType` instance | `EnumType` instance | dtype of the instance | I don't think this can be annotated |
| `VLType` instance | `VLType` instance | dtype of the instance | I don't think this can be annotated |
The following cannot be used as `datatype` parameters:

- `np.bytes_`
- Other `np.float` or `np.complex` precisions.
- `bool` or `np.bool_`
- Structured `np.dtype` -- e.g. `st1 = np.dtype([("x", np.uint8), ("y", np.float32)])`. A `CompoundType` must be created first.
Based on all of the above, I propose the following truncated implementation: https://gist.github.com/RandallPittmanOrSt/ce695f0b7d7717493649909d1926314a
Edited to add: ~~For better or worse, it turns out that the various `Literal`s I proposed are of no use, unless we specifically annotate the `@overload` for each and every one.~~ Nevermind, I think I can work around this.
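The "can infer statically with a generic `@overload`" rows in the table suggest an overload ladder. A minimal sketch of the idea (illustrative names `NumT`, `Variable`, `Dataset`; not the gist's actual code):

```python
from typing import Any, Generic, TypeVar, overload

import numpy as np

# Bound to np.generic so only numpy scalar types are captured generically.
NumT = TypeVar("NumT", bound=np.generic)

class Variable(Generic[NumT]):
    pass

class Dataset:
    @overload
    def createVariable(self, varname: str, datatype: type[NumT]) -> Variable[NumT]: ...
    @overload
    def createVariable(self, varname: str, datatype: Any) -> Variable[Any]: ...
    def createVariable(self, varname, datatype):
        # Toy runtime body; the real one creates a netCDF variable.
        return Variable()

ds = Dataset()
v = ds.createVariable("temp", np.int32)  # checkers infer Variable[np.int32]
```

String literals like `"i4"` fall through to the `Any` overload, matching the "cannot infer statically" rows.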
I will add, I didn't have any problems with my stub file working under Python 3.8; in fact, I developed it under 3.8. I did have to get `Buffer`, `Self`, and `TypeAlias` from `typing_extensions` as you did, but otherwise followed the docs and had no issues: https://typing.readthedocs.io/en/latest/source/stubs.html. I think yours should work fine under 3.8.
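For reference, a sketch of the fallback imports described above (`TypeAlias` landed in `typing` in 3.10 and `Self` in 3.11; `typing_extensions` backports both; the alias name below is purely illustrative):

```python
import sys

# On new interpreters take the names from typing; on 3.8/3.9 fall back
# to the typing_extensions backports, as mentioned in the comment above.
if sys.version_info >= (3, 11):
    from typing import Self, TypeAlias
else:
    from typing_extensions import Self, TypeAlias

DatatypeSpec: TypeAlias = str  # hypothetical alias for illustration
```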
> @jswhit any reason `__has_parallel_support__` and `__has_ncfilter__` were not imported at `netCDF4` level (only available in `netCDF4._netCDF4`)? I imported them now, but can undo that.
No reason, just an oversight on my part. OK to fix.
I would say I'll leave it like this. The `Variable` proposal with all the descriptors looks quite hacky and is beyond my knowledge. I propose to merge as is, and @RandallPittmanOrSt can open a follow-up PR to improve things.
stubtest is failing
waiting for @RandallPittmanOrSt to complete his review
@headtr1ck what is the procedure for updating this when a new function is added to the API (for example, PR #1348)?