pyjanitor
pyjanitor copied to clipboard
[TST] Add a type-checking linter to the CI build?
Like mypy?
@ericmjl, @szuckerman
Looking into it on a branch. Starting with a mypy configuration file.
I'm not quite sure what to look for here, though, and hence what to enforce in CI. What are your thoughts?
I think the important bit is that no errors from mypy appear when:
- a function calls another function with arguments of the wrong type when types are specified
- a function returns objects that don't match its return type specification
I don't know how crazy you can get with mypy config, but I think the more strict you get with it, the greater chance it'll get hung up on non-important edge cases.
Also note that adding static type checking in no way necessitates that we should then mandate specifying types for all arguments & return values across all functions. In fact, it means we should be thoughtful not to pigeonhole a particular parameter into corresponding to a particular type when just freely allowing any type would have been much more useful for the user.
Probably just have to experiment and see what happens.
Edit: also just about to go through and add missing dataframe return types and the like
Case in point:
@pf.register_dataframe_method
@deprecated_alias(columns="column_names")
def get_dupes(df: pd.DataFrame, column_names=None) -> pd.DataFrame:
"""
Return all duplicate rows.
Functional usage example:
.. code-block:: python
df = pd.DataFrame(...)
df = get_dupes(df)
Method chaining example:
.. code-block:: python
import pandas as pd
import janitor
df = pd.DataFrame(...).get_dupes()
:param df: The pandas DataFrame object.
:param str/iterable column_names: (optional) A column name or an iterable
(list or tuple) of column names. Following pandas API, this only
considers certain columns for identifying duplicates. Defaults to using
all columns.
:returns: The duplicate rows, as a pandas DataFrame.
"""
column_names
don't really have to be iterables of strings, though they most often are. In cases like this, I often annotate like: Union[Type1, Type2, Any]
, which is therefore useful in a documentation sense, but not a type checker one.
- I also saw in a few places (11 times in functions.py to be pedantic) arguments annotated with
str
but initialized toNone
, which from the signature point of view makes it unclear if they are optional. (though to be honest the full docstring makes it clear in most cases) - I would add also that good signatures might also prove useful for autocompletion purposes.
Thanks @VPerrollaz for going through the codebase!
I also saw in a few places (11 times in functions.py to be pedantic) arguments annotated with str but initialized to None, which from the signature point of view makes it unclear if they are optional. (though to be honest the full docstring makes it clear in most cases)
Yes, they should be annotated as Optional[str]
, not just str
. Would you be open to a PR that addresses this?
I can go for a PR, but are we sure we should not do something more systematic regarding type annotations?
What I mean is that mypy
signals a lot of stuff (though I believe a lot of it is just noise). See below for the result of functions.py
analysis.
io.py:5: error: Skipping analyzing 'pandas': found module but no type hints or library stubs
io.py:7: error: Skipping analyzing '.errors': found module but no type hints or library stubs
io.py:8: error: Skipping analyzing '.utils': found module but no type hints or library stubs
io.py:29: error: Argument 1 to "len" has incompatible type "Union[str, Iterable[str]]"; expected "Sized"
functions.py:22: error: Skipping analyzing 'numpy': found module but no type hints or library stubs
functions.py:22: note: See https://mypy.readthedocs.io/en/latest/running_mypy.html#missing-imports
functions.py:23: error: Skipping analyzing 'pandas': found module but no type hints or library stubs
functions.py:24: error: Skipping analyzing 'pandas_flavor': found module but no type hints or library stubs
functions.py:25: error: Skipping analyzing 'natsort': found module but no type hints or library stubs
functions.py:26: error: Skipping analyzing 'pandas.api.types': found module but no type hints or library stubs
functions.py:27: error: Skipping analyzing 'pandas.errors': found module but no type hints or library stubs
functions.py:28: error: Skipping analyzing 'scipy.stats': found module but no type hints or library stubs
functions.py:29: error: Skipping analyzing 'sklearn.preprocessing': found module but no type hints or library stubs
functions.py:31: error: Skipping analyzing '.errors': found module but no type hints or library stubs
functions.py:32: error: Skipping analyzing '.utils': found module but no type hints or library stubs
functions.py:657: error: Item "Hashable" of "Union[Iterable[str], Any, Hashable]" has no attribute "__iter__" (not iterable)
functions.py:663: error: Argument 1 to "list" has incompatible type "Union[Iterable[str], Any, Hashable]"; expected "Iterable[Any]"
functions.py:723: error: Value of type "Iterable[Hashable]" is not indexable
functions.py:1102: error: Argument 1 to "len" has incompatible type "Union[List[str], Tuple[str], None]"; expected "Sized"
functions.py:1115: error: Argument 1 to "enumerate" has incompatible type "Union[List[str], Tuple[str], None]"; expected "Iterable[str]"
functions.py:1511: error: Argument 1 to "len" has incompatible type "Iterable[Any]"; expected "Sized"
functions.py:2049: error: Argument 1 to "range" has incompatible type "Optional[int]"; expected "int"
functions.py:2171: error: No overload variant of "round" matches argument types "Any", "float"
functions.py:2171: note: Possible overload variants:
functions.py:2171: note: def round(number: float, ndigits: None) -> int
functions.py:2171: note: def round(number: float, ndigits: int) -> float
functions.py:2171: note: <2 more similar overloads not shown, out of 6 total overloads>
functions.py:2269: error: Incompatible types in assignment (expression has type "Hashable", variable has type "Optional[str]")
functions.py:3579: error: Incompatible types in assignment (expression has type "List[Hashable]", variable has type "Union[str, Iterable[str], Hashable, None]")
functions.py:3587: error: Item "Hashable" of "Union[Any, Hashable]" has no attribute "__iter__" (not iterable)
functions.py:3807: error: Argument 1 to "len" has incompatible type "Iterable[Any]"; expected "Sized"
functions.py:3810: error: Value of type "Iterable[Any]" is not indexable
Found 27 errors in 2 files (checked 1 source file)
w.r.t. type annotations, I haven't been insistent as one might desire, because I was in the process of learning about it too. Gradually typing the codebase as we go along is I think generally accepted as "ok practice", so I'm happy to for this process to go slowly, unless there's someone who is willing to do it over a few PRs (not unlike what I did, a repetitive task for the matplotlib library in 2015).
Ok I can start with the Optional[str]
then.
But the question will end up knowing when to say good enough, since I believe mypy generates a lot of noise (for instance all the Skipping analyzing
).
Ok while preparing the PR I stumbled across a few situations where the pattern : str = None
is misleading because the argument is actually not optional. Here is the list of such functions
-
functions.expand_grid
,df_key
not modified since it is not optional : the code raises an exception ifNone
! -
engineering.convert_units
. -
finance._convert_currency
. -
finance.convert_currency
. -
finance._inflate_currency
. -
finance.inflate_currency
.
Furthermore I actually realized that mypy when encountering the argument signature : str = None
actually infers Optional[str]
, so I now wonder if we really shouldn't change the code in the opposite direction and remove = None
where the argument is not optional.
Finally I realized that the signature pattern without Optional
but None
initialization is of course not only present with str
do you want me to change every one I saw?
@VPerrollaz that's a good catch! I can't seem to think of a good reason why we would make it None
and then check to see if it is None, when in fact it should be a mandatory argument. (Unless it was set to None to make the reading semantics of the function a bit better, but even then...)
Finally I realized that the signature pattern without Optional but None initialization is of course not only present with str do you want me to change every one I saw?
Yes please. That would bring some consistency to the codebase. (I think I've been doing most of the review on my own, and hitting the big green button as long as tests are passing, because sometimes my brain is mush -- which is why I am going to start asking for two reviews before PR merge!)
Btw, would you like to join the dev team? The consistency in contributions is something I very much appreciate.
First of all I'm definitely interested in joining, but given how little experience I have with such project I'm not sure how much help I would be so let me know what you think.
Back to the signatures, as I see it there is a need for at least three PR.
- One for adding Optional where it should be.
- One for correcting a few other signature inconsistencies I have seen. (For instance in some places Iterable should become Sequence when use an index on the argument)
- Finally one to remove the
None
defaults where the argument is not optional. I'm a bit worried here actually, there might be a need to add some test to guarantee that no harm is done. (So I might have to split it further)
For the second type of PR I have a slight problem which is that the module typing
is rapidly evolving (for the better) with the python version, so I'm not entirely sure what to go for.
For instance in the function.clean_names
the argument strip_underscores
was first typed str
, is now Optional[str]
but looking at the documentation the best type would actually be
Literal[None, 'left', 'l', 'right', 'r', 'both', True]
but of course the problem is that this is a python 3.8 feature so I'am a bit at a loss. We clearly shouldn't use it at this point but should I leave a TODO comment?
Also looking the the str
, Hashable
and other further types used for column_names, I wonder if it would not be interesting to introduce a custom TypeVar or NewType with a well chosen name.
First of all I'm definitely interested in joining, but given how little experience I have with such project I'm not sure how much help I would be so let me know what you think.
It's as much or as little as you'd like. There's an open issue for this #672, where you can see the discussion going on. I'd love for dev team members to be able to have the benefit of being a recognized open source contributor + leader of some kind.
We clearly shouldn't use it at this point but should I leave a TODO comment?
Yes, that sounds like a sane thing to do. :smile:
Also looking the the str, Hashable and other further types used for column_names, I wonder if it would not be interesting to introduce a custom TypeVar or NewType with a well chosen name.
Could be good! If you can come up with a simple new typing type, then that'd be good. I'd also be cognizant of keeping the codebase simple to contribute to! (Keeps the project friendly for newcomers.)
I have a branch (mypy-configuration) in my fork where I created
- a mypy.ini configuration file
- a make target (typecheck) for automatic typechecking
- multiple commits to resolve all the "errors" signaled by mypy.
I have to say except for all the
Skipping analyzing 'xxxxx': found module but no type hints or library stubs
(which I "solved" through the config file) all the other stuff were rather reasonable.
It is clearly totally experimental (in particular I think mypy is very permissive at this point) but it might ground the discussion more concretely.
@VPerrollaz this all sounds great. Thank you for taking the initiative! I definitely wouldn't have been able to handle it on my own.
Would you be open to putting in a PR? If you do so, I'll PR into your branch additional changes to make to the CI pipelines to make sure they run.
Of course but to be clear I put a lot of "extra stuff" in it (for resolving the typing errors) so we will most likely have to cherry pick what we feel comfortable using. And the configuration of mypy can clearly be further refined.
@VPerrollaz just wanted to tighten up loose ends here, and I'd love to have your input here: What's do you think should happen next to close out this issue? I've a preference that we move things forward with a PR, and hash out the details on there, as reviewing an entire branch is not as easy as seeing the diff.
Just to be clear do you want a PR in the dev branch or another specific one? In the second case I'm all for it, in the first I'm not so sure for the following reasons :
- I had to do "real" code change (as opposed to just signature manipulation) to solve some typing errors. They are not ground breaking but I have tried to isolate those in specific commit so that they can be further checked.
- the mypy configuration should be further refined and we should properly spell out our intent for using mypy. I'm thinking for instance of the use of Any which is in some cases the easy way out to resolve typing error but not so helpful.
@VPerrollaz got it, I see the concerns you have there.
Regardless, I think a PR to dev
would be great! Here’s how I think we can make it work.
- We can start with a PR that does not involve code. i.e. first write out the intent for using mypy in a docs page. (Feel free to pick an appropriate place, I don’t have a good idea on this right now.) This one should be closed as fast as soon as it’s ready. A well-written doc that explains the why + how in a way that newcomers can understand would be the best! We can iterate on it over a PR.
- We can then do a second PR that adds the necessary config files to pass the code through mypy checks.
- Following that, we can slowly hack away at satisfying mypy. I know that mypy will probably catch a lot of issues in the codebase. A long-lived branch that merges often from
dev
is appropriate. - Once all of the issues that mypy finds are fixed, we can merge.
- Following that, we can slowly hack away at satisfying mypy. I know that mypy will probably catch a lot of issues in the codebase. A long-lived branch that merges often from
- If there are more PRs that come in with problems caught, we can ask the PR authors to fix up the relevant pieces along the way, like just sneaking in that piece into the PR.
What do you think? Feel free to correct me if you see issues with the plan!
I may have missed a bit while skimming through, but is the goal to only use this in CI, or to also add to the make
file and/or pre-commit
?
@hectormz I think the very first goal would be to get mypy checks executed on CI.
@ericmjl I think this plan looks very promising, in particular it should minimize the possibility of bad surprises. As for the reasons for using mypy, so far what I see
- some obvious bug catches between function calls (though that is related also to our policy vis-à-vis the use of Any)
- some dubious logic are caught my tracing the variables inside a function
- some kind of documentation/autocompletion information which are "garanteed" relevant thanks to the CI I'm hesitating on the amount of details (or even examples) to provide in the documentation in addition to the place to put them. If you have any thoughts...
@hectormz I think that if we put mypy in CI, adding a make target seems reasonable. I'm not so sure for pre-commit though, it seems more PR related to me.