pvlib-python
use array_like instead of numeric/array-like
The pvlib parameters/returns type documentation style is:
- numeric : scalar, np.array, pd.Series. Typically int or float dtype.
- array-like : np.array, pd.Series. Typically int or float dtype.
However, Numpy defines array_like as "Any sequence that can be interpreted as an ndarray. This includes nested lists, tuples, scalars and existing arrays."
I think we should follow the numpy standard. We lose the implied specification for whether or not a function works with scalar input, but this seems to be a source of confusion anyways. Instead, the docstrings for functions that require array input with length > 1 could explicitly state this.
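For instance, an explicit docstring for an array-only function might read something like the sketch below (the function and parameter names are purely illustrative, not existing pvlib API):

```python
def fit_temperature_model(temp_cell, power):
    """Hypothetical fitting routine, shown only to illustrate the wording.

    Parameters
    ----------
    temp_cell : array-like
        Cell temperatures [C]. Requires at least two distinct values;
        scalar input is not supported.
    power : array-like
        Measured power [W], same length as ``temp_cell``.
    """
```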
I think that using `array-like` in a different way from numpy is probably not such a good idea, but then perhaps we should just not use `array-like` at all? I have the impression that there are lots of pvlib functions that support only a subset of `array-like`.
I would prefer pvlib documentation to be more explicit, which might mean putting more than one word, e.g.:
- scalar
- list
- scalar or list
Hey, @adriesse @wholmgren I am new to this library and would like to contribute. I have experience with documentation in Sphinx. Kindly guide me in solving this issue. Also, should I start working on this issue or on issue #476?
@percygautam thanks for your interest! Maybe you could run a couple of find/replace-all commands.
Note the contributing documentation contains some general suggestions for getting started with git/github.
Thanks @wholmgren , I'll start exploring the documentation and would start working on issues.
There are performance and syntactic differences between an ndarray of shape (1, 1, ...) and a python scalar, so I don't think the terms are interchangeable if numeric includes true native scalars:
```python
In [1]: import numpy as np

In [2]: a = np.array(1.23)

In [3]: a
Out[3]: array(1.23)

In [4]: a.shape
Out[4]: ()

In [5]: a.size
Out[5]: 1

In [6]: np.isscalar(a)
Out[6]: False

In [7]: b = 2.34

In [8]: c = np.array([3.45])

In [9]: np.isscalar(b)
Out[9]: True

In [10]: np.isscalar(c)
Out[10]: False

In [11]: b.shape
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-11-a8244287b6af> in <module>
----> 1 b.shape

AttributeError: 'float' object has no attribute 'shape'

In [12]: c.shape
Out[12]: (1,)
```
The numpy definition of `array_like` is now:

> Any scalar or sequence that can be interpreted as an ndarray. In addition to ndarrays and scalars this category includes lists (possibly nested and with different element types) and tuples. Any argument accepted by numpy.array is array_like.
pvlib's definition of `numeric` is a subset of numpy's definition of `array_like`. The major difference I see is that `array_like` includes lists, whereas `numeric` does not. They both include native scalars. pvlib's definition of `array-like` excludes scalars.
I just did a quick search and there seem to be only 4 occurrences of `array_like` in the code vs. 22 occurrences of `array-like`. Note the difference between dash and underscore. Presumably the former is the numpy one and the latter the pvlib one.
As @adriesse pointed out in an earlier comment, it's confusing to use `array_like` differently to how numpy does it. I think it's actually even more confusing given the subtle and almost undetectable (by a human) difference between the underscore version and the dash version.

Some further investigation reveals that all 4 usages of the underscore version `array_like` are in the same function, namely `fit_sandia`. I'm not too familiar with Python's array indexing, but looking into that function, there's a lot of array indexing of the parameters that are declared as `array_like`. Since the numpy definition includes scalars, as @wholmgren pointed out in the last comment, I wonder whether there will be an error if someone passes a scalar to that function, given that the function will try to index the scalar. For example, if we entered a value of `7` for `dc_voltage`, how would this line behave?
```python
v_d = np.array(
    [dc_voltage[dc_voltage_level == 'Vmin'].mean(),
     dc_voltage[dc_voltage_level == 'Vnom'].mean(),
     dc_voltage[dc_voltage_level == 'Vmax'].mean()])
```
If that would cause an error, then maybe the types of those variables should be the pvlib `array-like` rather than the numpy `array_like`?
Actually I just went ahead and tried it. Yes, there is an error if you input all scalars: `fit_sandia(7, 7, 7, 7, 7, 7)`. The error is at this line:

```python
[dc_voltage[dc_voltage_level == 'Vmin'].mean(),
```

And it says:

```
TypeError: 'int' object is not subscriptable
```

Which makes sense. So I'm thinking those 4 (and only) usages of the underscore (numpy) version `array_like` can at least be refactored to the dash (pvlib) version `array-like`.

EDIT: This still doesn't solve this issue, but at least it normalises all usages so that we don't have both `array_like` and `array-like` in play. So it might be a good start to just do that refactor now.
@ColmBhandal thanks for this report. As author of that function, I am 99.99% certain that I was not mindful of the distinction between `array_like` and `array-like`; most likely, it was a typo on my part.

That function should not work with scalar inputs. There probably is a minimum length of each subset `dc_voltage[dc_voltage_level == <value>]` in order to get useful output, but I would distinguish that minimum from requiring `array-like` data. If a user passes arrays of length 1, the function should run to completion, but the output will be garbage.
Good idea to take a small bite out of this one @ColmBhandal !
Perhaps numpy is not the best example to follow. I suspect pvlib has a lot more variety in the types of function arguments overall, and I also suspect pvlib functions are far from uniform in their support for potentially equivalent data structures. So I will stick with my initial comment that it is better for pvlib to be explicit in the documentation as to the data types and data structures required or supported by each function.
We could supplement this with programming (not documentation) guidelines on what range of equivalent structures pvlib functions should strive to support. Personally I often throw in an np.asanyarray() in the beginning of functions, which saves me a lot of pain; but it's not always the right thing to do. Come to think of it, I may be following examples from numpy on that technique...
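A minimal sketch of that technique, with an invented helper name (not existing pvlib API):

```python
import numpy as np

def clearness_ratio(ghi, ghi_clear):
    # hypothetical helper; np.asanyarray converts scalars, lists and tuples
    # to ndarrays and passes existing ndarrays (and subclasses) through
    ghi = np.asanyarray(ghi)
    ghi_clear = np.asanyarray(ghi_clear)
    return ghi / ghi_clear

clearness_ratio(480.0, 600.0)            # scalar input works
clearness_ratio([480, 300], [600, 500])  # list input works too
```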
@adriesse I tend to agree with you. I think the ideal short-to-medium term solution is:

- Very short term: merge my little PR above
- Short term: analyse all usages of (the now uniform) `array-like` and figure out what it really means (there are only about 26 of them, I think, so that's doable)
- Come up with a new generic descriptor, or multiple new generic descriptors, to suit each of the usages of `array-like`, and this time choose something that doesn't look anything like numpy's `array_like` (so we avoid typos like that of @cwhanse in future)
- Refactor the 26 or so remaining occurrences of `array-like` to use the new descriptors

That's an achievable short/medium term solution, as long as we choose descriptors that don't clash with numpy. We can then leave the discussion of whether we want to use numpy descriptors to a separate issue / future discussion and not let it get in the way of resolving this issue.
FWIW: I've been tripped up in the past by numpy's own scalar type. numpy also simplifies greatly from not having to support pandas.
I have been thinking about this and I wonder if the long term solution is static duck typing i.e. structural types. Minimally, static duck typing will provide a much clearer documentation of the expected type via a type annotation vs. a documentation comment which is subject to typos.
Down the line, static duck typing can also be coupled with static analysers which can be run as part of the standard build steps to ensure all statically duck typed portions of the code are used in a type safe way.
The problem with Python's standard dynamic duck typing is that you don't know that something is not a duck until you ask it to quack, at runtime, and it says "sorry I don't know how", possibly deep into the function. Of course, that's what these documentation descriptors try to tackle. But it's up to a human to write and interpret them. There's no machine support (or is there?)
The nice thing about static duck typing is that you still don't have to nominally type everything you put into a function - which violates the Python ethos. But by statically declaring the structural type (protocol) that a function accepts, you define a clear contract to users. So it's a good balance between strictly nominally typed languages like Java and untyped languages like Python.
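For concreteness, a protocol-based annotation might look something like the sketch below (`typing.Protocol` needs Python 3.8+ or the `typing_extensions` backport; `SupportsMean` and the function name are invented, not a proposal for pvlib's API):

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class SupportsMean(Protocol):
    """Structural type: anything exposing a .mean() method matches."""
    def mean(self) -> float: ...

def average_irradiance(ghi: SupportsMean) -> float:
    # numpy arrays and pandas Series both satisfy the protocol without
    # inheriting from anything; a plain list does not, and a static
    # checker such as mypy would flag passing one
    return ghi.mean()
```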
> Minimally, static duck typing will provide a much clearer documentation of the expected type via a type annotation vs. a documentation comment which is subject to typos.
>
> But by statically declaring the structural type (protocol) that a function accepts, you define a clear contract to users.
I confess I have only briefly skimmed PEP 544, but I don't see how static duck typing is clearer to the user than docstring comments are. From the user's perspective, isn't it essentially just moving `numeric` from the docstring to the signature? It might enable users to use static analysis tools to validate their code that uses pvlib, but I'm skeptical more than a tiny fraction of users would be interested in that. Maybe a concrete example would help me understand why static duck typing is better :)
@kanderso-nrel Definitely an example, or many, would be needed to justify such a significant introduction to the code base. I was holding off on the above comment for the past few days as I wanted a solid example to justify it, but I may or may not have time, so figured I'd just throw the idea out there. Certainly if I have time I'd be passionate about exploring this opportunity more, with concrete code examples. And if I do I'll post my findings here :)
Sorry for commenting glibly, but it seems to me the heart of this issue comes down to 2 concerns:

- pvlib does not include native python sequences like lists, tuples, and sets in either "array-like" or "numeric", while numpy does
- pvlib documentation style doesn't handle scalars consistently, and possibly needs to define what a scalar is: is it a native python numeric type like `int` or `float` and the numpy equivalents `np.int32` or `np.float64`, or does it also include ndarrays of size 0 or 1 of any dimension, e.g. shapes `(0,)`, `(1,)` or `(1, 1, ...)`?
**native python sequences**

I think this only becomes an issue if a user tries to invoke a numpy or pandas attribute on a sequence and they get an attribute error. We should be explicit in the documentation if a function requires a pandas series or dataframe, a numpy array, or if a sequence will do. I believe pandas series and dataframes have a lot of the same attributes as numpy arrays, and many numpy functions will accept a sequence, so we may need to be diligent when checking contributions to see if they will raise an attribute error with a native python sequence. Otherwise, I think that in general it's okay to tell users that pvlib only uses numpy arrays or pandas series and dataframes, and not native python sequences, and just be explicit about that in the docs. I don't know what word we should use to describe such a class of data types. I guess we can't use "array-like" or `array_like`, because numpy has already defined this to include scalars, if we plan to leave out lists, tuples, etc. However, one easy yet tedious solution is to invoke `np.asarray()`, which will copy the input to a numpy array (replacing scalars and sequences) or do nothing if the input is already an array. Then we can simply leave the type definitions, docs, comments, and hints as the NumPy glossary version. QED
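As an illustration of what `np.asarray` does with the input types under discussion (an interactive sketch, not pvlib code):

```python
import numpy as np
import pandas as pd

np.asarray(7)                      # array(7), a 0-d array
np.asarray([1.0, 2.0])             # array([1., 2.])
np.asarray((1, 2, 3))              # array([1, 2, 3])
a = np.arange(3)
np.asarray(a) is a                 # True; an existing ndarray is not copied
np.asarray(pd.Series([1.0, 2.0]))  # the underlying ndarray; the index is lost
```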
**handling scalar ambiguity**

Using `np.asarray` may resolve this issue as well, but I think this is more of a documentation issue than a programmatic one. We just need to be clear if any length of input, including 1, is okay. For input that can only be a scalar, I think the documentation should say which types it should be. I think we have been misusing "numeric", which according to the Python builtin types definition covers only `int`, `float`, and `complex`, and is by definition always scalar.
So here are my recommendations:

- change our definition of `array_like` to match numpy's
- change our definition of `numeric` to match python's
- use `np.asarray` to assimilate input to `array_like` for all functions
- stress in the documentation that this is a scientific library that deals mostly with arrays and timeseries, so users will get better mileage by using np.arrays or pandas series and dataframes

Now, of course, this doesn't handle xarray or other "array_like" datatypes, but I think that's out of scope for this issue.
I'm coming around to the view that it may be better to just list the specific types in the docstring, e.g.,

`x : scalar, np.array, or pd.Series`

rather than

`x : numeric`

where `numeric` is defined by a part of the pvlib documentation that a user will likely not have read. Similarly, I don't anticipate pvlib users to know to read numpy's documentation to know what `array_like` means. Of course, my example opens the question: What is `scalar`? And perhaps: What is `np`?
The ambiguity that arises from discussions like these might be eliminated by structural types aka static duck types, because their semantics are rigorously defined. I agree that it's a safer approach not to assume that users have read the numpy docs when interpreting type declarations in the docs. Structural types may be clearer. I've briefly seen them in action in the Gosu language and I thought they were really cool. I'd love to investigate if they'd be applicable to this problem in Python.
@mikofski I'll be even more glib: "explicit is better than implicit"!
I agree with @cwhanse's comment. The only change I'd make is to use `float` instead of `scalar`, except when `int` is required. That avoids the question of python vs numpy scalar. I think the vast majority of people will interpret `float` as a python `float` and will blindly use `int` as well. I can't recall any case when `int` is actually required, but maybe there are a few.
I'm also all for adding type hints to pvlib but I really think that's a separate conversation because they are not easily integrated with documentation at this point in time.
@wholmgren For what it's worth, I think I'm with you that type hints are probably a separate, much larger thing. For this issue to get closed as a practical matter, I think just merging my PR (after I add the tests) and then analysing the 26 remaining usages of `array-like` and updating the docs appropriately should do it. Of course, conversations are likely to follow on what those updates will be.
Questions:

- Is there good reason/established precedent that pvlib's `numeric` and `array-like` include `np.array` but NOT `pd.DataFrame`? (see here)
- Is there any programming style/practice/expectation/contract around a function/method's outputs, e.g., returning a `pd.Series` when a `pd.Series` was provided? (in cases where this is applicable, and where mixed input types could require some arbitrary convention for the output type)
Readability counts. Having to explicitly spell out every type is fine, but possibly ugly. I don't think it's unreasonable to expect pvlib users to be fluent with numpy. Leveraging the numpy API is a common practice among python packages, and I think we can benefit from it too. IMO `array_like` as defined by numpy is just fine.

On the other hand, I think `numeric` as pvlib defines it is bad. It's defined differently by Python, where it has nothing to do with sequences, arrays, or dataframes.

By that same token, I don't think we should use "scalar" as a datatype. I think any function that's meant to be used with arrays will work fine with an array of length 1, and any function that requires an `int` or `float` should just say that explicitly.

We should consider whether we want to wrap inputs meant to be arrays with `np.asarray`, which could solve many of these problems.
Another perspective is that having to explicitly wrap inputs into arrays is ugly. It delegates type checking to the function's computational part rather than its type signature, so it sort of muddies the responsibility of the function body, which is to compute rather than type check. I'm relatively inexperienced in Python, and from what I'm reading it seems like this is common enough practice, but I think that's just a trade-off of having a dynamically duck typed language.

But yeah, the flip side is that types can get pretty ugly, especially when they get as complicated as @campanelli-sunpower's second bullet point, e.g. different input types guaranteeing different output types.
> IMO array_like as defined by numpy is just fine

I'll disagree on this point. Rarely do we design and test pvlib functions to accept any data type included in numpy's `array_like`, e.g., `list`, `tuple`, `ndarray`s with several dimensions. Using `array_like` and pointing to numpy implies that pvlib will accept all of these types.

There are many functions which users will want to work with either `float` or array input. At the risk of sidetracking our discussion, I find numpy to be a bit schizophrenic with length 1 arrays, and that's why I favor explicitly adding something like `float`. To make a 1d array from a float, the (to me obvious) `np.array(5)` does not return what is expected, while `np.atleast_1d` does. There could be a better way to do it.
```python
len(np.array(5))       # error
len(np.atleast_1d(5))  # works
```
I think `len` works consistently for either python or numpy:

```python
len(5)    # TypeError: object of type 'int' has no len()
len([5])  # 1
```
However, numpy arrays have `size`, `shape`, and `ndim` attributes which might be more useful than calling `len`.
```python
import numpy as np
a = np.array(5)
b = np.atleast_1d(5)
a.size   # 1
b.size   # 1
a.shape  # ()
b.shape  # (1,)
a.ndim   # 0
b.ndim   # 1
```
> I'm relatively inexperienced in Python and from what I'm reading it seems like this is common enough practice though, but I think that's just a trade off with having a dynamically duck typed language.
We're all learning all the time here, so your contributions have been really helpful. For example, I'd never heard of PEP 544 or structural typing, so thanks for that! In my experience Python does differ significantly from other languages like Java and C# in that:

- duck typing is preferred over type checking; this means that we really don't care if it's the correct type, because if it works then it's fine
- we let exceptions get thrown by the calling function rather than wrapping them and handling every exception at the top level; this means that users may have to read through the stack trace to understand what happened, but when they do find the culprit, it's usually more informative than a wrapper message

Both of these features are embodied in a pythonic idiom called EAFP, or "easier to ask forgiveness than permission". The opposite is "look before you leap", or LBYL. Microsoft developer Brett Cannon explains this in his post: EAFP versus LBYL.

I think these reasons are what draw analysts to python in the first place, but they may cause headaches for software developers and probably cause performance penalties compared to more statically typed languages.
I'm curious to see what's going on in other packages. Surely we're not the only ones to have to handle this. What's going on in statsmodels, astropy, scikit-learn, etc.? I'll have to do some research.
> duck typing is preferred over type-checking, this means that we really don't care if it's the correct type, because if it works then it's fine
Yep - I figured this was the case reading bits here and there about Python. Typing is quite a subtle issue though and there's a whole spectrum of conventions to choose from. In my mind I've at least identified a series of points along the spectrum:
- On the one hand, you have dynamic duck typing - which seems to be the Python idiom. Here, the function doesn't declare the type it accepts - other than maybe the doc string. The type that the function accepts is implicit in its code. If you put something in and it breaks, then that's because you put the wrong thing into the function. And it'll break exactly at the point where the function expected the type to have some property/method that it didn't have. I guess you could say that's an EAFP approach, in a broad sense, although from what I'm reading EAFP is more about using try-catch blocks instead of if checks.
- Next you have static duck typing (PEP 544). Here, the type is explicit in the function signature rather than implicit in its body. So I guess the difference here is that it's introducing a sort of LBYL mindframe to the typing of the function itself. What struck me though is the doc strings. Thinking about it, I'm not so sure it's an issue of LBYL vs. EAFP, which seems to be more about coding practices within the function body. To me, it's more a matter of: if we're going to go to all the effort of documenting it, why not just type it? I guess the answer to "why not" is again that the types may not be as easily humanly readable, and may be cumbersome to develop/maintain. And I think one of the downsides mentioned in PEP 544 was that it can be cumbersome to write implementations for structural types with a lot of members, as there is no way to make members optional, leading to a lot of boilerplate code. And it's yet to be shown concretely whether structural types are even applicable to the problem at hand, e.g. are they able somehow to express the idea of `numeric` rigorously? On the flip side, types in the doc strings are subject to human misinterpretation and typos.
- Then you have nominal types - now we're squarely in the Java/C# world. Here, not only does the function accepting the types need to declare them, but whoever produces the types must actually construct an instance of the exact type the function expects - not just something that looks like it. Of course, you can use generics to do templates that work across multiple types, and you can always declare stuff as `Object` and cast - but generics have their limitations, and unsafe casting seems to be non-idiomatic in this world; if I see it I usually frown upon it as unsafe. As someone who's been coding in this world for most of my experience, I actually find the restriction to nominal typing often quite annoying. I have no problem with a function being upfront about what it expects, but the fact that the inputs have to be explicitly constructed from some statically-declared subtype is restrictive. You need to start writing wrapper types if you don't have control over the types you're accepting from other libraries, for example.
Then to throw a spanner in the works, you have @campanelli-sunpower's mention of type contracts. I can't say that this lies on a strictly linear spectrum from non-typed to nominally typed, but I suppose on the broader spectrum of weakly vs. strongly typed it's definitely on the strongly typed side of things. Now you start talking not just about what type a function accepts and what type it outputs, but you start saying things like: if the function accepts an `A` then it returns a `B`, but if it gets a `C` then it'll return a `D`. I'm a big fan of strong typing, but I'm not sure how practical that would be, and if the types being used in the contracts are nominal, then I think it violates the Python ethos of duck typing and suffers the same drawbacks of nominal typing that I mentioned above. Maybe in safety-critical applications, like software used at NASA or something, you'd invest in something like type contracts to really guarantee safety beyond just unit testing. But I think it would slow down development speed. I may be speaking out of ignorance here as I don't know a lot about type contracts in Python. I did try them once in Java and didn't really find them powerful enough to be useful.
Anyway, I'm excited to see where the broader discussion goes on this. In the meantime, I'll have a look at my open PR and try to progress it somewhat.
> Rarely do we design and test pvlib functions to accept any data type included in numpy's array_like, e.g., list, tuple, ndarrays with several dimensions. Using array_like and pointing to numpy implies that pvlib will accept all of these types.
If we wrap inputs we expect to be `array_like` with `np.asarray()`, I believe that may solve this problem. For example:
```python
import numpy as np

def myfun(a, b):
    """my function

    Parameters
    ----------
    a : float
        A scalar float; an int would also work fine
    b : array_like
        Could be a scalar, sequence, or array of any size
    """
    b = np.asarray(b)
```
Also, most NumPy functions do this conversion for us already, so we might not even need it.
Rather `np.asarray` or `np.asanyarray()`?
> if we're going to go to all the effort of documenting it, why not just type it?
My understanding is that neither PEP544 (Protocols) nor PEP484 (Type Hints) actually does any type checking unless you invoke an external tool like mypy. There is nothing in python itself that, at runtime or any other time, actually checks that inputs match the type annotations. The runtime semantics are duck typing regardless of whatever type fanciness we put in the signature. So to me, "typing it" is just another form of documentation, not an alternative to it. Sorry for the noise if everyone already knows this (or if I'm misinformed, very possible since I don't use these features myself!).
Also I think PEP544 is python 3.8+, so it is not available to us in the near term anyway (3.7 EOL is June 2023).
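A tiny illustration of that point, using a made-up function name: the annotation below only documents intent, and Python happily runs the call with an ndarray; only an external checker such as mypy would complain.

```python
import numpy as np

def celsius_to_kelvin(temp_air: float) -> float:
    # hypothetical function; the type hint is not enforced at runtime
    return temp_air + 273.15

celsius_to_kelvin(np.array([20.0, 25.0]))  # runs fine, returns an ndarray
```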
> If we wrap inputs we expect to be array_like with np.asarray() I believe that may solve this problem.
If this turns out to be the best solution, so be it, but to @campanelli-sunpower's point I do expect a Series out if I put a Series in, which I think means basically every function will start and end with typing boilerplate. Or maybe we could be clever with a decorator. I see value in being consistent with other pydata packages, but does `get_total_irradiance` really need to support `ghi` typed as `tuple`?
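One possible shape of that decorator idea, sketched with invented names and no claim that this is how pvlib should do it:

```python
import functools
import numpy as np
import pandas as pd

def series_preserving(func):
    """Compute on ndarrays, but return a Series (with the original index)
    when the first argument came in as a Series."""
    @functools.wraps(func)
    def wrapper(data, *args, **kwargs):
        index = data.index if isinstance(data, pd.Series) else None
        result = func(np.asarray(data), *args, **kwargs)
        if index is not None:
            result = pd.Series(result, index=index)
        return result
    return wrapper

@series_preserving
def scale_ghi(ghi, factor=1.1):
    # toy stand-in for a real irradiance transform
    return ghi * factor

scale_ghi(pd.Series([100.0, 200.0]))  # returns a pd.Series
scale_ghi([100.0, 200.0])             # returns an np.ndarray
```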
@kanderso-nrel are you okay with `np.asarray` and `array_like`?
I think I'm against it? If we (1) don't want to be using `array_like` differently from numpy, AND (2) want output type to match input type, then I think any function claiming to support `array_like` would have to jump through a bunch of hoops (convert inputs to array, convert back to Series/tuple/whatever before returning); that seems inefficient and unnecessary to me.

I'd rather we come up with our own abstract types (e.g. our current `array-like` but with a different name) or just spell out `np.array, pd.Series` in docstrings instead of expanding our API to support every type that numpy's `array_like` does.