
use array_like instead of numeric/array-like

Open wholmgren opened this issue 5 years ago • 41 comments

The pvlib parameters/returns type documentation style is:

  • numeric : scalar, np.array, pd.Series. Typically int or float dtype.
  • array-like : np.array, pd.Series. Typically int or float dtype.

However, Numpy defines array_like as "Any sequence that can be interpreted as an ndarray. This includes nested lists, tuples, scalars and existing arrays."

I think we should follow the numpy standard. We lose the implied specification for whether or not a function works with scalar input, but this seems to be a source of confusion anyway. Instead, the docstrings for functions that require array input with length > 1 could explicitly state this.

wholmgren avatar Aug 14 '19 15:08 wholmgren

I think that using array-like in a different way from numpy is probably not such a good idea, but then perhaps we should just not use array-like at all? I have the impression that there are lots of pvlib functions that support only a subset of array-like.

I would prefer pvlib documentation to be more explicit, which might mean putting more than one word, e.g.:

  • scalar
  • list
  • scalar or list

adriesse avatar Dec 06 '19 14:12 adriesse

Hey @adriesse @wholmgren, I am new to this library and would like to contribute. I have experience with documentation in sphinx. Kindly guide me in solving this issue. Also, please advise whether I should start working on this issue or on issue #476?

percygautam avatar Jan 13 '20 14:01 percygautam

@percygautam thanks for your interest! Maybe you could run a couple of find/replace-all commands.

Note the contributing documentation contains some general suggestions for getting started with git/github.

wholmgren avatar Jan 13 '20 17:01 wholmgren

Thanks @wholmgren, I'll start exploring the documentation and then start working on the issues.

percygautam avatar Jan 13 '20 22:01 percygautam

There are performance and syntactic differences between an ndarray of shape (1, 1, ...) and a python scalar, so I don't think the terms are interchangeable if numeric includes true native scalars.

In [1]: import numpy as np

In [2]: a = np.array(1.23)

In [3]: a
Out[3]: array(1.23)

In [4]: a.shape
Out[4]: ()

In [5]: a.size
Out[5]: 1

In [6]: np.isscalar(a)
Out[6]: False

In [7]: b = 2.34

In [8]: c = np.array([3.45])

In [9]: np.isscalar(b)
Out[9]: True

In [10]: np.isscalar(c)
Out[10]: False

In [11]: b.shape
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-11-a8244287b6af> in <module>
----> 1 b.shape

AttributeError: 'float' object has no attribute 'shape'

In [12]: c.shape
Out[12]: (1,)

mikofski avatar May 14 '21 04:05 mikofski

The numpy definition of array_like is now:

Any scalar or sequence that can be interpreted as an ndarray. In addition to ndarrays and scalars this category includes lists (possibly nested and with different element types) and tuples. Any argument accepted by numpy.array is array_like.

pvlib's definition of numeric is a subset of numpy's definition of array_like. The major difference I see is that array_like includes lists, whereas numeric does not. They both include native scalars. pvlib's definition of array-like excludes scalars.
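To make the breadth of numpy's definition concrete, anything accepted by `np.asarray` qualifies as array_like. A quick illustrative sketch:

```python
import numpy as np

# All of these are array_like under numpy's definition:
# a native scalar, a (nested) list of tuples, and a plain list.
print(np.asarray(1.23).shape)               # () -- 0-d array from a scalar
print(np.asarray([1, 2, 3]).shape)          # (3,)
print(np.asarray([(1, 2), (3, 4)]).shape)   # (2, 2) -- nested sequences nest into dimensions

# pvlib's "numeric" also admits scalars, np.array, and pd.Series,
# but (unlike numpy's array_like) not plain lists or tuples.
```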

wholmgren avatar May 14 '21 15:05 wholmgren

I just did a quick search and there seem to be only 4 occurrences of array_like in the code vs. 22 occurrences of array-like. Note the difference between dash and underscore. Presumably the former is the numpy one and the latter the pvlib one.

As @adriesse pointed out in an earlier comment, it's confusing to use array_like differently to how numpy does it. I think it's actually even more confusing given the subtle and almost undetectable (by a human) difference between the underscore version and the dash version.

Some further investigation reveals that all 4 usages of the underscore version array_like are in the same function, namely fit_sandia. I'm not too familiar with Python's array indexing, but looking into that function, there's a lot of array indexing of the parameters that are declared as array_like. Since the numpy definition includes scalars, as @wholmgren pointed out in the last comment, I wonder whether there would be an error if someone passed a scalar to that function, given that the function will try to index the scalar. For example, if we entered a value of 7 for dc_voltage, how would this line behave?

v_d = np.array(
        [dc_voltage[dc_voltage_level == 'Vmin'].mean(),
         dc_voltage[dc_voltage_level == 'Vnom'].mean(),
         dc_voltage[dc_voltage_level == 'Vmax'].mean()])

If that would cause an error, then maybe the types of those variables should be the PVLib array-like vs the numpy array_like?


Actually I just went ahead and tried it. Yes there seems to be an error if you input all scalars:

fit_sandia(7, 7, 7, 7, 7, 7)

The error is at this line:

[dc_voltage[dc_voltage_level == 'Vmin'].mean(),

And it says:

TypeError: 'int' object is not subscriptable

Which makes sense. So I'm thinking those four usages (the only ones) of the underscore (numpy) version array_like can at least be refactored to the dash (pvlib) version array-like.


EDIT: This still doesn't solve this issue, but at least it normalises all usages so that we don't have both array_like and array-like in play. So it might be a good start to just do that refactor now.

ColmBhandal avatar Jan 25 '22 19:01 ColmBhandal

@ColmBhandal thanks for this report.

As author of that function, I am 99.99% certain that I was not mindful of the distinction between array_like and array-like; most likely, it was a typo on my part.

That function should not work with scalar inputs. There probably is a minimum length of each subset dc_voltage[dc_voltage_level == <value>] in order to get useful output, but I would distinguish that minimum from requiring array-like data. If a user passes arrays of length 1, the function should run to completion, but the output will be garbage.

cwhanse avatar Jan 25 '22 19:01 cwhanse

Good idea to take a small bite out of this one @ColmBhandal !

Perhaps numpy is not the best example to follow. I suspect pvlib has a lot more variety in the types of function arguments overall, and I also suspect pvlib functions are far from uniform in their support for potentially equivalent data structures. So I will stick with my initial comment that it is better for pvlib to be explicit in the documentation as to the data types and data structures required or supported by each function.

We could supplement this with programming (not documentation) guidelines on what range of equivalent structures pvlib functions should strive to support. Personally I often throw in an np.asanyarray() in the beginning of functions, which saves me a lot of pain; but it's not always the right thing to do. Come to think of it, I may be following examples from numpy on that technique...
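To illustrate the `np.asanyarray()` versus `np.asarray()` distinction raised above (using `np.ma.MaskedArray`, an ndarray subclass, as the example): `np.asarray` always coerces to a base ndarray, while `np.asanyarray` passes ndarray subclasses through untouched. A sketch:

```python
import numpy as np

masked = np.ma.masked_array([1.0, 2.0, 3.0], mask=[False, True, False])

# np.asarray drops the subclass (and the mask information with it)
plain = np.asarray(masked)
print(type(plain))   # <class 'numpy.ndarray'>

# np.asanyarray passes ndarray subclasses through unchanged
kept = np.asanyarray(masked)
print(type(kept))    # MaskedArray -- subclass preserved

# Note: a pd.Series is *not* an ndarray subclass, so both functions
# convert it to a plain ndarray -- neither preserves the index.
```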

adriesse avatar Jan 26 '22 09:01 adriesse

@adriesse I tend to agree with you. I think the ideal short-medium term solution is:

  1. Very short term: merge my little PR above
  2. Short term: Analyse all usages of (the now uniform) array-like and figure out what it really means (there are only like 26 of them I think, so that's doable)
  3. Come up with a new generic descriptor, or multiple new generic descriptors, to suit each of the usages of array-like, and this time choose something that doesn't look anything like numpy's array_like (so we avoid typos like that of @cwhanse in future)
  4. Refactor the 26 or so remaining occurrences of array-like to use the new descriptors

That's an achievable short/medium term solution, as long as we choose descriptors that don't clash with numpy. We can then leave the discussion of whether we want to use numpy descriptors to a separate issue / future discussion and not let it get in the way of resolving this issue.

ColmBhandal avatar Jan 26 '22 09:01 ColmBhandal

FWIW: I've been tripped up in the past by numpy's own scalar type. numpy also simplifies greatly from not having to support pandas.

campanelli-sunpower avatar Jan 26 '22 17:01 campanelli-sunpower

I have been thinking about this and I wonder if the long term solution is static duck typing i.e. structural types. Minimally, static duck typing will provide a much clearer documentation of the expected type via a type annotation vs. a documentation comment which is subject to typos.

Down the line, static duck typing can also be coupled with static analysers which can be run as part of the standard build steps to ensure all statically duck typed portions of the code are used in a type safe way.

The problem with Python's standard dynamic duck typing is that you don't know that something is not a duck until you ask it to quack, at runtime, and it says "sorry I don't know how", possibly deep into the function. Of course, that's what these documentation descriptors try to tackle. But it's up to a human to write and interpret them. There's no machine support (or is there?)

The nice thing about static duck typing is that you still don't have to nominally type everything you put into a function - which violates the Python ethos. But by statically declaring the structural type (protocol) that a function accepts, you define a clear contract to users. So it's a good balance between strictly nominally typed languages like Java and untyped languages like Python.
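A sketch of what static duck typing could look like with `typing.Protocol` (PEP 544); the protocol name, method set, and function here are hypothetical, chosen only to show the mechanics:

```python
from typing import Protocol, runtime_checkable

import numpy as np
import pandas as pd


@runtime_checkable
class HasMean(Protocol):
    """Hypothetical structural type: anything exposing a .mean() method."""
    def mean(self): ...


def average_irradiance(ghi: HasMean) -> float:
    # The contract is structural: we only require .mean(), not a
    # specific nominal type like np.ndarray or pd.Series.
    return float(ghi.mean())


# Both satisfy the protocol without inheriting from anything:
print(average_irradiance(np.array([100.0, 200.0])))   # 150.0
print(average_irradiance(pd.Series([100.0, 200.0])))  # 150.0
print(isinstance(3.14, HasMean))                      # False: floats have no .mean()
```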

ColmBhandal avatar Jan 28 '22 13:01 ColmBhandal

Minimally, static duck typing will provide a much clearer documentation of the expected type via a type annotation vs. a documentation comment which is subject to typos.

But by statically declaring the structural type (protocol) that a function accepts, you define a clear contract to users.

I confess I have only briefly skimmed PEP 544, but I don't see how static duck typing is clearer to the user than docstring comments are. From the user's perspective, isn't it essentially just moving numeric from the docstring to the signature? It might enable users to use static analysis tools to validate their code that uses pvlib, but I'm skeptical more than a tiny fraction of users would be interested in that. Maybe a concrete example would help me understand why static duck typing is better :)

kandersolar avatar Jan 28 '22 14:01 kandersolar

@kanderso-nrel Definitely an example, or many, would be needed to justify such a significant introduction to the code base. I was holding off on the above comment for the past few days as I wanted a solid example to justify it, but I may or may not have time, so figured I'd just throw the idea out there. Certainly if I have time I'd be passionate about exploring this opportunity more, with concrete code examples. And if I do I'll post my findings here :)

ColmBhandal avatar Jan 28 '22 15:01 ColmBhandal

sorry for commenting glibly, but it seems to me the heart of this issue comes down to 2 concerns:

  1. pvlib does not include native python sequences like lists, tuples, and sets in either "array-like" or "numeric", while numpy does
  2. pvlib documentation style doesn't handle scalars consistently, and possibly needs to define what a scalar is -- is it a native python numeric type like int or float (and the numpy equivalents np.int32 or np.float64), or does it also include ndarrays of size 0 or 1 in any dimension, e.g. shape (0,), (1,), or (1, 1, ...)?

native python sequences

I think this only becomes an issue if a user tries to invoke a numpy or pandas attribute on a sequence and gets an attribute error. We should be explicit in the documentation about whether a function requires a pandas series or dataframe, a numpy array, or whether a sequence will do. Pandas series and dataframes have many of the same attributes as numpy arrays, and many numpy functions will accept a sequence, so we may need to be diligent in checking contributions to see whether they will raise an attribute error with a native python sequence.

Otherwise, I think it's generally okay to tell users that pvlib only uses numpy arrays or pandas series and dataframes, not native python sequences, and just be explicit about that in the docs. I don't know what word we should use to describe such a class of data types; I guess we can't use "array-like" or array_like if we plan to leave out lists, tuples, etc., because numpy has already defined that term to include them along with scalars. However, one easy yet tedious solution is to invoke np.asarray(), which will copy the input to a numpy array (converting scalars and sequences) or do nothing if the input is already an array. Then we can simply leave the type definitions, docs, comments, and hints as the NumPy glossary version. QED

handling scalar ambiguity

Using np.asarray may resolve this issue as well, but I think this is more of a documentation issue than a programmatic one. We just need to be clear whether any length of input, including 1, is okay. For input that can only be a scalar, I think the documentation should say which types it should be. I think we have been misusing "numeric", which according to the Python builtin types definition includes only int, float, and complex -- by definition always scalar.

So there's my recommendations:

  1. change our definition of array_like to match numpy's
  2. change our definition of numeric to match python's
  3. use np.asarray to assimilate input to array_like for all functions
  4. stress in the documentation that this is a scientific library that deals mostly with arrays and timeseries, and so users will get better mileage by using np.arrays or pandas series and dataframes

now of course this doesn't handle xarray or other "array_like" datatypes, but I think that's out of scope for this issue
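A sketch of recommendation 3 above: a single `np.asarray` call at the top makes a function indifferent to whether it receives a scalar, list, tuple, or array. The function here is a hypothetical toy, not pvlib's actual airmass implementation:

```python
import numpy as np


def relative_airmass_toy(zenith):
    """Toy secant-of-zenith airmass; accepts anything array_like."""
    zenith = np.asarray(zenith)  # scalar, list, tuple, or array all become ndarray
    return 1.0 / np.cos(np.radians(zenith))


print(relative_airmass_toy(0))        # 1.0 (scalar input works)
print(relative_airmass_toy([0, 60]))  # [1. 2.] (list input works)
print(relative_airmass_toy((0, 60)))  # [1. 2.] (tuple input works)
```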

mikofski avatar Jan 28 '22 17:01 mikofski

I'm coming around to the view that it may be better to just list the specific types in the docstring, e.g.,

x : scalar, np.array, or pd.Series

rather than

x: numeric

where numeric is defined by a part of the pvlib documentation that a user will likely not have read. Similarly, I don't anticipate pvlib users to know to read numpy's documentation to know what array_like means. Of course, my example opens the question: What is scalar? and perhaps: What is np?

cwhanse avatar Jan 28 '22 17:01 cwhanse

The ambiguity that arises from discussions like these might be eliminated by structural types aka static duck types, because their semantics are rigorously defined. I agree that it's a safer approach not to assume that users have read the numpy docs when interpreting type declarations in the docs. Structural types may be clearer. I've briefly seen them in action in the Gosu language and I thought they were really cool. I'd love to investigate if they'd be applicable to this problem in Python.

ColmBhandal avatar Jan 28 '22 18:01 ColmBhandal

@mikofski I'll be even more glib: "explicit is better than implicit"!

I agree with @cwhanse's comment. The only change I'd make is to use float instead of scalar - except for when int is required. That avoids the question of python vs numpy scalar. I think the vast majority of people will interpret float as a python float and will blindly use int as well. I can't recall any case when int is actually required but maybe there are a few.

I'm also all for adding type hints to pvlib but I really think that's a separate conversation because they are not easily integrated with documentation at this point in time.

wholmgren avatar Jan 28 '22 18:01 wholmgren

@wholmgren For what it's worth I think I'm with you that the type hints is probably a separate, much larger thing. For this issue to get closed as a practical matter I think just merging my PR (after I add the tests) then analysing the 26 remaining usages of array_like and updating the doc appropriately should do it. Of course, conversations are likely to follow on what those updates will be.

ColmBhandal avatar Jan 28 '22 20:01 ColmBhandal

Questions:

  • Is there good reason/established precedent that pvlib's numeric and array-like include np.array but NOT pd.DataFrame? (see here)
  • Is there any programming style/practice/expectation/contract around a function/method outputs, e.g., returning a pd.Series when a pd.Series was provided? (in cases where this is applicable, and where mixed input types could require some arbitrary convention for the output type)

campanelli-sunpower avatar Jan 28 '22 20:01 campanelli-sunpower

Readability counts. Having to explicitly spell out every type is fine, but possibly ugly. I don't think it's unreasonable to expect pvlib users to be fluent with numpy. Leveraging the numpy API is a common practice among python packages, and I think we can benefit from it too. IMO array_like as defined by numpy is just fine.

On the other hand, I think numeric as pvlib defines is bad. It's defined differently by Python where it has nothing to do with sequences, arrays, or dataframes.

By that same token, I don't think we should use "scalar" as a datatype. I think any function that's meant to be used with arrays will work fine with an array of length 1, and any function that requires an int or float should just say that explicitly.

We should consider if we want to wrap inputs meant to be arrays with np.asarray which could solve many of these problems.

mikofski avatar Jan 28 '22 22:01 mikofski

Another perspective is that having to explicitly wrap inputs into arrays is ugly. It delegates type checking to the function's computational part rather than its type signature, so it sort of muddies the responsibility of the function body, which is to compute, rather than type check. I'm relatively inexperienced in Python and from what I'm reading it seems like this is common enough practice though, but I think that's just a trade off with having a dynamically duck typed language.

But yeah, the flip side is that types can get pretty ugly especially when they get as complicated as @campanelli-sunpower's second bullet point references eg different input types guaranteeing different outputs.

ColmBhandal avatar Jan 28 '22 22:01 ColmBhandal

IMO array_like as defined by numpy is just fine

I'll disagree on this point. Rarely do we design and test pvlib functions to accept any data type included in numpy's array_like, e.g., list, tuple, ndarrays with several dimensions. Using array_like and pointing to numpy implies that pvlib will accept all of these types.

There are many functions which users will want to work with either float or array input. At the risk of sidetracking our discussion, I find numpy to be a bit schizophrenic with length 1 arrays and that's why I favor explicitly adding something like float. To make a 1d array with a float, the (to me obvious) np.array(5) does not return what is expected, while np.atleast_1d does. There could be a better way to do it.

len(np.array(5))  # error
len(np.atleast_1d(5))  # works

cwhanse avatar Jan 28 '22 23:01 cwhanse

I think len works consistently for either python or numpy:

len(5)  # TypeError: object of type 'int' has no len()
len([5])  # 1

However, numpy arrays have size, shape, and ndim attributes which might be more useful than calling len.

import numpy as np
a = np.array(5)
b = np.atleast_1d(5)
a.size  # 1
b.size  # 1
a.shape  # ()
b.shape  # (1,)
a.ndim  # 0
b.ndim  # 1

I'm relatively inexperienced in Python and from what I'm reading it seems like this is common enough practice though, but I think that's just a trade off with having a dynamically duck typed language.

We're all learning all the time here, so your contributions have been really helpful. For example, I'd never heard of PEP 544 or structural typing, so thanks for that! In my experience Python does differ significantly from other languages like Java and C# in that

  1. duck typing is preferred over type-checking, meaning we really don't care whether it's the correct type, because if it works then it's fine
  2. we let exceptions get thrown by the calling function rather than wrapping them and handling every exception at the top level - this means that users may have to read through the stack trace to understand what happened, but when they do find the culprit, it's usually more informative than a wrapper message.

Both of these features are embodied in a pythonic idiom called EAFP, or "easier to ask forgiveness than permission." The opposite is "look before you leap," or LBYL. Microsoft developer Brett Cannon explains this in his post: EAFP versus LBYL.

I think these reasons are what draw analysts to python in the first place, but may cause headaches for software developers and probably cause performance penalties compared to more statically typed languages.


I'm curious to see what's going on in other packages. Surely we're not the only ones to have to handle this. What's going on in statsmodels, astropy, scikit-learn, etc.? I'll have to do some research.

mikofski avatar Jan 28 '22 23:01 mikofski

duck typing is preferred over type-checking, this means that we really don't care if it's the correct type, because if it works then its fine

Yep - I figured this was the case reading bits here and there about Python. Typing is quite a subtle issue though and there's a whole spectrum of conventions to choose from. In my mind I've at least identified a series of points along the spectrum:

  1. On the one hand, you have dynamic duck typing - which seems to be the Python idiom. Here, the function doesn't declare the type it accepts - other than maybe the doc string. The type that the function accepts is implicit in its code. If you put something in and it breaks, then that's because you put the wrong thing into the function. And it'll break exactly at the point where the function expected the type to have some property/method that it didn't have. I guess you could say that's an EAFP approach, in a broad sense, although from what I'm reading EAFP is more about using try/except blocks instead of if checks.
  2. Next you have static duck typing (PEP 544). Here, the type is explicit in the function signature rather than implicit in its body. So I guess the difference here is that it's introducing a sort of LBYL mindframe to the typing of the function itself. What struck me though is the doc strings. Thinking about it, I'm not so sure it's an issue for LBYL vs. EAFP, which seems to be more about coding practices within the function body. To me, it's more a matter of: if we're going to go to all the effort of documenting it, why not just type it? I guess the answer to "why not" is again that the types may not be as easily humanly read, and may be cumbersome to develop/maintain. And I think one of the downsides mentioned in PEP 544 was that it can be cumbersome to write implementations for structural types with a lot of members, as there is no way to make members optional, leading to a lot of boilerplate code. And it's yet to be shown concretely whether structural types are even applicable to the problem at hand, e.g. are they able somehow to express the idea of numeric rigorously? On the flip side, types in the doc strings are subject to human misinterpretation and typos.
  3. Then you have nominal types - now we're squarely in the Java/C# world. Here, not only does the function accepting the types need to declare them, but whoever produces the types must actually construct an instance of the exact type the function expects - not just something that looks like it. Of course, you can use generics to do templates that work across multiple types, and you can always declare stuff as Object and cast - but generics have their limitations, and unsafe casting seems to be non-idiomatic in this world; if I see it I usually frown upon it as unsafe. As someone who's been coding in this world for most of my experience, I actually find the restriction to nominal typing often quite annoying. I have no problem with a function being upfront about what it expects, but the fact that the inputs must be explicitly constructed from some statically-declared subtype is restrictive. You need to start writing wrapper types if you don't have control over types you're accepting from other libraries, for example.

Then to throw a spanner in the works, you have @campanelli-sunpower's mention of type contracts. I can't say that this lies on a strictly linear spectrum of non-typed to nominally typed, but I suppose on the broader spectrum of weakly vs. strongly typed it's definitely on the strongly typed side of things. Now you start talking not just about what type a function accepts and what type it outputs, but you start saying things like: if the function accepts an A then it returns a B, but if it gets a C then it'll return a D. I'm a big fan of strong typing, but I'm not sure how practical that would be, and if the types being used in the contracts are nominal, then I think it violates the Python ethos of duck typing and suffers the same drawbacks to nominal typing that I mentioned above. Maybe in safety-critical applications like software used in NASA or something, you'd invest in something like type contracts to really guarantee safety beyond just unit testing it. But I think it would slow down the development speed. I may be speaking out of ignorance here as I don't know a lot about type contracts in Python. I did try them once in Java and didn't really find them powerful enough to be useful.

Anyway, I'm excited to see where the broader discussion goes on this. In the meantime, I'll have a look at my open PR and try to progress it somewhat.

ColmBhandal avatar Jan 30 '22 18:01 ColmBhandal

Rarely do we design and test pvlib functions to accept any data type included in numpy's array_like, e.g., list, tuple, ndarrays with several dimensions. Using array_like and pointing to numpy implies that pvlib will accept all of these types.

If we wrap inputs we expect to be array_like with np.asarray() I believe that may solve this problem. For example:

import numpy as np

def myfun(a, b):
    """my function

    Parameters
    ----------
    a : float
        A scalar float; an int would also work fine
    b : array_like
        Could be a scalar, sequence, or array of any size
    """
    b = np.asarray(b)

also most NumPy functions do this conversion already for us so we might not even need it

mikofski avatar Jan 30 '22 20:01 mikofski

Rather np.asarray or np.asanyarray()?

adriesse avatar Jan 30 '22 20:01 adriesse

if we're going to go to all the effort of documenting it, why not just type it?

My understanding is that neither PEP544 (Protocols) nor PEP484 (Type Hints) actually does any type checking unless you invoke an external tool like mypy. There is nothing in python itself that, at runtime or any other time, actually checks that inputs match the type annotations. The runtime semantics are duck typing regardless of whatever type fanciness we put in the signature. So to me, "typing it" is just another form of documentation, not an alternative to it. Sorry for the noise if everyone already knows this (or if I'm misinformed, very possible since I don't use these features myself!).

Also I think PEP544 is python 3.8+, so it is not available to us in the near term anyway (3.7 EOL is June 2023).
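A quick demonstration that annotations are inert at runtime: Python stores them but never checks them, so only an external tool like mypy would flag the mismatch below.

```python
def double(x: float) -> float:
    return x * 2

# No runtime error despite violating the annotation -- duck typing wins:
result = double("abc")
print(result)                  # abcabc
print(double.__annotations__)  # the annotations are merely stored metadata
```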

If we wrap inputs we expect to be array_like with np.asarray() I believe that may solve this problem.

If this turns out to be the best solution, so be it, but to @campanelli-sunpower's point I do expect a Series out if I put a Series in, which I think means basically every function will start and end with typing boilerplate. Or maybe we could be clever with a decorator. I see value in being consistent with other pydata packages, but does get_total_irradiance really need to support ghi typed as tuple?
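A sketch of that decorator idea (the decorator and the clipping function are hypothetical helpers, not pvlib API): convert the first argument to an array for the computation, and if it arrived as a `pd.Series`, rebuild a Series with the original index around the result.

```python
import functools

import numpy as np
import pandas as pd


def series_in_series_out(func):
    """Hypothetical decorator: Series in -> Series out, arrays otherwise."""
    @functools.wraps(func)
    def wrapper(data, *args, **kwargs):
        index = data.index if isinstance(data, pd.Series) else None
        result = func(np.asarray(data), *args, **kwargs)
        if index is not None:
            result = pd.Series(result, index=index)  # restore the original index
        return result
    return wrapper


@series_in_series_out
def clip_irradiance(ghi, limit=1000.0):
    # body only ever sees an ndarray
    return np.minimum(ghi, limit)


arr = clip_irradiance(np.array([900.0, 1200.0]))                      # ndarray out
ser = clip_irradiance(pd.Series([900.0, 1200.0], index=["a", "b"]))   # Series out
print(type(arr), type(ser))
```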

kandersolar avatar Jan 30 '22 22:01 kandersolar

@kanderso-nrel are you okay with np.asarray and array_like?

mikofski avatar Jan 30 '22 22:01 mikofski

I think I'm against it? If we (1) don't want to be using array_like differently from numpy, AND (2) want output type to match input type, then I think any function claiming to support array_like would have to jump through a bunch of hoops (convert inputs to array, convert back to Series/tuple/whatever before returning), that seems inefficient and unnecessary to me.

I'd rather we come up with our own abstract types (e.g. our current array-like but with a different name) or just spell out np.array, pd.Series in docstrings instead of expanding our API to support every type that numpy's array_like does.

kandersolar avatar Jan 30 '22 23:01 kandersolar