pvlib-python icon indicating copy to clipboard operation
pvlib-python copied to clipboard

irradiance.clearsky_index() isn't tolerant of pandas DataFrame inputs

Open jranalli opened this issue 2 years ago • 9 comments

Describe the bug I'm getting some issues with irradiance.clearsky_index() with mixed datatypes on inputs. Specifically, clearsky_ghi is often a pandas Series. If working with a ghi input that shares an index, but is a DataFrame, division behaves strangely due how pandas interprets the indices. I understand that this is kind of a peculiarity of pandas rather than something wrong with pvlib, so it may be something that you don't want to fix. But I just wanted to bring it up.

To Reproduce Steps to reproduce the behavior: This code produces an unexpected output. Though this is an example of a 1 column DataFrame, which is a trivial case, this same thing happens given a DataFrame with multiple columns as well.

index = pd.date_range('2022-10-10 10:00:00', '2022-10-10 11:00:00', freq='1h')
cs_ghi = pd.Series([1,1], index=index)
ghi = pd.DataFrame([0.5,0.5], columns=['A'], index=index)
pvlib.irradiance.clearsky_index(ghi, cs_ghi)
# returns array([[0., 0., 0.], [0., 0., 0.]])

Expected behavior Since they share an index, ideally ghi would still be converted to clearsky_index. In principle, this ought to work even with multiple columns in the DataFrame (the actual use case I have). So the output that you'd like to see here is:

                       A
2022-10-10 10:00:00  0.5
2022-10-10 11:00:00  0.5

Versions:

  • pvlib.__version__: 0.9.0
  • pandas.__version__: 1.3.2
  • python: 3.9.7

Additional context There's a current workaround, which is to apply the function column by column:

ghi.apply(lambda x: pvlib.irradiance.clearsky_index(x, cs_ghi), axis=0)

There is a way to resolve it internally and have the division not broadcast if you know that ghi is a DataFrame

clearsky_index = ghi.div(cs_ghi, axis=0)

But there are then some issues with broadcasting for the logical tests later on, and possibly formatting the output.

So again, this boils down to whether enough people use DataFrames as inputs that it's worth going through that typing business within the pvlib function. I'm happy to look at how to make it be a little more type agnostic if you want to resolve it that way. If not, it might be worth documenting somehow that both ghi and cs_ghi must be 1-D for this to work.

jranalli avatar May 25 '22 18:05 jranalli

it's worth going through that typing business within the pvlib function.

I think we're leaning towards the other direction: avoid anything to do with types in functions like this. See #1455. I'm not aware of any broadcasting tricks that are compatible with both numpy and pandas that would solve this problem but maybe someone else is.

The code below solves the problem if you're willing to call .to_frame() on the input Series to add another dimension, then .to_numpy() to strip the index.

You'd still be stuck with array output though. In many pvlib functions we use np.where because it can handle scalars (unlike boolean indexing), but it always returns a numpy array. That means that we instead need special code to handle Series (or anything else like DataFrame). Might be better to cast a scalar to an array, use boolean indexing in the function logic, then reduce the result to a scalar. Then the DataFrame in will result in a DataFrame out. Let's see..

# modifications in pvlib.irradiance.clearsky_index
def clearsky_index(ghi, clearsky_ghi, max_clearsky_index=2.0):
    clearsky_index = ghi / clearsky_ghi
    if np.isscalar(clearsky_index):
        scalar_out = True
        clearsky_index = np.asarray(clearsky_index)
    else:
        scalar_out = False
    # set +inf, -inf, and nans to zero
    clearsky_index[~np.isfinite(clearsky_index)] = 0
    # but preserve nans in the input arrays
    input_is_nan = ~np.isfinite(ghi) | ~np.isfinite(clearsky_ghi)
    clearsky_index[input_is_nan] = np.nan

    clearsky_index = np.maximum(clearsky_index, 0)
    clearsky_index = np.minimum(clearsky_index, max_clearsky_index)

    if scalar_out:
        return clearsky_index.item()
    return clearsky_index


In [99]: irradiance.clearsky_index(.5, 1)
Out[99]: 0.5

index = pd.date_range('2022-10-10 10:00:00', '2022-10-10 12:00:00', freq='1h')

cs_ghi = pd.Series([1,1,1], index=index)

In [100]: ghi_df = pd.DataFrame([[0.5,0.5],[1, 1],[.75, .75]], columns=['A', 'B'], index=index)

In [101]: ghi_s = pd.Series([0.5, 1, .75], index=index)

In [104]: irradiance.clearsky_index(ghi_s, cs_ghi)
Out[104]:
2022-10-10 10:00:00    0.50
2022-10-10 11:00:00    1.00
2022-10-10 12:00:00    0.75
Freq: H, dtype: float64

In [107]: irradiance.clearsky_index(ghi_df, cs_ghi.to_frame().to_numpy())
Out[107]:
                        A     B
2022-10-10 10:00:00  0.50  0.50
2022-10-10 11:00:00  1.00  1.00
2022-10-10 12:00:00  0.75  0.75

In [109]: ghi_da = da.from_array(np.array([0.5, 1, .75]))

In [110]: ghi_cs_da = da.from_array(np.array([1, 1, 1]))

In [111]: irradiance.clearsky_index(ghi_da, ghi_cs_da)
Out[111]: dask.array<minimum, shape=(3,), dtype=float64, chunksize=(3,), chunktype=numpy.ndarray>

In [112]: irradiance.clearsky_index(ghi_da, ghi_cs_da).compute()
Out[112]: array([0.5 , 1.  , 0.75])

The special case for scalars is gross but probably worth it.

wholmgren avatar May 25 '22 23:05 wholmgren

It looks like that solution would solve my problem, and just requires some extra manipulation on the function call. I'm totally fine with that if that's what's felt to be the best solution.

That said, I was trying to think of the next person who comes along and is confused about why their code doesn't work. I had encountered this before and knew where to turn. The only "error" I reached though was my pagefile getting hit and the computer churning to a halt as it tried to build a 30GB array. It might be kind of hard to figure out what's going on with that division if you weren't expecting it.

What if you added something relatively generic to protect the input like:

    if np.ndim(ghi) > np.ndim(clearsky_ghi):
        clearsky_ghi = np.expand_dims(clearsky_ghi, axis=-1)

I'm not sure how it would hold up in the event that somebody came in with something crazy and 3-d, but it at least handles the cases you proposed (excluding dask which I'm not familiar with).

# modifications in pvlib.irradiance.clearsky_index
def clearsky_index(ghi, clearsky_ghi, max_clearsky_index=2.0):
    if np.ndim(ghi) > np.ndim(clearsky_ghi):
        clearsky_ghi = np.expand_dims(clearsky_ghi, axis=-1)
    clearsky_index = ghi / clearsky_ghi
    if np.isscalar(clearsky_index):
        scalar_out = True
        clearsky_index = np.asarray(clearsky_index)
    else:
        scalar_out = False
    # set +inf, -inf, and nans to zero
    clearsky_index[~np.isfinite(clearsky_index)] = 0
    # but preserve nans in the input arrays
    input_is_nan = ~np.isfinite(ghi) | ~np.isfinite(clearsky_ghi)
    clearsky_index[input_is_nan] = np.nan

    clearsky_index = np.maximum(clearsky_index, 0)
    clearsky_index = np.minimum(clearsky_index, max_clearsky_index)

    if scalar_out:
        return clearsky_index.item()
    return clearsky_index

index = pd.date_range('2022-10-10 10:00:00', '2022-10-10 12:00:00', freq='1h')
cs_ghi = pd.Series([1,1,1], index=index)
ghi_df = pd.DataFrame([[0.5,0.5],[1, 1],[.75, .75]], columns=['A', 'B'], index=index)
ghi_s = pd.Series([0.5, 1, .75], index=index)

In:  clearsky_index(.5, 1)
Out: 0.5

In:  clearsky_index(ghi_s, cs_ghi)
Out: 
2022-10-10 10:00:00    0.50
2022-10-10 11:00:00    1.00
2022-10-10 12:00:00    0.75
Freq: H, dtype: float64

In:  clearsky_index(ghi_df, cs_ghi)
Out: 
                        A     B
2022-10-10 10:00:00  0.50  0.50
2022-10-10 11:00:00  1.00  1.00
2022-10-10 12:00:00  0.75  0.75

# And with array input types rather than pandas
In:  clearsky_index(ghi_df.values, cs_ghi.values)
Out: 
array([[0.5 , 0.5 ],
       [1.  , 1.  ],
       [0.75, 0.75]])

In:  clearsky_index(ghi_s.values, cs_ghi.values)
Out: 
array([0.5 , 1.  , 0.75])

jranalli avatar May 26 '22 01:05 jranalli

Ok got a followup. I actually made this work for 3D arrays as well as all the tests you proposed. The trick is to replace my previous proposal with:

while np.ndim(ghi) > np.ndim(clearsky_ghi):
    clearsky_ghi = np.expand_dims(clearsky_ghi, axis=-1)

Then here's a test on a 3D array, where the time is the "page" dimension, which is actually the first axis.

  ghi_td = np.array([[[100, 100], [100, 100]], [[200, 200], [200, 200]], [[500, 500], [500, 500]]])
  ghi_cs = np.array([500, 800, 1000])
  out = irradiance.clearsky_index(ghi_td, ghi_cs)
  expected = np.array([[[0.2, 0.2], [0.2, 0.2]], [[0.25, 0.25], [0.25, 0.25]], [[0.5, 0.5], [0.5, 0.5]]])

I put this in a branch, and would be happy to make it into a full pull request if you like this solution.

jranalli avatar May 26 '22 13:05 jranalli

An alternative might be to handle the DataFrame/Series mix case separately: see this link. I expect it would also be gross.

cwhanse avatar May 26 '22 14:05 cwhanse

That was my first instinct, but like @wholmgren suggested, it doesn't help with the numpy-friendliness across the board.

jranalli avatar May 26 '22 14:05 jranalli

That said, I was trying to think of the next person who comes along and is confused about why their code doesn't work. I had encountered this before and knew where to turn. The only "error" I reached though was my pagefile getting hit and the computer churning to a halt as it tried to build a 30GB array. It might be kind of hard to figure out what's going on with that division if you weren't expecting it.

That experience is very familiar! I don't recall hitting it in any pvlib code but routinely hit it in my own code even though I should know better by now.

Is there something unique to clearsky_index that makes it much more likely that users will hit this problem in this function? It's not obvious to me. I ask because I am worried about establishing an API precedent that we'll feel compelled to follow throughout the library.

What about the case where ghi is a Series and ghi_clearsky is a DataFrame? The use case for that is less obvious in this function, but something similar may come up if we try to apply this API design to other functions.

In [141]: ghi_s = pd.Series([0.5, 1, .75], index=index)

In [142]: cs_ghi_df = pd.DataFrame(np.ones((3, 2)), columns=['A', 'B'], index=index)

In [143]: ghi_s / cs_ghi_df
Out[143]:
                     2022-10-10 10:00:00  2022-10-10 11:00:00  2022-10-10 12:00:00   A   B
2022-10-10 10:00:00                  NaN                  NaN                  NaN NaN NaN
2022-10-10 11:00:00                  NaN                  NaN                  NaN NaN NaN
2022-10-10 12:00:00                  NaN                  NaN                  NaN NaN NaN

In [144]: ghi_s.div(cs_ghi_df, axis=0)
Out[144]:
                     2022-10-10 10:00:00  2022-10-10 11:00:00  2022-10-10 12:00:00   A   B
2022-10-10 10:00:00                  NaN                  NaN                  NaN NaN NaN
2022-10-10 11:00:00                  NaN                  NaN                  NaN NaN NaN
2022-10-10 12:00:00                  NaN                  NaN                  NaN NaN NaN

In [145]: ghi_s.to_frame().to_numpy() / cs_ghi_df
Out[145]:
                        A     B
2022-10-10 10:00:00  0.50  0.50
2022-10-10 11:00:00  1.00  1.00
2022-10-10 12:00:00  0.75  0.75

Forgot to say in my initial response... great to see people pushing pvlib with more data!

wholmgren avatar May 26 '22 15:05 wholmgren

My particular use case is that I have a data frame of multiple sensor time series and want to convert them all to clearsky index. One reason that I think this might be likely to come up is that when you get the clearsky irradiance (e.g. from clearsky.iniechen), the output from that function will be a DataFrame with columns, from which you'll select ['ghi'], resulting in a series.

So anytime someone has ghi in a dataframe to begin with, or has multiple time series, this is going to happen.

jranalli avatar May 26 '22 16:05 jranalli

Ok so I dug a bit further into this based on @wholmgren's discussion question about clearsky_ghi being a DataFrame while ghi is a Series. The potential use case I came up with is if the user wants to take a single measured GHI and compare the clearsky_index based on several different clearsky_ghi models. So they'd have a DataFrame of different clearsky timeseries, but a single Series coming in of GHI. The reverse of my original solution actually works to solve that:

Reverse:

while np.ndim(clearsky_ghi) > np.ndim(ghi):
    ghi = np.expand_dims(ghi, axis=-1)

This will produce an output that's a DataFrame rather than a Series, but I feel like a user would need to be prepared for something like that if they're choosing to input mismatched types to start with.

I put this in my fork branch, and it works on all the cases that have been proposed thus far. But that said, I think this is already in the realm of relatively case-specific solutions. There are too many cases to really exhaustively test and maybe "pandas is finnicky about division" is just a user beware situation, and it's up to the user to precondition inputs. As I said before, I do think this function is one that is particularly likely to run into issues, because outputs from the clearsky functions are at times converted to DataFrames, from which you'd need to select the ['ghi'] column to get your clearsky values. But it's still reasonable to let the user figure it out on their own data shapes/types.

If that's the solution though, to bring it back to my original concern, philosophically, I never really know where to err on the "let the user figure out that function returned garbage" or "error out rather than return garbage" balance. Rather than trying to anticipate the broadcasting, do you think it's worth raising an error if the data types are KNOWN to fail as a combination, such as when one of these is a Series and one a DataFrame?

jranalli avatar Jul 11 '22 17:07 jranalli

My 2 cents: both runtime type checking and explicit broadcasting are things I'd be reluctant to add to pvlib, but if we did then it seems better to me for them to be implemented in some generic and complete way that we can easily apply and reuse anywhere in the package rather than putting ad-hoc code at the top of each function of interest.

Rather than trying to anticipate the broadcasting, do you think it's worth raising an error if the data types are KNOWN to fail as a combination, such as when one of these is a Series and one a DataFrame?

That function doesn't claim to support DataFrame, so on some level it's on the user to stay away from "here be dragons" territory. If the argument is that users are likely to accidentally pass in a DataFrame when they meant to do something else, I think the same would apply to a big chunk of pvlib -- anything directly downstream of get_solarposition, get_total_irradiance, etc, would have that problem. I totally agree that people (including myself) trip over that issue, I'm just pointing out that it's not unique to clearsky_index() and it's probably worth thinking through somehow fixing it more broadly instead of on a case-by-case basis.

Edit: although that's not to say that investigating clearsky_index isn't useful as a case study!

kandersolar avatar Jul 11 '22 18:07 kandersolar