Runtime scales poorly with n_pvrows
Oftentimes it is desirable to neglect edge effects by modeling a row in the interior of a large array, i.e. with a large n_pvrows. Unfortunately, the runtime scaling with n_pvrows is quite bad -- I'm having trouble pinning down the degree of the asymptotic complexity (if it even is polynomial; it might be factorial?), but it's certainly not linear or even quadratic:

The good news is that going all the way to n_pvrows > 30 is overkill for making edge effects negligible -- n_pvrows=11 seems pretty good, at least for this array geometry:

The bad news is that n_pvrows=11 is still an order of magnitude slower than n_pvrows=3, the current default in the pvlib wrapper function. The code for the above plots is available here: https://gist.github.com/kanderso-nrel/e88e3f7389b9d144a546dbe5651dfe1e
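For reference, the benchmark is roughly a loop like the following over n_pvrows. This is only a sketch -- the real code and array geometry are in the gist above, and the parameter values below are placeholders, assuming the pvlib 0.9-era `pvfactors_timeseries` signature:

```python
import time
import numpy as np
import pandas as pd
from pvlib.bifacial.pvfactors import pvfactors_timeseries

# Placeholder inputs -- a day of 1-minute data with made-up sun positions
# and irradiance, just to exercise the engine.
times = pd.date_range('2021-06-01 05:00', '2021-06-01 20:00', freq='1min', tz='Etc/GMT+7')
n = len(times)
solar_azimuth = pd.Series(np.linspace(90, 270, n), index=times)
solar_zenith = pd.Series(np.abs(np.linspace(-80, 80, n)), index=times)
surface_azimuth = pd.Series(np.where(solar_azimuth < 180, 90, 270), index=times)
surface_tilt = pd.Series(np.abs(np.linspace(-60, 60, n)), index=times)
dni = pd.Series(800.0, index=times)
dhi = pd.Series(100.0, index=times)

for n_pvrows in [3, 5, 7, 9, 11]:
    start = time.perf_counter()
    _ = pvfactors_timeseries(
        solar_azimuth=solar_azimuth, solar_zenith=solar_zenith,
        surface_azimuth=surface_azimuth, surface_tilt=surface_tilt,
        axis_azimuth=180, timestamps=times, dni=dni, dhi=dhi,
        gcr=0.4, pvrow_height=1.5, pvrow_width=2.0, albedo=0.2,
        n_pvrows=n_pvrows,
        index_observed_pvrow=n_pvrows // 2,  # model the middle (interior) row
    )
    print(n_pvrows, round(time.perf_counter() - start, 2))
```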
I've not looked into how we might go about improving this situation. If I had to guess, it would require a pretty substantial refactor, if it's possible at all. It would be a pleasant surprise if that guess is incorrect :) But even if it can't be fixed, I think it's useful to have an issue documenting this effect.
Thanks a lot for this study @kanderso-nrel, that's really awesome data that I never collected before 🙏
Back in the day, I focused on increasing the calculation speed with respect to the number of time data points using vectorization. But that's only one dimension of the 3-dimensional view factor matrix that is inverted in the PV engine. As the number of PV rows increases, I was expecting at least O(n^2) complexity just for calculating the vf matrix, since each of the other two dimensions grows like O(n); in the end it looks like the full calculation is a lot worse than that 😅
I would be curious to see where it spends the most time (if I have time, maybe I'll try running some quick profiling with pyinstrument or something like that); I would expect it to be in the inversion process, but I could be wrong. If that's the case, maybe doing the inversion on a GPU could help..? (I think it would be quite fast to transform the numpy array into a PyTorch tensor.)
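To make that concrete, here's a minimal sketch of what the GPU route might look like with a batched torch.linalg.solve -- the shapes are made up, torch is treated as an optional extra, and I haven't benchmarked any of this:

```python
import numpy as np
import torch  # optional dependency, only for this experiment

# Stand-in arrays with the same layout as the engine's: a stack of T systems
# of size M x M, plus T right-hand-side vectors.
T, M = 500, 321
a_mat = np.random.random((T, M, M)) + M * np.eye(M)  # well-conditioned stack
b = np.random.random((T, M))

device = 'cuda' if torch.cuda.is_available() else 'cpu'
a_t = torch.from_numpy(a_mat).to(device)
b_t = torch.from_numpy(b).unsqueeze(-1).to(device)  # (T, M, 1)

# Batched solve on the GPU (or CPU fallback), then back to numpy.
q0 = torch.linalg.solve(a_t, b_t).squeeze(-1).cpu().numpy()
```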
The only other way to speed things up that I can think of right now is to make some approximations: about the vf matrix, or maybe about irradiance, reflections, etc. Out of curiosity, did you try running the "fast" mode of the engine? I'm wondering how far off it is compared to the full mode (IIRC it only calculates back side irradiance for a single PV row in the whole array).
Thanks for the pointer to pyinstrument -- that's a very nice little tool that I will put to use outside of pvfactors as well!
I continue to find it difficult to get consistent timings, but indeed the inversion of a_mat is a large fraction (maybe 50-75%?) of total runtime for a simulation like the one in the original post.
I'm far from a computational linear algebra expert, but it seems like it should be possible to somehow take advantage of a_mat's sparseness (often >99% zeros). New array-like functionality was added to scipy.sparse in v1.8 (docs), but it looks like it's limited to 2-D for now. I played around a bit with the older matrix-based sparse interface (also 2-D, so I had to loop and invert each MxM matrix individually) but didn't find anything faster than np.linalg.inv. I did not try it out, but the pydata sparse package supports N-D sparse arrays; it doesn't seem to have an inverse function, although perhaps it can be used with scipy.sparse somehow: https://sparse.pydata.org/en/stable/roadmap.html#partial-scipy-integration
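A sketch of the kind of loop I mean (stand-in arrays here, not the engine's actual a_mat):

```python
import numpy as np
from scipy import sparse
from scipy.sparse import linalg as splinalg

# Stand-in for the engine's (T, M, M) stack: mostly zeros, nonsingular.
T, M = 50, 321
a_mat = np.tile(np.eye(M), (T, 1, 1))
a_mat[:, np.arange(M - 1), np.arange(1, M)] = 0.1  # a few off-diagonal terms

# The 2-D sparse interface forces a Python loop over time steps.
inv_a_mat = np.empty_like(a_mat)
for i in range(T):
    inv_a_mat[i] = splinalg.inv(sparse.csc_matrix(a_mat[i])).toarray()

# In my tests this kind of loop was not faster than np.linalg.inv(a_mat)
# applied to the full dense stack.
```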
Much of the rest of the runtime in full mode is spent building the two vf matrices.
The fast mode is substantially faster: tenths of a second for fast vs tens of seconds for full. I leave it to the reader to decide if the difference in output is significant :)

> take advantage of a_mat's sparseness (often >99% zeros)
that's a good idea! I'll try to investigate on my end as well.
> I leave it to the reader to decide if the difference in output is significant
Thanks for trying it :) I was expecting worse results!! But yeah, I guess it really depends on the application (and so the user) in the end.
Hmm, another idea is to never explicitly invert the matrix at all? For large simulations I see 10-15% faster calls to engine.run_full_mode by replacing this:
https://github.com/SunPower/pvfactors/blob/e0ea9d5a03d130d25919b39d590645ea2997ad14/pvfactors/engine.py#L223-L226
with `q0 = np.linalg.solve(a_mat, irradiance_mat.T).T`. I've never sat down to learn Einstein summation notation, so I'm not positive this is a perfect replacement, but it seems to yield identical results... here are some timing plots if you're interested: https://gist.github.com/kanderso-nrel/6bf69c35314446b34ca2bfbd3bd41964
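For anyone who wants to sanity-check the equivalence, here's a small standalone comparison on random, well-conditioned matrices. The shapes and the einsum subscripts here are my own stand-ins and may not match engine.py exactly:

```python
import numpy as np

# T stacked M x M systems and an (M, T) right-hand-side matrix.
T, M = 500, 321
rng = np.random.default_rng(0)
a_mat = rng.random((T, M, M)) + M * np.eye(M)  # diagonally dominant, well conditioned
irradiance_mat = rng.random((M, T))

# Current style: explicit inverse followed by an einsum contraction.
q0_inv = np.einsum('ijk,ki->ji', np.linalg.inv(a_mat), irradiance_mat)

# Proposed style: solve each system directly, never forming the inverse.
# (The trailing axis keeps the "stack of vectors" meaning regardless of
# numpy version.)
q0_solve = np.linalg.solve(a_mat, irradiance_mat.T[..., np.newaxis])[..., 0].T

assert np.allclose(q0_inv, q0_solve)
```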
Wow, that's awesome @kanderso-nrel! It's very true that inv_a_mat is not used anywhere else, so if we can collapse both steps into one and at the same time get a consistent speed improvement, let's go for it!! 🎉 👏 Honestly I don't really remember how the Einstein sum works; it was just a trick to get the matrices multiplied in the way I needed 😅
IMO if the tests all pass after making the change that's good enough for me 👍
Some more progress, this time by combining the two previous ideas: use a sparse linear solver instead of explicitly inverting the matrix. I want to do some more testing to make sure it behaves nicely outside of my little benchmarks, but here's a runtime comparison in the meantime -- calls to the pvlib wrapper function are 30% to 60% faster in this case:

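One way to set up that kind of batched sparse solve is to stack the T independent per-timestep systems into a single block-diagonal sparse matrix and call the solver once. This is just a sketch with stand-in arrays; the actual implementation in #140 may differ:

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

# Stand-ins for the engine's arrays: T mostly-zero M x M systems.
T, M = 50, 321
rng = np.random.default_rng(0)
a_mat = np.tile(np.eye(M), (T, 1, 1))
a_mat[:, np.arange(M - 1), np.arange(1, M)] = rng.random((T, M - 1))
irradiance_mat = rng.random((M, T))

# One big block-diagonal system solved in a single sparse call.
big_a = sparse.block_diag([sparse.csc_matrix(a_mat[i]) for i in range(T)], format='csc')
big_b = irradiance_mat.T.ravel()            # (T*M,) right-hand side
q0 = spsolve(big_a, big_b).reshape(T, M).T  # back to (M, T)

# Same answer as the dense batched solve.
assert np.allclose(q0, np.linalg.solve(a_mat, irradiance_mat.T[..., np.newaxis])[..., 0].T)
```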
After continued efforts to speed up pvfactors (as always, collaborating with @spaneja), I think the timing interpretation I posted above is not wholly correct. tl;dr: I think just using a dense solver like np.linalg.solve is probably better than the irksome sparse solver from #140, but I want to do some more digging.
The runtime of np.linalg.inv and np.linalg.solve, at least on this computer (Win10, i7-8850H, numpy 1.22.3), varies significantly depending on whether numpy is installed from PyPI or from conda-forge. I can only assume the difference is caused by which BLAS library is being linked against; for PyPI it seems to be OpenBLAS while I think conda-forge uses Intel's MKL, with the PyPI/OpenBLAS version being the slower of the two. Debugging this is a little out of my depth, but I did discover that limiting the BLAS's own threading behavior makes a big difference:
```python
In [1]: from threadpoolctl import ThreadpoolController
   ...: import numpy as np  # v1.22.3, installed from PyPI
   ...: import time
   ...:
   ...: controller = ThreadpoolController()
   ...: a = np.random.random((500, 321, 321))
   ...: b = np.random.random((500, 321))

In [2]: st = time.perf_counter()
   ...: _ = np.linalg.solve(a, b)
   ...: ed = time.perf_counter()
   ...: print(ed - st)
5.766736300000002

In [3]: st = time.perf_counter()
   ...: with controller.limit(limits=1, user_api='blas'):
   ...:     _ = np.linalg.solve(a, b)
   ...: ed = time.perf_counter()
   ...: print(ed - st)
0.5034510000000019
```
In any case, I suspect the dramatic speedup I saw earlier in this thread and in #140 was due more to the PyPI/OpenBLAS numpy making the original np.linalg call slow than to the new scipy.sparse calls being fast. Below is a runtime comparison across a few code variants, using numpy from both PyPI and conda-forge. The code variants are as follows:
- `1.5.2`: pvfactors v1.5.2 as installed from PyPI (the current `np.linalg.inv` + `np.einsum` approach)
- `#140`: the approach used in #140 (sparse linear solve using `scipy.sparse.linalg`)
- `np.linalg.solve`: a new approach just using `np.linalg.solve` to replace the current approach
- `np.linalg.solve+limit`: same as `np.linalg.solve`, but using the `threadpoolctl` trick to address the strange OpenBLAS slowdown
Here are the timings of calls to pvlib.bifacial.pvfactors.pvfactors_timeseries using each of the above pvfactors variants, where all times have been normalized to the 1.5.2 variant using numpy from conda-forge (so values below 1.0 represent improvements):

Here are my takeaways:
- #140 is an improvement if you're using an OpenBLAS numpy, but might be worse if you're using MKL.
- Using a dense linear solver (`np.linalg.solve`) combined with the `threadpoolctl` trick gives the best and most consistent speedup across both BLAS packages (a minimal sketch of that combination is below).
- The OpenBLAS issue is concerning and I plan to follow up with the numpy folks about it. It's always possible I'm doing something wrong, but I still see a (smaller, but still substantial) slowdown with a different processor and OS. edit: no, this is indeed a real thing. See https://github.com/numpy/numpy/issues/15764
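A minimal sketch of how that `np.linalg.solve+limit` combination could be wired up (the wrapper function name and shapes here are hypothetical, not existing pvfactors code):

```python
import numpy as np
from threadpoolctl import threadpool_limits

def solve_q0(a_mat, irradiance_mat):
    # a_mat: (T, M, M) stack of systems; irradiance_mat: (M, T).
    # Limit BLAS threading just for the batched solve to avoid the
    # OpenBLAS slowdown described above.
    with threadpool_limits(limits=1, user_api='blas'):
        return np.linalg.solve(a_mat, irradiance_mat.T[..., np.newaxis])[..., 0].T
```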
Thanks a lot for this study @kanderso-nrel, I remember testing your PR locally back then before approving, and for me there was no slowdown but no crazy improvement either, so I assumed it was at least as good as the current implementation.
I think your study could be quite useful for people using pvfactors in production, etc., and I would be inclined to add it to the package documentation.
But I've been MIA for a while, so I'm not even sure there's anyone at SunPower who has ownership of pvfactors... most of the performance/software people I knew there seem to have left..? @campanelli-sunpower @pfranz-spwr