Reimplement `.polyfit()` with `apply_ufunc`
- [x] Closes #4554
- [x] Closes #5629
- [x] Closes #5644
- [ ] Tests added
- [x] Passes
pre-commit run --all-files - [ ] User visible changes (including notable bug fixes) are documented in
whats-new.rst
Reimplement polyfit using apply_ufunc rather than dask.array.linalg.lstsq. This should solve a number of issues with memory usage and chunking that were reported on the current version of polyfit. The main downside is that variables chunked along the fitting dimension cannot be handled with this approach.
There is a bunch of fiddly code here for handling the differing outputs from np.polyfit depending on the values of the full and cov args. Depending on the performance implications, we could simplify some by keeping these in apply_ufunc and dropping later. Much of this parsing would still be required though, because the only way to get the covariances is to set cov=True, full=False.
A few minor departures from the previous implementation:
- The
rankandsingular_valuesdiagnostic variables returned bynp.polyfitare now returned on a pointwise basis, since these can change depending on skipped nans.np.polyfitalso returns thercondused for each fit which I've included here. - As mentioned above, this breaks fitting done along a chunked dimension. To avoid regression, we could set
allow_rechunk=Trueand warn about memory implications. - Changed default
skipna=True, since the previous behavior seemed to be a limitation of the computational method. - For consistency with the previous version, I included a
transposeoperation to putdegreeas the first dimension. This is arbitrary though, and actually the opposite of howcurvefitreturns ordering. So we could match up withcurvefitbut it would be breaking for polyfit.
No new tests have been added since the previous suite was fairly comprehensive. Would be great to get some performance reports on real-world data such as the climate model detrending application in #5629.
Unit Test Results
6 files 6 suites 58m 4s :stopwatch: 16 290 tests 14 551 :heavy_check_mark: 1 739 :zzz: 0 :x: 90 936 runs 82 738 :heavy_check_mark: 8 198 :zzz: 0 :x:
Results for commit 62b4637d.
@slevang are you still interested to continue this PR? I think that would be a worthwhile addition and should not be too much left to do. (What would be nice, however, are tests for the issues this fixes.)
@slevang are you still interested to continue this PR? I think that would be a worthwhile addition and should not be too much left to do. (What would be nice, however, are tests for the issues this fixes.)
Definitely! I got distracted is all, and @dcherian posted a nice solution in #5629 that could allow us to preserve the ability to fit along a chunked dimension using blockwise operations and the dask lstsq implementation used by the existing polyfit code.
I'm happy to pick this back up and finish it off if there is consensus on the right way forward, but the blockwise approach seemed promising so I put this on hold.
@slevang Yeah, the blockwise approach seems indeed nice. You're very welcome to continue with the blockwise approach in a different PR if you want to.
Not sure I understand the blockwise approach well enough to make a PR, but maybe I'll give it a try at some point.
I think you can use a lot of @dcherian's code as a base and then for starters see if it simply passes all the tests (including the ones you added here). If you make a draft PR here it's easier to help out as well if you're getting stuck.