PyKrige

Statistics calculations... automatic, or switched by the user?

Open bsmurphy opened this issue 7 years ago • 4 comments

The policy on whether to calculate the statistics automatically (set in places with the enable_statistics boolean kwarg) is inconsistent across the different kriging classes, and the statistics aren't even recalculated in the update_variogram_model method of each class. (I think this is just my mistake from way back when I was putting this all together in the first place, and probably also a consequence of having to copy similar functionality across so many classes.) I'm thinking about just making the statistics calculations automatic/default (i.e., get rid of the enable_statistics flag that appears inconsistently across classes). With the fixes in PRs #47 and #51, the statistics calculations should go smoothly now. Then the user can access them with the existing infrastructure, and if verbose (or someday logging) is enabled then they'll be spit out there. @rth, @basaks, @whdc, thoughts? Along with this change, I'm also thinking of adding a few other outputs to the statistics, like @whdc suggested, and also some more useful calculations from the Kitanidis text...

bsmurphy, Mar 09 '17 02:03

> I'm thinking about just making the statistics calculations automatic/default (i.e., get rid of the enable_statistics flag that appears inconsistently across classes).

Do we know how long it takes to compute the statistics compared to computing the kriging itself? If it's negligible, we could certainly compute them automatically (though I'm still +1 for having an enable_statistics flag for that, even if it is set to True by default).

rth, Mar 10 '17 20:03

For all the datasets I've worked with, the statistics calculation has been really fast, although I guess I don't actually know how it scales... regardless, if you think a flag would be better, we can go with that. More options is probably better than fewer.

bsmurphy, Mar 10 '17 21:03

Presumably it's O(N^4) time, where N is the number of data points, since matrix inversion is O(N^3) time and the inversion is repeated each time a data point is added, one at a time. The krige itself is just O(N^3) time, so the statistics calculation will become disproportionately expensive with, say, N=10000.
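
To put rough numbers on that estimate, here is a back-of-the-envelope tally in Python. It assumes, purely for illustration, that one solve with k data points costs k^3 abstract units and it ignores constant factors. The function name is made up for this sketch and is not part of PyKrige.

```python
def refit_cost_ratio(n):
    """Ratio of the summed cost of n successive solves to one full solve.

    Assumption (illustrative only): a solve with k data points costs
    k**3 abstract units, with constant factors ignored.
    """
    # Re-solve from scratch after each point is added: 1^3 + 2^3 + ... + n^3.
    repeated_total = sum(k**3 for k in range(1, n + 1))
    # A single krige on all n points.
    single_solve = n**3
    return repeated_total / single_solve


print(refit_cost_ratio(10000))  # grows roughly like n / 4
```

Under that assumption the ratio for N=10000 comes out at about 2500, which is the O(N^4) versus O(N^3) gap in concrete terms.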

I think you can turn the statistics calculation into O(N^3) time by taking advantage of the fact that the matrix being inverted is the same every time, except with an additional column and row. Use the formula for block matrix inversion, with A being the previous matrix for which you already have the inverse. If the operations are carried out in the right order, each updated inverse takes O(N^2) time, so the whole statistics calculation becomes O(N^3) time. Presumably this will make a big difference for N=10000, but in the grand scheme of things, I don't know if this sort of optimization is a meaningful priority.
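
A minimal NumPy sketch of the block-inversion update described above, not PyKrige's actual code. The names grow_inverse and incremental_inverse are hypothetical. The sketch assumes the new row and column are appended at the end of the matrix and that the Schur complement is nonzero; in ordinary kriging the unbiasedness (Lagrange multiplier) row and column would presumably have to be kept at the front, or permuted, for this to apply directly.

```python
import numpy as np


def grow_inverse(a_inv, b, c, d):
    """Inverse of [[A, b], [c, d]] given a_inv = inv(A), in O(n^2) time.

    b is the new column (length n), c the new row (length n), and d the
    new corner entry. Only matrix-vector and outer products are used.
    """
    u = a_inv @ b              # A^-1 b, O(n^2)
    v = c @ a_inv              # c A^-1, O(n^2)
    s = d - c @ u              # scalar Schur complement
    if abs(s) < 1e-12:
        raise np.linalg.LinAlgError("Schur complement is (nearly) zero")
    n = a_inv.shape[0]
    out = np.empty((n + 1, n + 1))
    out[:n, :n] = a_inv + np.outer(u, v) / s
    out[:n, n] = -u / s
    out[n, :n] = -v / s
    out[n, n] = 1.0 / s
    return out


def incremental_inverse(m):
    """Invert m by growing the inverse one row and column at a time.

    Every leading principal submatrix of m must be invertible, which
    holds, for example, when m is symmetric positive definite.
    """
    m = np.asarray(m, dtype=float)
    inv = np.array([[1.0 / m[0, 0]]])
    for k in range(1, m.shape[0]):
        # inv currently holds the inverse of m[:k, :k]; in the statistics
        # loop this is where the k-point system would be used.
        inv = grow_inverse(inv, m[:k, k], m[k, :k], m[k, k])
    return inv
```

Each call to grow_inverse costs O(n^2), so N successive updates cost O(N^3) in total, instead of the O(N^4) of inverting from scratch at every step. Note that explicit inverse updates like this can lose accuracy on ill-conditioned matrices, so the result would be worth checking against np.linalg.inv on real data.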

whdc, Mar 11 '17 22:03

Thanks for the insight, @whdc. Definitely going to become problematic for big datasets (my datasets are always small so I keep forgetting to think about scaling!), so we'll at least keep (or really extend) the switch. I think in @rth's pipeline idea (#56) this would naturally become a separate "brick."

bsmurphy, Mar 16 '17 03:03