Adaptation to sklearn
sklearn provides a lot of functionality that we could use to simplify our code:
- Distance matrix calculation: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise_distances.html
- KDTree: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KDTree.html
- Search radius to optimize moving window kriging (see: #57): https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KDTree.html#sklearn.neighbors.KDTree.query_radius
- Nearest neighbors: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KDTree.html#sklearn.neighbors.KDTree.query
Besides that, one can specify the metric in use (see: #120): https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.DistanceMetric.html
| identifier | class name | args | distance function |
|------------|------------|------|-------------------|
| "euclidean" | EuclideanDistance | | sqrt(sum((x - y)^2)) |
| "manhattan" | ManhattanDistance | | sum(\|x - y\|) |
| "chebyshev" | ChebyshevDistance | | max(\|x - y\|) |
| "minkowski" | MinkowskiDistance | p | sum(\|x - y\|^p)^(1/p) |
| "wminkowski" | WMinkowskiDistance | p, w | sum(\|w * (x - y)\|^p)^(1/p) |
| "seuclidean" | SEuclideanDistance | V | sqrt(sum((x - y)^2 / V)) |
| "mahalanobis" | MahalanobisDistance | V or VI | sqrt((x - y)' V^-1 (x - y)) |
Or for geo-coordinates (see: #121):
| identifier | class name | distance function |
|------------|------------|-------------------|
| "haversine" | HaversineDistance | 2 arcsin(sqrt(sin^2(0.5 dx) + cos(x1) cos(x2) sin^2(0.5 dy))) |

(np.deg2rad is needed here, since haversine expects coordinates in radians)
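For illustration, a minimal sketch of how these building blocks could be used (the `coords` array is made up; note that haversine is supported by `BallTree` rather than `KDTree`):

```python
import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.neighbors import BallTree, KDTree

coords = np.random.rand(100, 2)  # (n_samples, 2) coordinates

# full distance matrix with a selectable metric
dist_matrix = pairwise_distances(coords, metric="euclidean")

# KD-tree based neighbor searches
tree = KDTree(coords)
dist_k, idx_k = tree.query(coords, k=8)    # n nearest neighbors
idx_r = tree.query_radius(coords, r=0.1)   # all neighbors within a search radius

# for geo-coordinates: haversine expects (lat, lon) in radians
geo_tree = BallTree(np.deg2rad(coords), metric="haversine")
```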
So this could solve a lot of issues.
Yes, I think that would be a good idea. I'm all for better integration with scikit-learn in general https://github.com/GeoStat-Framework/PyKrige/issues/31
When improving the API for 2.0, making one that's compatible with scikit-learn if possible would be useful https://scikit-learn.org/stable/developers/develop.html#rolling-your-own-estimator
@rth Could you draft a general workflow for the integration with sklearn? I have to learn a bit more about sklearn to get to know the dos and don'ts.
Essentially one could start with minimal functionality along the lines of,
```python
from sklearn.base import BaseEstimator


class OrdinaryKriging(BaseEstimator):
    def __init__(self, ...):
        ...

    def fit(self, X, y):
        ...

    def predict(self, X, return_std=False):
        ...
```
where `X` is an `(n_samples, n_features)` array (with only `n_features = 2` or `3` being supported for now).

Looking at the usage of GaussianProcessRegressor in scikit-learn could be a start. Then one would progressively add features, and from time to time run `check_estimator` to make sure the API is compatible.
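For reference, a minimal sketch of how such a compatibility check could be run (assuming the `OrdinaryKriging` class sketched above):

```python
from sklearn.utils.estimator_checks import check_estimator

# runs scikit-learn's API compliance test suite against the estimator
# and raises an error on the first incompatibility it finds
check_estimator(OrdinaryKriging())
```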
We already have a scikit-learn wrapper class BTW https://github.com/GeoStat-Framework/PyKrige/blob/40a8140daf31be9c8cc44efc8cae2bf8565c0e4a/pykrige/rk.py#L27 , but the idea could be to make this more of a default API.
So we should have a look at: https://github.com/scikit-learn-contrib/project-template
Worth looking into, but some things there are outdated as well.
OK. In the end I am thinking of a single class for kriging where the dimension is just a parameter. And since ordinary kriging is just universal kriging with a constant drift (or simply an unbiased estimator), we could combine all 4 kriging classes that are present at the moment. When providing these in the proposed sklearn way, we could even get rid of the RK class, which sounds nice to me.
You once mentioned that anisotropy and rotation could be provided by pipelines (#138). How does that work?
> OK. In the end I am thinking of a single class for kriging where the dimension is just a parameter. And since ordinary kriging is just universal kriging with a constant drift (or simply an unbiased estimator), we could combine all 4 kriging classes that are present at the moment. When providing these in the proposed sklearn way, we could even get rid of the RK class, which sounds nice to me.
Yeah, the implementation could be a single class for kriging. The dimension could just be determined from the `X.shape[-1]` passed to the `fit` method. Though having `OrdinaryKriging` / `UniversalKriging` aliases still might not hurt (if only so that people find them when searching for those keywords online). One of the current issues is that the classes have too many parameters.
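A rough sketch of what such aliases on top of a single class could look like (the unified `Kriging` class and its `drift` parameter are hypothetical, not existing PyKrige API):

```python
from sklearn.base import BaseEstimator, RegressorMixin


class Kriging(BaseEstimator, RegressorMixin):
    """Hypothetical unified kriging estimator."""

    def __init__(self, drift="linear", variogram_model="spherical"):
        self.drift = drift
        self.variogram_model = variogram_model

    def fit(self, X, y):
        # dimension (2D/3D) is inferred from the data, not a class choice
        self.n_dim_ = X.shape[-1]
        return self


class OrdinaryKriging(Kriging):
    """Alias: ordinary kriging is universal kriging with a constant drift."""

    def __init__(self, variogram_model="spherical"):
        super().__init__(drift="constant", variogram_model=variogram_model)
```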
> You once mentioned that anisotropy and rotation could be provided by pipelines (#138). How does that work?
The idea is that if some of those calculations are stand-alone transformations of coordinates (i.e. `X` of shape `(n_samples, n_features)`, with `features = ['x', 'y']`, converted to an `X_transformed` of the same shape), then this could be a stand-alone pre-processing class, and a pipeline could be built using `make_pipeline` (see the docstring example in the link) from scikit-learn. This makes things more modular. For instance,
```python
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(Preprocessor(), OrdinaryKriging())
pipe.fit(coords, target)
z = pipe.predict(coords_grid)
```
where internally `pipe.fit` would first call `preprocessor.fit`, then `preprocessor.transform` to get the transformed coordinates, and finally `kriging.fit` on these coordinates. Then `pipe.predict` would first call `preprocessor.transform` and then `kriging.predict` on the transformed coordinates.
This logic is built into scikit-learn pipelines, and it's possible to chain multiple processing steps like this, and also to combine estimators from different packages, assuming they follow the scikit-learn API.
I haven't done a detailed analysis to see to what extent this is possible in pykrige. One limitation of scikit-learn pipelines is that the target variable (previously known as `z`) cannot be transformed in pipeline steps. But there are workarounds if that's really a blocker.
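One possible workaround (my suggestion here, not something settled above) is scikit-learn's `TransformedTargetRegressor`, which transforms `y` before fitting and inverse-transforms the predictions, e.g. for a log-transformed target:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor

# wrap the kriging pipeline so that the target is log-transformed before
# fitting and transformed back after predicting
model = TransformedTargetRegressor(
    regressor=pipe, func=np.log1p, inverse_func=np.expm1
)
model.fit(coords, target)
z = model.predict(coords_grid)
```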
One question I have about the pre-processor: what should be fitted? The only thing we want is to transform the coordinates with the given rotation angles and anisotropy ratios (we can't fit these, since we would need to fit a spatial variogram to the data to achieve this, which is quite hard for a sparse data set). So we would only use `transform` with fixed parameters (angles, ratios), right?
> One question I have about the pre-processor: what should be fitted? The only thing we want is to transform the coordinates with the given rotation angles and anisotropy ratios (we can't fit these, since we would need to fit a spatial variogram to the data to achieve this, which is quite hard for a sparse data set)
If `fit` does nothing, that's fine. But it still needs to be there for API compatibility:
```python
from sklearn.base import BaseEstimator, TransformerMixin


class Preprocessor(BaseEstimator, TransformerMixin):  # or give it a better name
    def __init__(self, parameter="some_default"):
        self.parameter = parameter

    def fit(self, X, y=None):
        """This estimator is stateless, fit does nothing."""
        return self

    def transform(self, X):
        # actual transformation here
        return X_transformed
```
The base classes are necessary because:

- `BaseEstimator` marks it as compatible with the scikit-learn API and also defines the `set_params` / `get_params` methods, to set/get the parameters passed in `__init__`,
- `TransformerMixin` marks it as a transformer, and defines

```python
def fit_transform(self, X, y=None):
    return self.fit(X, y).transform(X)
```

which in this case is equivalent to just calling `transform`.
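To make this concrete, here is a sketch of such a stateless pre-processor for 2D rotation and anisotropy (the class name and parameters are made up for illustration, not existing API):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class AnisotropyRotation(BaseEstimator, TransformerMixin):
    """Rotate 2D coordinates by a fixed angle (in radians) and rescale the
    second axis by a fixed anisotropy ratio. Stateless: fit does nothing."""

    def __init__(self, angle=0.0, ratio=1.0):
        self.angle = angle
        self.ratio = ratio

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        c, s = np.cos(self.angle), np.sin(self.angle)
        rot = np.array([[c, -s], [s, c]])  # 2D rotation matrix
        X_t = X @ rot.T
        X_t[:, 1] *= self.ratio  # stretch/compress the perpendicular axis
        return X_t
```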
I was digging through the sklearn code and it seems that you can't use the KDTree directly in Cython (link). Since it would be good to avoid the Python overhead in creating the distance matrix, I would suggest forking the critical parts of `sklearn.neighbors` (link):
- `_binary_tree.pxi` - base class for the KDTree
- `_kd_tree.pyx` - the KDTree itself
- `_dist_metrics.pyx` - the provided distance metrics
- `_dist_metrics.pxd` - header file for the distance metrics
- `_typedefs.pyx` - type definitions
- `_typedefs.pxd` - header file for the type definitions
The only thing that needs to be modified is commenting out `from ..utils import check_array`, and all occurrences of `check_array` in `_binary_tree.pxi` (4 times), since we can take care of the input ourselves (it needs to be an `np.float64` array). Then the 6 files above should be enough. We only need to create a header file `_kd_tree.pxd` (which is not present in sklearn, that's why the effort) to use the KDTree directly in Cython.
Advantages:

- we don't need to create the full distance matrix in Python (which can be huge for big data sets (~10000 points))
- when implementing the moving window, we can then simply:
  - loop over the points in Cython
  - get all nearest neighbors
  - create the kriging matrix directly in Cython
  - solve the kriging equation in Cython (even parallelizable with prange and OpenMP)
ATM the kriging matrix is created in Python and passed to Cython (when doing the full kriging). With n nearest neighbors, a list of points is passed to Cython to cut out the reduced kriging matrix (again, the full matrix is present).
The idea would be to create a companion class to the future kriging class in Cython, that holds all necessary information and does the number-crunching. All we need is the variogram in Cython, which is already present.
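For orientation, a rough NumPy-level sketch of the moving-window idea (the `variogram` model here is a made-up stand-in; the real implementation would run this loop in Cython):

```python
import numpy as np
from sklearn.neighbors import KDTree


def variogram(h):
    """Hypothetical variogram model (pykrige has real ones)."""
    return h  # linear model with unit slope, zero nugget


def moving_window_kriging(coords, values, grid, radius):
    """Ordinary kriging where each grid point only uses the data points
    inside its search radius (the moving window)."""
    tree = KDTree(coords)
    windows = tree.query_radius(grid, r=radius)
    estimate = np.full(len(grid), np.nan)
    for i, idx in enumerate(windows):
        if len(idx) == 0:
            continue  # no data inside this window
        pts, vals = coords[idx], values[idx]
        n = len(idx)
        # ordinary kriging system: variogram matrix bordered by the
        # unbiasedness constraint (weights sum to one)
        h = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
        A = np.empty((n + 1, n + 1))
        A[:n, :n] = variogram(h)
        A[n, :n] = A[:n, n] = 1.0
        A[n, n] = 0.0
        b = np.empty(n + 1)
        b[:n] = variogram(np.linalg.norm(pts - grid[i], axis=-1))
        b[n] = 1.0
        w = np.linalg.solve(A, b)  # kriging weights + Lagrange multiplier
        estimate[i] = w[:n] @ vals
    return estimate
```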
We could also cut down the functionality of the KDTree to only provide:

- `query_radius` - for the moving window
- `query` - for n nearest neighbors (which is provided by pykrige ATM)

So we can keep the external code base at a minimum.
@rth, @bsmurphy what do you think?
Do you think there is no way of doing this without using KDTree/BallTree from Cython? E.g. computing the calculation in small batches or using sparse distance matrices. See e.g. `KNeighborsTransformer` and the Approximate nearest neighbors example.
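For illustration, a sparse K-nearest-neighbors distance matrix can be built entirely through scikit-learn's public API (a sketch with made-up sizes):

```python
import numpy as np
from sklearn.neighbors import KNeighborsTransformer

coords = np.random.rand(10000, 2)

# sparse CSR matrix holding, for each point, only the distances to its
# 8 nearest neighbors -- instead of the dense (10000, 10000) matrix
transformer = KNeighborsTransformer(n_neighbors=8, mode="distance")
dist_sparse = transformer.fit_transform(coords)
```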
> I would suggest forking the critical parts of sklearn.neighbors
That code is legacy and quite convoluted (cf. e.g. https://github.com/scikit-learn/scikit-learn/pull/4217). I would not recommend forking (and maintaining) it unless there is no way around it.
> is commenting out `from ..utils import check_array`, and all occurrences of `check_array` in `_binary_tree.pxi` (4 times), since we can take care of the input ourselves (needs to be an `np.float64` array).
There is a SLEP 10 to replace `check_array` with a `_validate_data` method in scikit-learn (which can be overwritten in contrib projects). It was partially merged in https://github.com/scikit-learn/scikit-learn/pull/16112 and will be available in the next release (in ~1 month or so). If some estimators are not yet done, it should be fairly easy to add them.
If we can't use the KDTree in Cython, the loop over the grid points needs to be done in Python, or the full distance matrix needs to be constructed and passed to Cython to do the loop there. In the first case we get a massive overhead in time, in the second case a massive overhead in memory consumption.
It seems that the sklearn team doesn't want to expose the KDTree to Cython, since they think it is too much of a maintenance overhead.
I'll think about it.
> or the full distance matrix needs to be constructed and passed to Cython to do the loop there [...] in the second case a massive overhead in memory consumption.
Not if we use a sparse distance matrix (that only stores the K nearest neighbours for each point).
> It seems that the sklearn team doesn't want to expose the KDTree to Cython, since they think it is too much of a maintenance overhead.
You could try opening an issue about it there, but I do anticipate mostly negative feedback indeed due to maintenance costs.
If we use only K nearest neighbors, we still can't provide a search radius for the moving window. K nearest neighbors always depend on the density of the input data.
> If we use only K nearest neighbors, we still can't provide a search radius for the moving window. K nearest neighbors always depend on the density of the input data.
I think it should also be possible to create a sparse distance matrix for neighbors within a given radius: i.e. something like `KNeighborsTransformer`, but one that uses `radius_neighbors_graph` internally instead of `kneighbors_graph` from `NearestNeighbors`. But I guess not both of those at the same time.
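As a sketch of that radius-based variant, the public `radius_neighbors_graph` helper already builds such a sparse matrix directly (the radius value is made up):

```python
import numpy as np
from sklearn.neighbors import radius_neighbors_graph

coords = np.random.rand(10000, 2)

# sparse matrix holding, for each point, the distances to all points
# within the search radius -- i.e. exactly the moving-window neighborhoods
graph = radius_neighbors_graph(coords, radius=0.05, mode="distance")
first_window = graph[0]  # neighbors of the first point, as a sparse row
```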