scikit-learn-intelex icon indicating copy to clipboard operation
scikit-learn-intelex copied to clipboard

[testing] Expand sklearnex testing to all methods taking "X" or "y" as input

Open icfaust opened this issue 8 months ago • 21 comments

Description

All testing under the sklearnex/tests folder use centralized estimator and method lists which were previously limited to methods based off of the various inherited sklearn Mixins. This tested all 'core' functionality like 'fit', 'predict', 'transform', etc. which are directly patched out by oneDAL functions. Some next level methods (those which use these core functionality with a small number of additional calculations) like score were included. However, this list was incomplete, as some estimators (like PCA and KNeighbors) implement special additional methods like inverse_transform and kneighbors_graph, which require special inputs and/or outputs. Up until now, these methods have not been tested using the centralized testing suite (and may have not been tested at all). This PR dynamically checks and collects all methods to all patched estimators which contain "X" or "y" in the input into testing, increasing flexibility and reducing maintenance.

The additional code comes from having to fix issues related to taking in dpnp and dptcl inputs since they were previously untested.

Rather than using _device_offload.dispatch which is code heavy, cumbersome and with significant overhead, a limited use of get_namespace was added to these failing methods which provide the necessary pre- or post-processing of 'core' oneDAL methods.

With these additions, dpnp and dpctl inputs are almost fully covered in the codebase. One deselection was made in radius_neighbors and radius_neighbors_graph in the NearestNeighbors algorithm, as these would require significant work (i.e. the implementation of a new estimator) in order to properly support dpnp/dpctl inputs (unless we knowingly only allow CPU computation).

The listed changes are as follows:

  • rename ensemble algorithms check_sample_weight method to _check_sample_weight to make it a private method
  • Change IncrementalEmpiricalCovariance mahalanobis and score methods to accept dpnp/dpctl inputs (score was previously missed as this estimator has no Mixins)
  • Provide error to NearestNeighbors algorithm for use of radius_neighbors and radius_neighbors_graph for non-numpy/pandas inputs, which are unsupported.
  • Remove radius_neighbors from KNeighborsRegressor and KNeighorsClassifier, as they do not support it from sklearn, and it was purely using sklearn functionality
  • Fix kneighbors_graph method of NearestNeighbors, KNeighborsRegressor, and KNeighborsClassifier to take in dpnp/dpctl inputs, and return a scipy.csr_matrix (removed wrap output data, etc.)
  • Disable score_samples and score methods in stability testing for PCA (will add tickets to address these issues).
  • Introduce PCA.inverse_transform using get_namespace to insure components_ and mean_ are on the same device as X

NOTE: current private CI fails for PCA in the inverse_transform method in dpnp testing, this is due to an out-of-date version of dpnp, and is currently getting addressed. Private CI will be re-run as soon as it is updated.

icfaust avatar Jun 13 '24 05:06 icfaust