scikit-learn-intelex

[enhancement] enable array_api return values from ```from_table```

Open icfaust opened this issue 8 months ago • 7 comments

Description

** Dependent on #2430 **

Leaning into the functional style of development for from_table, it is generalized to work with the expected major frameworks (dpctl, dpnp, and various flavors of array_api arrays). This is done by changing the from_table API to depend not on the kwargs sua_iface and sycl_queue but on a single keyword, like, which can be either an array type or a callable that returns the expected array type. The returned array preserves the aspects of the data (shape, type, row- or column-major contiguity, etc.) and, in most circumstances, the device. A major exception is CPU sycl_usm allocations, which require a manual memcpy to a host (CPU) allocation.
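A minimal usage sketch of the interface described above, assuming the keyword lands as from_table(table, like=...) and that to_table/from_table stay importable from onedal.datatypes:

```python
import numpy as np

from onedal.datatypes import from_table, to_table

x = np.arange(12, dtype=np.float64).reshape(3, 4)
table = to_table(x)

# pass an array of the desired return type ...
x_back = from_table(table, like=x)

# ... or pass a callable that constructs the expected array type
x_back2 = from_table(table, like=np.asarray)
```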

Follow-on work must apply the use of like consistently alongside the roll-out of raw_inputs in #2153, as some estimators force values to the return type and others do not. Since this is inconsistently applied, it must be fixed for every array_api zero-copy enabled estimator.

Changes:

  • Switched the oneDAL table __sycl_usm_array_interface__ attribute to throw an AttributeError when a queue is unavailable, rather than setting it to None. The attribute should only be available when a USM allocation is made, which is not the case for CPU data coming out of oneDAL. This check is placed early in the attribute code so that it can be used to check for sycl usm conversion with hasattr
  • Created the return_type_constructor function to generalize oneDAL table-to-array construction (see the sketch after this list)
  • Added and fixed documentation in _data_conversion.py
  • Modified functions to use the like keyword instead of the sua_iface and sycl_queue keywords
  • Removed the previously required imports of dpnp and dpctl/dpctl.tensor from this file via the generalization in return_type_constructor
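A hedged sketch of how the items above might compose; return_type_constructor and its module are taken from the description, while the sample data, the hasattr check, and passing the constructor as like are illustrative assumptions:

```python
import numpy as np

from onedal.datatypes import from_table, to_table
# assumed location, based on the _data_conversion.py changes listed above
from onedal.datatypes._data_conversion import return_type_constructor

x = np.ones((4, 3), dtype=np.float32)
table = to_table(x)

# CPU tables no longer expose __sycl_usm_array_interface__ (AttributeError
# instead of None), so a plain hasattr() check identifies USM-backed tables
assert not hasattr(table, "__sycl_usm_array_interface__")

# capture the output type once, then reuse it for table conversions
to_array = return_type_constructor(x)
x_back = from_table(table, like=to_array)
```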

NOTES: The initial support version of from_table is not general enough and is heavily tailored to sycl usm inputs. This changes the interface to mirror the type (and possibly the queue) of the data. Getting this integrated will begin the fight against the technical debt introduced with use_raw_inputs. The core of this PR is simple; the changes associated with that technical debt will be brutal.

Incremental algorithms will first collect and store the return-type generator function via return_type_constructor, similar to how the queue must be stored, and will apply it at the end, in the finalization step. Serializing/deserializing this function isn't possible, so it should be handled similarly to the queue.
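A hedged sketch of that pattern; the class and attribute names below are illustrative stand-ins, not the actual onedal incremental estimators:

```python
import numpy as np


class _IncrementalSketch:
    """Illustration: store the return-type constructor like the queue and
    apply it only at finalization."""

    def __init__(self):
        self._outtype = None        # return-type constructor (not serializable)
        self._partial_sum = None

    def partial_fit(self, X):
        if self._outtype is None:
            # stand-in for return_type_constructor(X), captured on the first batch
            self._outtype = np.asarray
        batch = np.asarray(X).sum(axis=0)
        self._partial_sum = batch if self._partial_sum is None else self._partial_sum + batch
        return self

    def finalize_fit(self):
        # apply the stored constructor only at the finalization step
        self.sum_ = self._outtype(self._partial_sum)
        return self

    def __getstate__(self):
        # like the queue, the constructor is dropped on serialization
        state = self.__dict__.copy()
        state["_outtype"] = None
        return state
```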

If we want to type hint these changes, I need help; I'm not sure how to express the specifics of an array type in the hints.

I don't think performance benchmarks are necessary, but I can provide them upon request.


PR should start as a draft, then move to ready for review state after CI is passed and all applicable checkboxes are closed. This approach ensures that reviewers don't spend extra time asking for regular requirements.

You can remove a checkbox as not applicable only if it doesn't relate to this PR in any way. For example, a PR with a docs update doesn't require performance checkboxes, while a PR with any change to actual code should keep them and justify how the change is expected to affect performance (or the justification should be self-evident).

Checklist to comply with before moving PR from draft:

PR completeness and readability

  • [x] I have reviewed my changes thoroughly before submitting this pull request.
  • [x] I have commented my code, particularly in hard-to-understand areas.
  • [x] I have updated the documentation to reflect the changes or created a separate PR with update and provided its number in the description, if necessary.
  • [x] Git commit message contains an appropriate signed-off-by string (see CONTRIBUTING.md for details).
  • [x] I have added a respective label(s) to PR if I have a permission for that.
  • [x] I have resolved any merge conflicts that might occur with the base branch.

Testing

  • [x] I have run it locally and tested the changes extensively.
  • [x] All CI jobs are green or I have provided justification why they aren't.
  • [x] I have extended testing suite if new functionality was introduced in this PR.

icfaust avatar Apr 17 '25 18:04 icfaust

/intelci: run

icfaust avatar Apr 20 '25 20:04 icfaust

Codecov Report

Attention: Patch coverage is 76.92308% with 15 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
onedal/datatypes/_data_conversion.py 74.07% 5 Missing and 2 partials :warning:
onedal/linear_model/logistic_regression.py 0.00% 4 Missing :warning:
onedal/datatypes/sycl_usm/data_conversion.cpp 40.00% 0 Missing and 3 partials :warning:
onedal/datatypes/table.cpp 0.00% 0 Missing and 1 partial :warning:
Flag Coverage Δ
azure 79.87% <72.88%> (-0.06%) :arrow_down:
github 72.97% <76.92%> (-0.07%) :arrow_down:

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
onedal/cluster/dbscan.py 86.66% <100.00%> (ø)
onedal/covariance/covariance.py 82.22% <100.00%> (ø)
onedal/datatypes/__init__.py 100.00% <100.00%> (ø)
onedal/ensemble/forest.py 72.63% <100.00%> (ø)
onedal/linear_model/incremental_linear_model.py 92.10% <100.00%> (+0.51%) :arrow_up:
onedal/linear_model/linear_model.py 83.58% <100.00%> (ø)
onedal/datatypes/table.cpp 51.92% <0.00%> (ø)
onedal/datatypes/sycl_usm/data_conversion.cpp 33.33% <40.00%> (-7.37%) :arrow_down:
onedal/linear_model/logistic_regression.py 28.46% <0.00%> (+0.20%) :arrow_up:
onedal/datatypes/_data_conversion.py 81.08% <74.07%> (-11.23%) :arrow_down:

... and 2 files with indirect coverage changes

codecov[bot] avatar Apr 20 '25 20:04 codecov[bot]

/intelci: run

icfaust avatar Apr 22 '25 12:04 icfaust

Fixes must be made to spmd testing :(

icfaust avatar Apr 23 '25 07:04 icfaust

This has implications for array-api-compat. I need to rethink this strategy.

icfaust avatar Apr 23 '25 15:04 icfaust

The situation is as follows. I didn't look thoroughly enough into the reasons behind array-api-compat and its inclusion in sklearn. Even if we fully support array_api inputs, it will not mean we support pytorch out of the box. Pytorch support via the array API comes from obtaining the namespace generated by array-api-compat, meaning that manually searching for array_namespace won't work (which is why sklearn has get_namespace to compensate for that). Thus we either need to re-invent get_namespace in onedal, acquire array namespaces from sklearn's get_namespace and pass them around as we do for queues (not advised), or think of another way of doing it. The current logic is insufficient for pytorch support.
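To make the gap concrete, a small sketch of the protocol-attribute lookup versus array-api-compat (torch and array-api-compat are optional here and used purely for illustration):

```python
import array_api_compat
import numpy as np
import torch

t = torch.ones((2, 2))

# torch tensors do not expose the array API entry point directly (at the
# time of writing), so a plain attribute lookup fails ...
print(hasattr(t, "__array_namespace__"))    # False

# ... while array-api-compat, which sklearn's get_namespace builds on,
# resolves a compatible namespace for them
xp = array_api_compat.array_namespace(t)
y = xp.asarray(np.ones((2, 2)))             # behaves like an array API module
```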

icfaust avatar Apr 23 '25 15:04 icfaust

I'm going to have to modify the ingestion scripts (to_table) for non-contiguous inputs and dtype changes; we have a special-case trigger for this in sycl_usm for dpnp, which we will also have to add similarly for pytorch...

icfaust avatar Apr 23 '25 15:04 icfaust

The situation is as follows. I didn't look thoroughly enough into the reasons behind array-api-compat and its inclusion in sklearn. Even if we fully support array_api inputs, it will not mean we support pytorch out of the box. Pytorch support via the array API comes from obtaining the namespace generated by array-api-compat, meaning that manually searching for array_namespace won't work (which is why sklearn has get_namespace to compensate for that). Thus we either need to re-invent get_namespace in onedal, acquire array namespaces from sklearn's get_namespace and pass them around as we do for queues (not advised), or think of another way of doing it. The current logic is insufficient for pytorch support.

I ended up doing a final catch for types with a lazy load of array_api_compat, which may or may not be installed with sklearn. If someone tries to use sklearnex with other array types (like pytorch), then they will have installed array-api-compat anyway to interface with sklearn. If someone uses the oneDAL estimators directly, then they will need the warning for pytorch in case the sklearn install doesn't exist.
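A minimal sketch of such a lazy-loaded final catch; the function name and error text are hypothetical, not the actual code in _data_conversion.py:

```python
def _fallback_namespace(array):
    """Last-resort namespace lookup for array types that are not special-cased."""
    try:
        import array_api_compat  # lazy import: only needed for generic array API types
    except ImportError as exc:
        raise TypeError(
            f"cannot determine a conversion for {type(array)}; install "
            "array-api-compat (pulled in with sklearn's array API support) "
            "to enable generic array API handling"
        ) from exc
    return array_api_compat.array_namespace(array)
```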

icfaust avatar Jun 27 '25 12:06 icfaust

/intelci: run

icfaust avatar Jun 27 '25 18:06 icfaust

/intelci: run

icfaust avatar Jun 27 '25 21:06 icfaust

/intelci: run

icfaust avatar Jun 28 '25 12:06 icfaust

/intelci: run

icfaust avatar Jun 28 '25 22:06 icfaust

/intelci: run

icfaust avatar Jun 29 '25 21:06 icfaust

@icfaust What happens if the 'X' and 'y' inputs are of different types, or if one is on device and another on host?

david-cortes-intel avatar Jun 30 '25 09:06 david-cortes-intel

@icfaust What happens if the 'X' and 'y' inputs are of different types, or if one is on device and another on host?

Good question; this goes to some underlying open questions about the onedal module estimators and their direct use. In sklearnex there are guards for this associated with dispatch(), which can be seen / are updated in #2569 (see _sycl_queue_manager.from_data). However, using the onedal estimators directly (which is the case for spmd, as that code is currently completely unregulated) disregards finiteness, shape, input type, verification of the input queue, etc. In sklearnex these are handled by validate_data with some additional constraints, as previously described.

With respect to the second question: if the return data is on GPU, it will try to return GPU data to the array constructor. If the input data came with a CPU sycl queue, it will attempt to copy the CPU queue from the input data (only pertinent for array types with a __sycl_usm_array_interface__ attribute).

Hope this helps.

icfaust avatar Jun 30 '25 12:06 icfaust

@icfaust Could you please add tests having (a) 'X' as numpy and 'y' as GPU array; (b) 'X' as GPU array and 'y' as numpy; (c) 'X' as sycl host array on torch and 'y' as sycl device array on torch.

I'm not sure if this is the most suitable PR though.

david-cortes-intel avatar Jun 30 '25 13:06 david-cortes-intel

@icfaust Could you please add tests having (a) 'X' as numpy and 'y' as GPU array; (b) 'X' as GPU array and 'y' as numpy; (c) 'X' as sycl host array on torch and 'y' as sycl device array on torch.

I'm not sure if this is the most suitable PR though.

Thank you for the suggestion, I will add testing to that effect in #2569

EDIT: this mixing of types will only be problematic for array_api_dispatch=True, but it should be caught via the proper use of get_namespace. Likely a new design rule must be added that get_namespace must be run over all args, as at the moment it occurs independently for each estimator and method. @david-cortes-intel
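A hedged skeleton of the kind of test cases suggested above; the estimator call is left as a placeholder, and dpnp plus GPU device selection are assumptions about the test environment:

```python
import numpy as np
import pytest


@pytest.mark.parametrize("mix", ["numpy_X_gpu_y", "gpu_X_numpy_y"])
def test_mixed_input_types(mix):
    dpnp = pytest.importorskip("dpnp")
    X_np, y_np = np.random.rand(20, 4), np.random.rand(20)
    if mix == "numpy_X_gpu_y":
        X, y = X_np, dpnp.asarray(y_np, device="gpu")
    else:
        X, y = dpnp.asarray(X_np, device="gpu"), y_np
    # placeholder: fit an estimator with array_api_dispatch enabled and
    # assert either consistent handling of the mix or a clear error
    ...
```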

icfaust avatar Jul 01 '25 06:07 icfaust

/intelci: run

icfaust avatar Jul 07 '25 05:07 icfaust

/intelci: run

icfaust avatar Jul 07 '25 22:07 icfaust

/intelci: run

icfaust avatar Jul 08 '25 07:07 icfaust