
[bugfix, enhancement] Address affinity bug by using threadpoolctl/joblib for n_jobs dispatching

Open · icfaust opened this pull request 9 months ago • 13 comments

Description

This addresses an issue related to Linux thread control via affinity. New tests were added that validate n_jobs when a reduced number of threads is available, as set by the affinity mask. The tests require that 4 threads/CPU cores be made available to the main Python pytest process in order to function properly, meaning they do not run on Azure Pipelines runners, only on Intel CI and GitHub Actions.

This change was needed in order for scikit-learn-intelex to work properly on thread-limited Kubernetes pods.
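The general pattern is sketched below; this is a minimal illustration, not the PR's actual code. On Linux, `os.sched_getaffinity` honors the affinity mask that `os.cpu_count` ignores, and threadpoolctl can cap the native thread pools to match; `effective_n_jobs` here is a hypothetical helper using joblib's negative-n_jobs semantics.

```python
import os

from threadpoolctl import threadpool_limits

# On Linux, len(os.sched_getaffinity(0)) reflects the CPUs usable under the
# current affinity mask (e.g. in a cgroup-limited Kubernetes pod), whereas
# os.cpu_count() reports every CPU on the machine regardless of the mask.
available_cpus = len(os.sched_getaffinity(0))


def effective_n_jobs(n_jobs=None):
    """Resolve n_jobs against the affinity-limited CPU count (joblib semantics)."""
    if n_jobs is None:
        return available_cpus
    if n_jobs < 0:
        # -1 means "all available CPUs", -2 means "all but one", etc.
        return max(available_cpus + 1 + n_jobs, 1)
    return n_jobs


# Cap the native (OpenMP/TBB/BLAS) thread pools for the duration of a call.
with threadpool_limits(limits=effective_n_jobs(-1)):
    pass  # e.g. estimator.fit(X, y)
```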


PR should start as a draft, then move to the ready-for-review state after CI has passed and all applicable checkboxes are checked. This approach ensures that reviewers don't spend extra time asking for regular requirements.

You can remove a checkbox as not applicable only if it doesn't relate to this PR in any way. For example, a PR with a docs update doesn't require the performance checkboxes, while a PR with any change in actual code should keep them and justify how the change is expected to affect performance (or the justification should be self-evident).

Checklist to comply with before moving PR from draft:

PR completeness and readability

  • [ ] I have reviewed my changes thoroughly before submitting this pull request.
  • [ ] I have commented my code, particularly in hard-to-understand areas.
  • [ ] I have updated the documentation to reflect the changes, or created a separate PR with the update and provided its number in the description, if necessary.
  • [ ] Git commit message contains an appropriate signed-off-by string (see CONTRIBUTING.md for details).
  • [ ] I have added the respective label(s) to the PR if I have permission to do so.
  • [ ] I have resolved any merge conflicts that might occur with the base branch.

Testing

  • [ ] I have run it locally and tested the changes extensively.
  • [ ] All CI jobs are green or I have provided justification why they aren't.
  • [ ] I have extended the testing suite if new functionality was introduced in this PR.

Performance

  • [ ] I have measured performance for affected algorithms using scikit-learn_bench and provided at least a summary table with the measured data, if a performance change is expected.
  • [ ] I have provided justification why performance has changed or why changes are not expected.
  • [ ] I have provided justification why quality metrics have changed or why changes are not expected.
  • [ ] I have extended the benchmarking suite and provided a corresponding scikit-learn_bench PR if new measurable functionality was introduced in this PR.

icfaust avatar Mar 17 '25 22:03 icfaust

Codecov Report

:white_check_mark: All modified and coverable lines are covered by tests.

| Flag | Coverage Δ |
|--------|----------------------|
| azure | 80.64% <ø> (-0.09%) :arrow_down: |
| github | 73.21% <ø> (ø) |

Flags with carried forward coverage won't be shown.


codecov[bot] avatar Mar 18 '25 15:03 codecov[bot]

/intelci: run

icfaust avatar Mar 18 '25 22:03 icfaust

/azp run CI

icfaust avatar Mar 19 '25 06:03 icfaust

Azure Pipelines successfully started running 1 pipeline(s).

azure-pipelines[bot] avatar Mar 19 '25 06:03 azure-pipelines[bot]

CI failures are from test_memory_usage.py. They might be just sporadic issues, but consider the numbers it shows:

E           AssertionError: Size of extra allocated memory after using garbage collector is greater than 15.0% of input data
E             	Algorithm: IncrementalBasicStatistics(result_options=['min', 'max', 'sum', 'mean',
E                                                        'variance', 'variation',
E                                                        'sum_squares', 'standard_deviation',
E                                                        'sum_squares_centered',
E                                                        'second_order_raw_moment'])
E             	Input data size: 808000 bytes
E             	Extra allocated memory size: 5297461 bytes / 655.63 %
E           assert 5297461 < (0.15 * 808000)

It looks like this PR could be making the algorithms run with more threads than they were using before. Is that intended?

david-cortes-intel avatar Mar 20 '25 07:03 david-cortes-intel
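For context, the failing check compares memory that survives a garbage-collection pass against the input size. A hypothetical sketch of that style of check follows; this is not the repository's actual test_memory_usage.py implementation, and note that tracemalloc only sees Python-level allocations, which is relevant to the allocator discussion further below.

```python
import gc
import tracemalloc

import numpy as np

X = np.random.rand(101_000, 1)  # float64, 808000 bytes, matching the log above
input_size = X.nbytes

tracemalloc.start()
# ... fit the estimator on X here ...
gc.collect()  # release anything kept alive only by reference cycles
extra_bytes, _peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

assert extra_bytes < 0.15 * input_size, (
    f"Extra allocated memory size: {extra_bytes} bytes / "
    f"{100 * extra_bytes / input_size:.2f} %"
)
```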

> CI failures are from test_memory_usage.py. They might be just sporadic issues, but consider the numbers it shows: [...]
>
> It looks like this PR could be making the algorithms run with more threads than they were using before. Is that intended?

Good question. I will play around with the PR since it's still a draft, but I had thought that I was limiting the number of threads, since this oddity will set the number of threads to 100: https://github.com/uxlfoundation/scikit-learn-intelex/blob/main/sklearnex/tests/test_run_to_run_stability.py#L59. I think this is called at test collection time, meaning it is set before test_memory_usage runs. I'll set it back to 100 and see what happens; if it passes, I'll make a ticket to investigate it.

icfaust avatar Mar 20 '25 12:03 icfaust
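For context on the collection-time point above: module-level statements run when pytest imports a test file during collection, before any test executes. A hypothetical minimal reproduction, assuming a daal4py.daalinit-style thread setter (the actual call on the linked line may differ):

```python
# test_run_to_run_stability.py (hypothetical simplification of the linked line)
import daal4py

# Module-level statements execute when pytest imports this file during
# collection -- before ANY test runs, including tests in other modules.
# A thread-count setting here therefore leaks into test_memory_usage too.
daal4py.daalinit(100)


def test_stability():
    ...  # runs with the 100-thread setting already in place
```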

/intelci: run

icfaust avatar Mar 23 '25 22:03 icfaust

/intelci: run

icfaust avatar Mar 24 '25 10:03 icfaust

This last run is very useful/interesting. It shows failures across:

  1. Linux and Windows
  2. both GitHub Actions and Azure Pipelines
  3. daal4py (logistic regression) and onedal
  4. many different estimators
  5. different input datatypes

I'm very worried about our malloc routines in DAAL. Thoughts, @david-cortes-intel @Vika-F?

icfaust avatar Jun 13 '25 16:06 icfaust

> This last run is very useful/interesting. It shows failures across: Linux and Windows; both GitHub Actions and Azure Pipelines; daal4py (logistic regression) and onedal; many different estimators; different input datatypes.
>
> I'm very worried about our malloc routines in DAAL. Thoughts, @david-cortes-intel @Vika-F?

Perhaps there could be memory leaks, but I'm not sure that we can conclude anything from these tests alone.

I see that some are for cases where the input is from DPCTL - does it set flags like SYCL_PI_LEVEL_ZERO_DISABLE_USM_ALLOCATOR, for example? It might also be a better idea to set PYTHONMALLOC=malloc, and maybe LD_PRELOAD=jemalloc.so on Linux, given the way in which the test is structured.

But what'd be even better is to run them through valgrind and ASan to see whether there are any reports of leaks coming specifically from oneDAL or sklearnex. We unfortunately get a lot of false positives from numpy, scipy, pybind11, and others, which makes it hard to browse the logs; but true leaks should definitely show up there.

david-cortes-intel avatar Jun 16 '25 06:06 david-cortes-intel
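To make the allocator suggestions above concrete, here is a hypothetical way to re-run the failing tests with those overrides; the jemalloc path and test path are assumptions that vary by distro and checkout.

```python
import os
import subprocess

env = dict(os.environ)
# Route CPython's allocations through libc malloc so tools like valgrind
# and jemalloc's profiler can attribute them.
env["PYTHONMALLOC"] = "malloc"
# Assumed jemalloc location; adjust for your distribution.
env["LD_PRELOAD"] = "/usr/lib/x86_64-linux-gnu/libjemalloc.so.2"

subprocess.run(
    ["pytest", "sklearnex/tests/test_memory_usage.py", "-v"],
    env=env,
    check=False,
)
```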

@Mergifyio update

Alexsandruss avatar Sep 05 '25 09:09 Alexsandruss

> update

❌ Pull request can't be updated with the latest base branch changes.

Mergify needs the author's permission to update the base branch of the pull request. @icfaust needs to authorize modifications on the head branch.

mergify[bot] avatar Sep 05 '25 09:09 mergify[bot]

@Alexsandruss I updated, let me know if there is anything I can do to help.

icfaust avatar Sep 08 '25 09:09 icfaust