cs-ranking icon indicating copy to clipboard operation
cs-ranking copied to clipboard

[WIP] Make our estimators compatible with scikit-learn

Open timokau opened this issue 4 years ago • 17 comments

Description

This is a work-in-progress of fixing https://github.com/kiudee/cs-ranking/issues/94. See that issue for motivation and context.

Currently the one added test is failing since FETADiscreteChoice does not yet conform to the interface. The commit is mostly just to get the PR started. I will incrementally fix the estimators and add them to the list of tested estimators.

Collecting and testing all of our estimators will be easier after #115 is done.

How Has This Been Tested?

Does this close/impact existing issues?

Fixes #94.

Types of changes

  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [ ] New feature (non-breaking change which adds functionality)
  • [x] Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • [ ] My code follows the code style of this project.
  • [ ] My change requires a change to the documentation.
  • [ ] I have updated the documentation accordingly.
  • [x] I have added tests to cover my changes.

timokau avatar Apr 30 '20 22:04 timokau

Currently our FETANetwork base class requires the mandatory n_objects and n_object_features parameters. According to sklearns guidelines, those should be passed to fit instead. That effectively means that we should only construct the network on fit too. That will probably require quite some refactoring.

timokau avatar Apr 30 '20 22:04 timokau

Codecov Report

Merging #116 into master will increase coverage by 0.30%. The diff coverage is 0.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #116      +/-   ##
==========================================
+ Coverage   57.04%   57.35%   +0.30%     
==========================================
  Files         113      114       +1     
  Lines        6560     7027     +467     
==========================================
+ Hits         3742     4030     +288     
- Misses       2818     2997     +179     
Impacted Files Coverage Δ
csrank/tests/test_estimators.py 0.00% <0.00%> (ø)
csrank/discretechoice/ranknet_discrete_choice.py 83.78% <0.00%> (-16.22%) :arrow_down:
csrank/objectranking/baseline.py 55.00% <0.00%> (-14.24%) :arrow_down:
csrank/discretechoice/baseline.py 55.00% <0.00%> (-14.24%) :arrow_down:
csrank/discretechoice/cmpnet_discrete_choice.py 86.11% <0.00%> (-13.89%) :arrow_down:
csrank/objectranking/cmp_net.py 86.48% <0.00%> (-13.52%) :arrow_down:
csrank/objectranking/rank_net.py 86.48% <0.00%> (-13.52%) :arrow_down:
csrank/choicefunction/ranknet_choice.py 80.85% <0.00%> (-12.01%) :arrow_down:
csrank/core/cmpnet_core.py 87.50% <0.00%> (-10.21%) :arrow_down:
csrank/core/ranknet_core.py 87.50% <0.00%> (-10.21%) :arrow_down:
... and 62 more

codecov[bot] avatar Apr 30 '20 22:04 codecov[bot]

I've adapted FETALinear's init. I'll have to do the same for our other cores and estimators and then take care of the other parts of the scikit-learn estimator interface.

timokau avatar May 02 '20 15:05 timokau

This is turning out to be a lot more effort than expected. To make it somewhat manageable and reviewable I'll proceed as follows:

  • Gradually make all estimators pass the "default constructible" test. There is some highly repetitive work to be done here. The first part is moving all random state validation out of __init__. I'll split that repetitive work into separate PRs for easier review and faster merge.
  • Create a new test that calls check_estimator but blacklists all currently failing sub-checks.
  • Gradually fix the failing sub-checks, save for some which are inherently incompatible with our library

timokau avatar May 12 '20 14:05 timokau

The first part is ready for review: https://github.com/kiudee/cs-ranking/pull/117

timokau avatar May 12 '20 15:05 timokau

I have rebased this on top of #118. After #118, the biggest remaining blocker for the default-constructible is the optimizer parameter of our estimators. See #119 for that.

timokau avatar May 22 '20 19:05 timokau

Rebased, now that #118 is merged.

timokau avatar May 26 '20 13:05 timokau

I have continued work on top of #119 to avoid merge conflicts. We're getting very close to passing the default_constructible test! Besides https://github.com/scikit-learn/scikit-learn/issues/17756, with the fixes in this PR and #119 the only remaining failure is due to the kernel_regularizer parameter. We'll need to give it the same treatment as optimizer.

I already started by removing all default values for the regularizer parameters (https://github.com/kiudee/cs-ranking/pull/116/commits/82a2869dab2ef968c1673a69f9ecee73fa3a8299). @kiudee do you think that is the right call for the regularizers as well?

I had to patch scikit-learn to get more useful/actionable error messages in the test. I'll try to upstream those improvements once https://github.com/scikit-learn/scikit-learn/issues/17756 is resolved.

timokau avatar Jun 27 '20 17:06 timokau

All our estimators are default constructible now and the test passes with https://github.com/scikit-learn/scikit-learn/pull/17936 :tada:

That means that we cannot add the test before the PR is merged and a new sklearn release is out though.

timokau avatar Jul 16 '20 17:07 timokau

The PR was merged upstream. I'm not sure if it'll make it into 0.23.3 or if we have to wait for 0.24. Either way, there should be a stable version with the necessary patch at some point.

timokau avatar Aug 07 '20 18:08 timokau

After #154 and #155, this will be the status of the "no attributes in init" check:

AssertionError: Estimator <class 'csrank.choicefunction.cmpnet_choice.CmpNetChoiceFunction'> should not set any attribute apart from parameters during init. Found attributes ['threshold'].
AssertionError: Estimator <class 'csrank.choicefunction.fate_choice.FATEChoiceFunction'> should not set any attribute apart from parameters during init. Found attributes ['is_variadic', 'joint_layers', 'kernel_regularizer_', 'optimizer_', 'scorer', 'set_layer', 'threshold'].
AssertionError: Estimator <class 'csrank.choicefunction.fatelinear_choice.FATELinearChoiceFunction'> should not set any attribute apart from parameters during init. Found attributes ['bias1', 'bias2', 'current_lr', 'optimizer', 'weight1', 'weight2'].
AssertionError: Estimator <class 'csrank.choicefunction.feta_choice.FETAChoiceFunction'> should not set any attribute apart from parameters during init. Found attributes ['_pairwise_model', '_zero_order_model', 'threshold'].
AssertionError: Estimator <class 'csrank.choicefunction.fetalinear_choice.FETALinearChoiceFunction'> should not set any attribute apart from parameters during init. Found attributes ['W_last', 'bias1', 'bias2', 'current_lr', 'optimizer', 'w_out', 'weight1', 'weight2'].
AssertionError: Estimator <class 'csrank.choicefunction.generalized_linear_model.GeneralizedLinearModel'> should not set any attribute apart from parameters during init. Found attributes ['Xt', 'Yt', 'p', 'threshold', 'trace', 'trace_vi'].
AssertionError: Estimator <class 'csrank.choicefunction.pairwise_choice.PairwiseSVMChoiceFunction'> should not set any attribute apart from parameters during init. Found attributes ['threshold', 'weights'].
AssertionError: Estimator <class 'csrank.choicefunction.ranknet_choice.RankNetChoiceFunction'> should not set any attribute apart from parameters during init. Found attributes ['_scoring_model', 'threshold'].
AssertionError: Estimator <class 'csrank.discretechoice.fate_discrete_choice.FATEDiscreteChoiceFunction'> should not set any attribute apart from parameters during init. Found attributes ['is_variadic', 'joint_layers', 'kernel_regularizer_', 'optimizer_', 'scorer', 'set_layer'].
AssertionError: Estimator <class 'csrank.discretechoice.fatelinear_discrete_choice.FATELinearDiscreteChoiceFunction'> should not set any attribute apart from parameters during init. Found attributes ['bias1', 'bias2', 'current_lr', 'optimizer', 'weight1', 'weight2'].
AssertionError: Estimator <class 'csrank.discretechoice.feta_discrete_choice.FETADiscreteChoiceFunction'> should not set any attribute apart from parameters during init. Found attributes ['_pairwise_model', '_zero_order_model'].
AssertionError: Estimator <class 'csrank.discretechoice.fetalinear_discrete_choice.FETALinearDiscreteChoiceFunction'> should not set any attribute apart from parameters during init. Found attributes ['W_last', 'bias1', 'bias2', 'current_lr', 'optimizer', 'w_out', 'weight1', 'weight2'].
AssertionError: Estimator <class 'csrank.discretechoice.generalized_nested_logit.GeneralizedNestedLogitModel'> should not set any attribute apart from parameters during init. Found attributes ['Xt', 'Yt', '_config', 'p', 'threshold', 'trace', 'trace_vi'].
AssertionError: Estimator <class 'csrank.discretechoice.mixed_logit_model.MixedLogitModel'> should not set any attribute apart from parameters during init. Found attributes ['Xt', 'Yt', '_config', 'p', 'trace', 'trace_vi'].
AssertionError: Estimator <class 'csrank.discretechoice.multinomial_logit_model.MultinomialLogitModel'> should not set any attribute apart from parameters during init. Found attributes ['Xt', 'Yt', '_config', 'p', 'trace', 'trace_vi'].
AssertionError: Estimator <class 'csrank.discretechoice.nested_logit_model.NestedLogitModel'> should not set any attribute apart from parameters during init. Found attributes ['Xt', 'Yt', '_config', 'cluster_model', 'features_nests', 'p', 'threshold', 'trace', 'trace_vi', 'y_nests'].
AssertionError: Estimator <class 'csrank.discretechoice.paired_combinatorial_logit.PairedCombinatorialLogit'> should not set any attribute apart from parameters during init. Found attributes ['Xt', 'Yt', '_config', 'p', 'trace', 'trace_vi'].
AssertionError: Estimator <class 'csrank.discretechoice.pairwise_discrete_choice.PairwiseSVMDiscreteChoiceFunction'> should not set any attribute apart from parameters during init. Found attributes ['weights'].
AssertionError: Estimator <class 'csrank.discretechoice.ranknet_discrete_choice.RankNetDiscreteChoiceFunction'> should not set any attribute apart from parameters during init. Found attributes ['_scoring_model'].
AssertionError: Estimator <class 'csrank.objectranking.expected_rank_regression.ExpectedRankRegression'> should not set any attribute apart from parameters during init. Found attributes ['weights'].
AssertionError: Estimator <class 'csrank.objectranking.fatelinear_object_ranker.FATELinearObjectRanker'> should not set any attribute apart from parameters during init. Found attributes ['bias1', 'bias2', 'current_lr', 'optimizer', 'weight1', 'weight2'].
AssertionError: Estimator <class 'csrank.objectranking.fate_object_ranker.FATEObjectRanker'> should not set any attribute apart from parameters during init. Found attributes ['is_variadic', 'joint_layers', 'kernel_regularizer_', 'optimizer_', 'scorer', 'set_layer'].
AssertionError: Estimator <class 'csrank.objectranking.fetalinear_object_ranker.FETALinearObjectRanker'> should not set any attribute apart from parameters during init. Found attributes ['W_last', 'bias1', 'bias2', 'current_lr', 'optimizer', 'w_out', 'weight1', 'weight2'].
AssertionError: Estimator <class 'csrank.objectranking.feta_object_ranker.FETAObjectRanker'> should not set any attribute apart from parameters during init. Found attributes ['_pairwise_model', '_zero_order_model'].
AssertionError: Estimator <class 'csrank.objectranking.list_net.ListNet'> should not set any attribute apart from parameters during init. Found attributes ['_scoring_model'].
AssertionError: Estimator <class 'csrank.objectranking.rank_net.RankNet'> should not set any attribute apart from parameters during init. Found attributes ['_scoring_model'].
AssertionError: Estimator <class 'csrank.objectranking.rank_svm.RankSVM'> should not set any attribute apart from parameters during init. Found attributes ['weights'].

timokau avatar Aug 28 '20 15:08 timokau

Good news and bad news.

Good news: After #159, we are now compliant with "no attributes in init" :tada:

Bad news: Basically all other checks are unusable as long as our fit function doesn't match the data dimensionality that scikit-learn assumes. I have updated this PR and blacklisted all the checks that do not work. So we either have to think again about how we can make our fit function fit (pun intended) or ignore all those checks.

timokau avatar Sep 17 '20 12:09 timokau

Just to re-state the problem (we talked about this in some other PR/issue, but I can't remember which):

According to scikit-learn, fit should expect an X of the shape (n_samples, n_features). However, for us every sample consists of a raking which has multiple objects. Therefore, X has the shape (n_instances, n_objects, n_features).

It should be possible to flatten each ranking instance into a single feature array to please scikit-learn. We could write functions to transform between the two formats. The question is if that is what we want.

CC @kiudee @prithagupta

timokau avatar Sep 17 '20 12:09 timokau

Good news: After #159, we are now compliant with "no attributes in init" 🎉

🥳

Bad news: Basically all other checks are unusable as long as our fit function doesn't match the data dimensionality that scikit-learn assumes. I have updated this PR and blacklisted all the checks that do not work. So we either have to think again about how we can make our fit function fit (pun intended) or ignore all those checks.

I think that (writing converters and having flat format be default) would compromise a bit too much. Our setting is a different one and it would not make sense to press it into that format. It would only buy us compatibility with a few preprocessors (let me know if I forgot an important use case), which for our setting likely do not make sense. Anything important we lose?

kiudee avatar Sep 17 '20 14:09 kiudee

I tend to agree. I will look into writing a wrapper anyway, which will at least enable us to make use of the remaining checks.

timokau avatar Sep 18 '20 11:09 timokau

Okay, I have done that. Just for rankers for now, I'll fix the checks there first (of course they don't all pass). I already added some little fixes to this PR. The checks are very slow though, so we'll probably not run them with the regular testsuite.

timokau avatar Sep 18 '20 15:09 timokau

Now that we use poetry & nix, I have included the environment with the patched scikit-learn that is necessary to run these tests. A simple nix-shell will load it (though it will require some build time). It includes a backport of https://github.com/scikit-learn/scikit-learn/pull/17936.

timokau avatar Oct 01 '20 14:10 timokau