metric-learn
[MRG] Bilinear similarity
Hi!
With the aim of implementing the OASIS algorithm, I'm opening this PR to discuss the implementation of the `BilinearMixin` class, which is a requirement for the algorithm.
I've written some tests, but in a file outside the test folder, as I am not familiar with the correct way to test things out. These are basic tests of the bilinear similarity with random and handmade toy examples.
As the matrix W in the bilinear similarity is not guaranteed to be symmetric or PSD, I did not extend the class `MetricTransformer`, and did not implement the `transform()` method.
Also, for `score_pairs()` I propose two implementations, and I am not sure which one performs better.
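For context, here is a minimal sketch of two common ways to compute batched bilinear scores (these are illustrative and not necessarily the exact variants proposed in this PR):

```python
import numpy as np

def score_pairs_einsum(U, V, W):
    # u_i^T W v_i for each pair (u_i, v_i), as a single einsum contraction
    return np.einsum('ij,jk,ik->i', U, W, V)

def score_pairs_matmul(U, V, W):
    # same computation via a matmul followed by an elementwise product
    return np.sum((U @ W) * V, axis=1)

# U, V: (n_pairs, n_features), W: (n_features, n_features)
rng = np.random.RandomState(0)
U, V, W = rng.randn(5, 3), rng.randn(5, 3), rng.randn(3, 3)
assert np.allclose(score_pairs_einsum(U, V, W), score_pairs_matmul(U, V, W))
```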
Note: along the way this fixes #160
You should also ensure compatibility with pairs, triplets and quadruplets classifiers, to make sure we can then build actual bilinear similarity learning algorithms. Probably many existing tests for these will just naturally pass with bilinear similarity, while others may need to be adapted.
@bellet @perimosocordiae @wdevazelhes @terrytangyuan
There needs to be a broader discussion regarding adding BilinearSimilarity and Bilinear Similarity Learners such as OASIS.
By definition, bilinear similarity works in the opposite direction to a metric such as the Mahalanobis distance.
Given two vectors `u, v` in `R^n`, if their bilinear similarity `S(u, v)` is positive and large, it means the vectors are close. On the opposite, if `S(u, v)` is negative, it means the vectors point in different directions and are far apart. This contrasts with a metric, where a large value means the vectors are far apart.
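A small numeric illustration of this contrast (toy code, not from the PR):

```python
import numpy as np

rng = np.random.RandomState(42)
u, v = rng.randn(3), rng.randn(3)

# Bilinear similarity: W is arbitrary (neither symmetric nor PSD);
# a larger S(u, v) means the vectors are more similar.
W = rng.randn(3, 3)
S = u @ W @ v

# Mahalanobis distance: M must be PSD (here M = L^T L by construction);
# a larger d(u, v) means the vectors are less similar.
L = rng.randn(3, 3)
M = L.T @ L
d = np.sqrt((u - v) @ M @ (u - v))
```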
Regarding the API, this means that `score_pairs()` works semantically in opposite directions for `MahalanobisMixin` and `BilinearMixin`, while both implement it, as both inherit from `BaseMetricLearner`.
There are several paths that can be followed to deal with this:
- Add a -1 multiplication to the result of `score_pairs` in `BilinearMixin`, so that the global semantic of that function remains the same across the whole package, and add a warning to the documentation (see the sketch after this list).
- Don't change `score_pairs()` and return the original bilinear similarity: this may conflict when comparing different models, as the user needs to take care of making an exceptional case for bilinear learners, considering that large values of `score_pairs` mean closer points.
- Don't inherit `BilinearMixin` from `BaseMetricLearner`: as the bilinear similarity is not a metric, it's semantically incorrect. Create a whole different class/part of the code for `SimilarityLearners`.
- Other options I don't see right now.
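To make the first option concrete, a rough sketch (the attribute name `components_` is only an assumption for illustration, and the real class would extend `BaseMetricLearner`):

```python
import numpy as np

class BilinearMixin:  # sketch only
    def score_pairs(self, pairs):
        """Assumes the learned matrix W is stored in self.components_."""
        pairs = np.asarray(pairs)            # shape: (n_pairs, 2, n_features)
        u, v = pairs[:, 0], pairs[:, 1]
        sim = np.sum((u @ self.components_) * v, axis=-1)   # u_i^T W v_i
        return -sim   # negated so "lower = more similar" holds package-wide
```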
This is also a concern for `Classifiers`, as they rely on `score_pairs()` in `decision_function()` (and in `predict()` as a consequence) to verify triplets/quadruplets, and to measure distances for pairs.
For example, if we choose option 2) and don't change `score_pairs()` for bilinear, we will have to use `isinstance()` in the classifiers to distinguish between `MahalanobisMixin` and `BilinearMixin`. Another option is to build almost the same classifiers, but for `BilinearMixin` exclusively, duplicating a lot of code.
The decision made now may imply many things for the future, all the more if more bilinear learner implementations arrive later, or if there are other similarities apart from bilinear, and learners for those other similarities.
My humble opinion: I believe it's more important to keep the global semantic of `score_pairs()`, even when the "metric" learned is in fact a similarity. I bet on the -1 multiplier in `BilinearMixin`, trusting that the final user will see a huge and clear warning for bilinear learners. Maybe implement an extra function that returns the original similarities (without the -1 multiplier), although that idea seems silly.
P.S.: By "semantic" I mean the meaning of what the function is intended to do. For instance, the current semantic of `score_pairs()` is that the lower the output value, the closer the pair.
@mvargas33 I agree with you. IMHO, it would be good to reuse the existing code and classes/methods as much as possible:
- Regarding whether to multiply by -1 or not for `score_pairs` in `BilinearMixin`: I agree it would be good, so that `score_pairs` is indeed "in the same direction" for all metric learners, which would keep the code for `Classifiers` working for bilinear learners, and allow users to write some generic code with `score_pairs`. I agree that, as a consequence, we would need to ensure that users know what they're doing when using `score_pairs`, and a warning in the documentation sounds good. → Maybe, for even more clarity, we could also revisit `score_pairs` by renaming it to something more self-explanatory like `pairs_remoteness` (this is probably a bad name, but any synonym of distance would work; "distance" itself may not be ideal if we want to return things that are not true distances), to keep the idea "small = closer, big = further away". And maybe having the opposite function, as you suggested, say `pairs_similarity`, would be good too: in the documentation for the instantiated versions of those methods, we would for instance say for bilinear learners that `pairs_similarity` is the actual bilinear similarity (from the math formula), and for Mahalanobis learners that `pairs_remoteness` is the actual Mahalanobis distance (from the math formula); for the remaining functions (`pairs_remoteness` for bilinear learners and `pairs_similarity` for Mahalanobis learners), we would just say it is the opposite of the other one, given for convenience. This may be a good compromise between having, on the one hand, generic code (useful for us in classes like `Classifier` and for users who write generic code), and, on the other hand, being very transparent and clear to the user about what each function does. This way we may not even need the warning. But maybe having two methods is a bit too much, I'm not sure; I guess either way is good (two methods, or one method with the right warning in the docs).
(Note: also regarding the name `pairs_remoteness`, I guess we could either have a generic name like this and say for Mahalanobis learners that it is equal to the true distance, OR have a precise name like `pairs_distance` and add a warning saying that for bilinear learners it is not a true distance (in the mathematical sense, since (- bilinear similarity) can be negative). I guess either way works.)
NB: Probably the `get_metric` function would need to be adapted too, the same way as `score_pairs`?
- Regarding inheriting from `BaseMetricLearner`: I think it would be good indeed. We can even see `BaseMetricLearner` not as a generic class for distance metric learners, but as a class for learners of the package "metric-learn" :D, so in this case bilinear learners fit in ;) (Or otherwise we could rename that abstract class to just `BaseLearner`, for instance ;) )
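A minimal sketch of what this two-method idea could look like (the names are the tentative ones discussed above, nothing final):

```python
class BaseMetricLearner:
    def pairs_similarity(self, pairs):
        # larger = more similar; each learner implements this
        raise NotImplementedError

    def pairs_remoteness(self, pairs):
        # larger = further away; by default simply the opposite similarity,
        # provided for convenience so generic code can use either direction
        return -self.pairs_similarity(pairs)
```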
@wdevazelhes
I like the idea of `pairs_remoteness` and `pairs_similarity`, because it adds a layer of abstraction, making the code more generic and able to support metrics and similarities without complicating things too much. No change will be needed in `Classifiers` that way.
Even if this looks like overkill right now, it guarantees that the code will be scalable and easily extensible in the future. For instance, imagine there are 5 bilinear similarity learners at some point: how does the "-1 multiplier + warning" solution deal with that? To me, that sounds more like a patch.
Or what if there is support for data-structure similarities in the future, like tree similarities or string similarities? That adds more similarity types, which could be easily handled with `pairs_remoteness` and `pairs_similarity`, no matter what the algorithm actually computes in the end: closeness or remoteness.
I also like the idea of renaming `BaseMetricLearner` to `BaseLearner`, again because the latter is more abstract and semantically correct for both metrics and similarities.
@mvargas33 That sounds good! Yes, that's right, having two general functions like this would make the code quite generic and extensible.
Hi, thanks @mvargas33 and @wdevazelhes for kicking off the discussion. Here are some comments:
- I agree that we should act at the level of the abstract class `BaseMetricLearner` and make the semantics of the abstract methods consistent, so that everything downstream (e.g., classifiers) works seamlessly.
- I think we chose the current name of the method `score_pairs` in analogy to sklearn's `score` method, but somehow the semantic is reversed compared to the latter when considering Mahalanobis learners (a larger score means less similar), which is counter-intuitive. In hindsight I don't think we made the right choice here, and I am in favor of changing this one way or another.
- Given the above, we have several choices:
  - Either we stick to `score_pairs` but adjust the semantics to always follow the convention "larger = more similar". This is the easiest change, but it means changing the current behavior for Mahalanobis learners, which may break existing code (of course we can add warnings, but this may not be enough for people to become aware of the problem).
  - Or we decide it is better to change the name of the method (and deprecate `score_pairs`, giving users time to adapt). We could use `pair_score` or `pairwise_score` if we want to stick close to what we had. As suggested by @wdevazelhes, we could also consider introducing two methods (one for similarity and one for dissimilarity/distance). Maybe `pair(wise)_score` could be the default, and we could implement `pair(wise)_distance` only when the metric is actually a true (pseudo) distance? Otherwise, the two methods would be the same in practice up to a minus sign, which is not all that useful. What do you think?
- "Metric learning" have been used extensively in the literature as broad terms encompassing both metric learning and similarity learning. Note incidentally that in general Mahalanobis "distances" are not true metrics (only pseudo-distances). I think we should follow this terminology (probably explain this clearly in the doc), and thus for instance keep the name
BaseMetricLearner
(the nameBaseLearner
becomes way too generic IMO). After all, the library is calledmetric_learn
:-) - Similarly, I am not convinced at this point that
get_metric
needs to be changed. BTW, is there any sklearn estimator that has an argument for providing a similarity instead of a metric?
Would be great to hear @perimosocordiae and @terrytangyuan's take on this!
@bellet Yes, I agree the deprecation sounds good, to force users to change their code. And I agree that one general method that every learner would implement (like similarity, which indeed respects the meaning of a "score" better) may be enough for having generic code (for us, and for our users); then a method like `pairwise_distances` for (pseudo-)distance metric learners is indeed cool to have (they even have a function with the same name in sklearn: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise_distances.html).
- For `get_metric` you're right: in the end this would mostly be used as a function in sklearn's estimators, so no real need to extend it, I guess.
- Regarding your question about whether sklearn's estimators sometimes take in a similarity: I just checked, and it seems they can, for instance with `GaussianProcesses` (see https://scikit-learn.org/stable/modules/gaussian_process.html#kernels-for-gaussian-processes). There is even an interesting paragraph in the user guide about the API. I didn't fully understand how it all works, but it seems like it could definitely be a source of inspiration:

> All Gaussian process kernels are interoperable with sklearn.metrics.pairwise and vice versa: instances of subclasses of Kernel can be passed as metric to pairwise_kernels from sklearn.metrics.pairwise. Moreover, kernel functions from pairwise can be used as GP kernels by using the wrapper class PairwiseKernel.
Also, it seems that bilinear metrics in fact learn a linear kernel in a linearly transformed space, is that right? If so, I guess it'd be cool if they returned an object that can be used as a kernel in scikit-learn's `GaussianProcess`! Also, this part of the documentation seems interesting: https://scikit-learn.org/stable/modules/gaussian_process.html#gaussian-process-kernel-api It seems that when returning the kernel object, it is even possible to specify which hyperparameters it has (in this case maybe the matrix `A` in `x^T A y`?), and what the gradient of the kernel function with respect to them is: this may mean that scikit-learn could fine-tune a kernel previously learned by, say, OASIS, in order to optimize a Gaussian process (not sure whether that would work, or whether it would be useful in some real-life examples, but it definitely shows the power of the API!).
NB: here is another scikit-learn doc that may be useful: https://scikit-learn.org/stable/modules/metrics.html#metrics
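A quick sanity check of the linear-kernel claim above, under the assumption that the learned matrix is PSD (so it factors as `W = L^T L`):

```python
import numpy as np
from sklearn.metrics.pairwise import pairwise_kernels

rng = np.random.RandomState(0)
L = rng.randn(3, 3)
W = L.T @ L                     # PSD by construction
X = rng.randn(5, 3)

K_bilinear = X @ W @ X.T        # pairwise bilinear similarities x_i^T W x_j
K_linear = pairwise_kernels(X @ L.T, metric='linear')  # linear kernel on L x
assert np.allclose(K_bilinear, K_linear)
```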
@wdevazelhes the thing is that a bilinear similarity is not guaranteed to yield a data transformation (and thus to be a valid kernel). This is only true when the matrix is PSD, which is often not enforced by similarity learning algorithms, to avoid the computational cost of the projection onto the PSD cone. So basically, a bilinear metric learner will not always be a transformer or a valid kernel (and in practice, may often not be). We could still implement `transform` and raise an error when the matrix is not PSD, but this may be confusing for users?
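For reference, checking PSD-ness of a possibly non-symmetric W reduces to the eigenvalues of its symmetric part, since `x^T W x = x^T ((W + W^T) / 2) x`; a minimal sketch:

```python
import numpy as np

def is_psd(W, tol=1e-10):
    # the quadratic form x^T W x depends only on the symmetric part of W
    W_sym = (W + W.T) / 2
    return np.linalg.eigvalsh(W_sym).min() >= -tol
```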
@mvargas33 at this point of the discussion I think we all agree that it is a good idea to:
- deprecate the `score_pairs` method
- add instead a `pairwise_similarities` method, which follows the convention "the higher, the more similar"
- add a `pairwise_distances` method, which is only implemented for metric learners that learn an actual (pseudo) distance, i.e., implemented for Mahalanobis but not for bilinear similarity
So perhaps you can go ahead with the above changes, and we can continue the discussion a bit regarding what to do for bilinear similarity (do we want specific behavior for when the matrix is actually PSD?). @terrytangyuan @perimosocordiae your opinion is most welcome!
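For the deprecation part, a standard shim would look roughly like this (using the method names from the list above; the final PR may settle on different names):

```python
import warnings

class BaseMetricLearner:
    def score_pairs(self, pairs):
        # deprecated entry point: warn, then delegate to the new API
        warnings.warn("score_pairs is deprecated, use pairwise_similarities "
                      "or pairwise_distances instead.", DeprecationWarning)
        return self.pairwise_similarities(pairs)

    def pairwise_similarities(self, pairs):
        raise NotImplementedError  # implemented by every learner

    def pairwise_distances(self, pairs):
        # only implemented by learners of a true (pseudo) distance
        raise NotImplementedError
```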
@bellet that's right, sorry! I didn't realize the PSD condition is not necessarily verified here. I agree with the points above. For the specific case where the matrix is actually PSD, yes, I agree that may confuse users a bit; it may feel weird to have methods that sometimes work and sometimes (often) don't.
This is always a feature we can add later if we feel it is needed. @mvargas33 I think we have some kind of consensus now, so you may go ahead and start implementing these changes.
Yeah, if needed in the future, this particular feature should not take too much time to implement.
Btw, I already implemented these changes in #333. All tests pass. The deprecation warning is shown when the user directly calls `score_pairs`. Classifiers now use `pair_distance`, and in the docs all practical usage of `score_pairs` was replaced with `pair_distance`.
@bellet @perimosocordiae @wdevazelhes @terrytangyuan
This branch is ready for a code review. It would be cool if you can spare some time to look at it.
Here's a summary of the changes:
- Introduction of `BilinearMixin` in `base_metric.py`.
- Documentation for future bilinear similarity learners in `introduction.rst`, `supervised.rst` and `weakly_supervised.rst`. The idea is to generalize the learner concept and to make crystal clear that not all learners have `pair_distance`, but every learner has `pair_score`.
And the summary for the tests:
- Made mock objects for `BilinearMixin` at all levels, including tuple learners. This guarantees that the behavior of the tests will stay the same for any kind of bilinear similarity learner that might be implemented in the future, taking into account that implementing one algorithm mainly changes `fit`.
- With these mock learners, there are several tests in `test_bilinear_mixin.py` to check the correctness of `BilinearMixin`. Some of them are hardcoded for symmetry reasons.
- Refactor of algorithm lists: there are now lists of Mahalanobis and bilinear learners, separately and together. This solves #160 and makes the tests scalable. Thanks to the lists, only Mahalanobis learners are tested in `test_mahalanobis_mixin.py`, and, for instance, everything inside the pairs, triplets and quadruplets classifiers test files is tested with both kinds of learners.
- Test refactors: some tests call `transform` and `pair_distance` inside, so I had to duplicate them to test these methods separately for Mahalanobis learners, for instance `test_array_like_inputs_mahalanobis` and `test_same_with_or_without_preprocessor_mahalanobis`.
- There are also some tests that are identical for Mahalanobis and bilinear learners, but they live in their respective test files. Should we have a test file for tests regarding all kinds of learners? There is none right now, but all learners are tested in the classifier tests. Take as examples `test_pair_score_dim` and `test_deprecated_score_pairs_same_result` (a sketch of such a shared test follows after this list).
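For illustration, the shape of such a shared test might look like this (a simplified sketch; the pair-building helper is only indicative):

```python
import numpy as np

def check_pair_score_dim(model, X):
    # build (n_pairs, 2, n_features) tuples from consecutive points
    pairs = np.stack([X[:-1], X[1:]], axis=1)
    scores = model.pair_score(pairs)
    assert scores.shape == (pairs.shape[0],)  # one score per pair
```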
After digging deep into the tests these days: the parametrization list mentioned in #136 is already live for most tests. But parametrization for `metric_learn_test`, `test_fit_transform` and `test_transformer_metric_conversion` is still missing.
I think that parametrizing `fit_transform` is not that hard with the lists as they are, and with `build_dataset` already in `test_utils.py`. But I'll prioritize #330 for now, the PR just after this one.
Best! 😁
> There are also some tests that are identical for Mahalanobis and Bilinear learners, but they are in their respective test files. Should we have a test file for tests regarding all kinds of learners? Now there is none, but all learners are tested in classifiers tests. Take as an example: `test_pair_score_dim` and `test_deprecated_score_pairs_same_result`.
Yes, tests that are common to Mahalanobis and bilinear learners (and thus most likely to any metric learner) should not be duplicated; they should be placed in `test_base_metric.py` instead.
Here's a review for the documentation part.
Generally, we need to change the vocabulary a bit to accommodate both Mahalanobis and bilinear similarities. As is common in the literature, I propose to use the term "metric" as a generic term (encompassing both distances and similarities).
In addition to the changes mentioned below, some changes are needed in the beginning of "What is metric learning?":
Sec 1:
* Many approaches in machine learning require a measure of distance (or similarity) between data points.
* "choose a standard distance metric" --> "choose a standard metric"
* "Distance metric learning (or simply, metric learning)" --> "Metric learning"
* "task-specific distance metrics" --> "task-specific metrics"
* "The learned distance metric" --> "the learned metric"
Sec 1.1:
* "the goal in this setting is to learn a distance metric" --> "... a metric"
* "the goal is to learn a distance metric" --> "... a metric"
* "find the parameters of a distance function" --> "... of a metric function"
Please do spellcheck to avoid typos!
All observations were resolved for `introduction.rst`, `supervised.rst` and `weakly_supervised.rst`.
I will run the spellchecker soon, as well as move the common tests to `test_base_metric.py`.
- I ran the spellcheck and fixed a few typos.
- I moved `test_pair_score_dim` and `test_deprecated_score_pairs_same_result` to `test_base_metric.py`.
- I refactored `test_same_with_or_without_preprocessor` to run with all metric learners, and used `isinstance(estimator, MahalanobisMixin)` to extend it for `transform` and `pair_distance`, while avoiding a duplicated/separate test for Mahalanobis/Bilinear only (see the outline after this list).
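In outline, that refactored test looks something like this (a simplified sketch, not the actual test code):

```python
from metric_learn.base_metric import MahalanobisMixin

def check_same_with_or_without_preprocessor(estimator, pairs):
    scores = estimator.pair_score(pairs)          # available on every learner
    if isinstance(estimator, MahalanobisMixin):
        # distance and embedding only make sense for Mahalanobis learners
        dists = estimator.pair_distance(pairs)
        embedded = estimator.transform(pairs[:, 0])
    # ... assertions comparing results with and without preprocessor go here
```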
@wdevazelhes @perimosocordiae @terrytangyuan it would be great if one of you could review this PR as well in the next couple of weeks (@mvargas33 will be off for ~2 weeks, so he can finalize the PR once he is back).
I resolved merge conflicts, but there are still a bunch of review items to address here.