
Persistence tests fail for LogisticRegression et al. with multiclass classification


Right now, when testing persistence of classifiers, we create a binary classification task, and the classifiers all pass. However, when switching to a multiclass classification task, LogisticRegression and related estimators fail (e.g. CalibratedClassifierCV, which uses LogisticRegression under the hood by default).

To reproduce, replace the following lines:

https://github.com/skops-dev/skops/blob/2c2dd6eddbd88c4774d297c387d1169533246b97/skops/io/tests/test_persist.py#L421-L423

by these lines:

        X, y = make_classification(
            n_samples=N_SAMPLES,
            n_features=N_FEATURES,
            random_state=0,
            n_classes=3,
            n_redundant=1,
            n_informative=N_FEATURES - 1,
        )

(note that n_redundant and n_informative are irrelevant here; they just need to be adjusted for make_classification to accept three classes)

The failure is that the contiguity of the coef_ attribute is not the same before and after the round trip. Strangely enough, it is the original estimator that seems to be "wrong":

>>> estimator.coef_.flags
  C_CONTIGUOUS : False
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False

>>> loaded.coef_.flags
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False

I haven't investigated further, but I suspect that LogisticRegression uses a different algorithm under the hood when dealing with binary classification, which would explain why this only occurs in the multiclass setting. EDIT: See below, that's not the reason.
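For reference, a self-contained round trip showing the mismatch (the dataset sizes here are arbitrary illustrations, not the constants from the test suite):

>>> from sklearn.datasets import make_classification
>>> from sklearn.linear_model import LogisticRegression
>>> from skops.io import dumps, loads
>>> X, y = make_classification(
...     n_samples=50, n_features=4, n_classes=3,
...     n_informative=3, n_redundant=1, random_state=0,
... )
>>> estimator = LogisticRegression().fit(X, y)  # fit_intercept=True by default
>>> trusted = ['sklearn.linear_model._logistic.LogisticRegression']
>>> loaded = loads(dumps(estimator), trusted=trusted)
>>> estimator.coef_.flags['C_CONTIGUOUS'], loaded.coef_.flags['C_CONTIGUOUS']
(False, True)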

BenjaminBossan avatar Dec 01 '22 15:12 BenjaminBossan

Okay, I could identify where the contiguity is "lost":

First of all, the test:

import pytest
from sklearn.linear_model import LogisticRegression
from skops.io import dumps, loads
# assert_params_equal is the comparison helper already used in skops/io/tests/test_persist.py


@pytest.mark.parametrize('fit_intercept', [True, False], ids=['w/ interc', 'w/o interc'])
@pytest.mark.parametrize('multi_class', ['auto', 'ovr', 'multinomial'])
@pytest.mark.parametrize('binary', [True, False], ids=['two classes', 'three classes'])
def test_lr(binary, multi_class, fit_intercept):
    X = [[0, 1], [2, 3], [4, 5]]
    if binary:
        y = [0, 1, 1]
    else:
        y = [0, 1, 2]

    estimator = LogisticRegression(multi_class=multi_class, fit_intercept=fit_intercept)
    estimator.fit(X, y)

    trusted = ['sklearn.linear_model._logistic.LogisticRegression']
    loaded = loads(dumps(estimator), trusted=trusted)
    assert_params_equal(loaded.__dict__, estimator.__dict__)

The error is:

FAILED skops/io/tests/test_persist.py::test_lr[three classes-auto-w/ interc] - assert True is False
FAILED skops/io/tests/test_persist.py::test_lr[three classes-ovr-w/ interc] - assert True is False
FAILED skops/io/tests/test_persist.py::test_lr[three classes-multinomial-w/ interc] - assert True is False

So it only happens when there are 3 (or more) classes and fit_intercept=True.

The "offending" line is:

https://github.com/scikit-learn/scikit-learn/blob/4dde64d61e6e172da466cc7b7eaa21b832b96fbb/sklearn/linear_model/_logistic.py#L1330

As you can see, this line is only called with fit_intercept=True. So that explains this part. But why does it only occur with 3 or more classes? Let me demonstrate:

import numpy as np

# FIRST: show that (m, n) and (1, n) matrices behave differently for contiguity

x = np.random.random((5, 3))  # coef_ with > 2 classes is 2d
x.flags
#  C_CONTIGUOUS : True
#  F_CONTIGUOUS : False

x[:, :-1].flags
#  C_CONTIGUOUS : False
#  F_CONTIGUOUS : False

y = np.random.random((1, 3))  # coef_ with 2 classes is effectively "1d"
y.flags
#  C_CONTIGUOUS : True
#  F_CONTIGUOUS : True

y[:, :-1].flags
#  C_CONTIGUOUS : True
#  F_CONTIGUOUS : True

# SECOND: show that neither-C-nor-F contiguity is lost after saving, but both-C-and-F is preserved

np.save('arr.npy', x[:, :-1])
xl = np.load('arr.npy')
xl.flags
#  C_CONTIGUOUS : True
#  F_CONTIGUOUS : False

np.save('arr.npy', y[:, :-1])
yl = np.load('arr.npy')
yl.flags
#  C_CONTIGUOUS : True
#  F_CONTIGUOUS : True

I bow to you if you already knew that :D

Anyway, how can we resolve this bug?

  1. I don't think anything will be changed on the numpy side about how flags change with slicing, and presumably there is a good reason why this is the correct behavior.

  2. It's also unlikely that np.save/np.load will be changed to preserve contiguity when the array is neither C nor F contiguous.

  3. On the sklearn side, one could perhaps argue that leaving coef_ non-contiguous can result in lower performance and should be changed, but I'm not sure.

  4. On the skops side, we could ignore the contiguity in the unit test, but again, an error there could have performance implications.

  5. On the skops side, we could store the contiguity of numpy arrays and then, after loading, enforce it again (see the sketch after this list). I'm not sure, though, if it's possible to tell numpy to make an array neither C nor F contiguous, or both C and F contiguous; AFAICT it's one or the other. Maybe it's possible in some hacky fashion.
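A rough sketch of what option 5 could look like (hypothetical helpers, not skops API; note that np.save already preserves pure C or pure F order on its own, so the stored flags mainly pin down the policy for the neither-C-nor-F case):

import numpy as np

def save_array(path, arr):
    # Hypothetical helper: persist the data together with its contiguity
    # flags (path should end in '.npz').
    np.savez(path, data=arr,
             c=arr.flags['C_CONTIGUOUS'], f=arr.flags['F_CONTIGUOUS'])

def load_array(path):
    with np.load(path) as npz:
        arr = npz['data']
        c, f = bool(npz['c']), bool(npz['f'])
    # Re-apply F order if the original was F- but not C-contiguous; a
    # neither-C-nor-F layout cannot be reconstructed from the flags alone,
    # so it deliberately comes back C-contiguous.
    if f and not c:
        arr = np.asfortranarray(arr)
    return arr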

BenjaminBossan avatar Dec 02 '22 15:12 BenjaminBossan

  4. On the skops side, we could ignore the contiguity in the unit test, but again, an error there could have performance implications.

I think this is an option, but it would definitely be a performance hit for users who depend on contiguous arrays. I also think that if a user stores something which is explicitly contiguous, they would expect it to persist that way, so I feel it's something we should solve in any case.

  5. On the skops side, we could store the contiguity of numpy arrays and then after loading, enforce the contiguity -- but I'm not sure if it's possible to tell numpy to make an array neither C nor F contiguous / both C and F contiguous, AFAICT it's one or the other. Maybe it's possible in some hacky fashion.

Arrays can definitely be both simultaneously (e.g. 1D arrays):

>>> x = np.arange(9)
>>> x.flags
  C_CONTIGUOUS : True
  F_CONTIGUOUS : True

but outside of trivial examples, they can't be simultaneously C and F contiguous, as C contiguous is row-major, and F contiguous is column-major.

Arrays can also be discontiguous if they are strided views, e.g.:

>>> y = np.arange(9).reshape(3,3)[:,:2]
>>> y.flags
  C_CONTIGUOUS : False
  F_CONTIGUOUS : False

Though I don't know of a use case where a user would need an array to be discontiguous...

I will need to play around with the Numpy APIs a bit to see exactly how it handles trying to force either C or F contiguity. I think 1D arrays will always be either both C and F contiguous or discontiguous, and for non-trivial arrays, we could force either one or the other.
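For illustration, forcing either order on a neither-C-nor-F view is straightforward:

>>> z = np.arange(9).reshape(3, 3)[:, :2]  # neither C nor F contiguous
>>> np.ascontiguousarray(z).flags['C_CONTIGUOUS']
True
>>> np.asfortranarray(z).flags['F_CONTIGUOUS']
True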

I think it would make sense to store the contiguity flags during persistence, then format arrays correctly on loading. I can play around with that a bit this weekend to see if I can get something working.

E-Aho avatar Dec 04 '22 11:12 E-Aho

I think this is an option, but it would definitely be a performance hit for users who depend on contiguous arrays.

I'm not an expert on numpy low level stuff, but purely conceptually, if you have code like:

self.coef_ = self.coef_[:, :-1]

where the last column is dropped, if it was C-contiguous before, would it not be "almost" contiguous still? Or would it be treated like a randomly shuffled array? And if we load the array with skops and it is now C-contiguous, could there be any disadvantage? I would think probably not.
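Checking the strides quickly, the sliced view does keep its memory order; the data is not shuffled, each row just ends with a gap of one dropped element:

>>> x = np.random.random((5, 3))
>>> x[:, :-1].strides  # same strides as x itself; only the shape shrinks
(24, 8)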

Investigating a bit further, I also found that pickling results in the same loss of preserving the contiguity:

>>> import pickle
>>> estimator.coef_.flags
  C_CONTIGUOUS : False
  F_CONTIGUOUS : False

>>> p = pickle.loads(pickle.dumps(estimator))
>>> p.coef_.flags
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False

Therefore, I think that we should be fine if going from neither C nor F contiguous to C contiguous.

Basically, what I think would be a good result is:

C contiguous before | F contiguous before | C contiguous after | F contiguous after
------------------- | ------------------- | ------------------ | ------------------
T                   | T                   | T                  | T
T                   | F                   | T                  | F
F                   | T                   | F                  | T
F                   | F                   | T                  | F

I haven't checked yet if that needs any change in our code to make it happen; maybe only our testing code would need adjusting.
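A sketch of what that testing-code adjustment could look like, encoding the truth table above (a hypothetical helper, not existing skops code):

def assert_contiguity_roundtrip(original, loaded):
    # Pure C or pure F order should be preserved by the round trip; a
    # neither-C-nor-F array is expected to come back C-contiguous.
    c = original.flags['C_CONTIGUOUS']
    f = original.flags['F_CONTIGUOUS']
    expected_c, expected_f = (True, False) if not (c or f) else (c, f)
    assert loaded.flags['C_CONTIGUOUS'] == expected_c
    assert loaded.flags['F_CONTIGUOUS'] == expected_f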

BenjaminBossan avatar Dec 05 '22 10:12 BenjaminBossan

I think we should persist as above, and I think the truth table you've provided would be a good idea to implement.

I really can't think of any use case where a user would need something to not be contiguous, so I think having the final row of your table behave that way should be fine.

E-Aho avatar Dec 05 '22 11:12 E-Aho

I'll have a deeper look here soon, but in the meantime, I just realized numpy uses numpy.core.multiarray._reconstruct to construct things during unpickling.
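You can see this from the pickle protocol hook on ndarray:

>>> import numpy as np
>>> np.arange(3).__reduce__()[0]
<built-in function _reconstruct>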

adrinjalali avatar Dec 06 '22 17:12 adrinjalali

I was gonna report a bug on the numpy side, but I realized this makes total sense. AFAICT, every time the contiguity flags are different after load, it's also the case that OWNDATA is False before load and True after load. That makes sense, since such an array is really a view on an existing array, and changing its values would change the original array's values.

Which means we can have a weak test: check if OWNDATA is True, and if yes, do a strict check; if not, we can't really know what comes out of np.load.
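A sketch of what such a weak/strict check could look like (a hypothetical helper, assuming the observation above holds):

def assert_flags_roundtrip(original, loaded):
    # If the original owns its data, its layout should survive the round
    # trip exactly; if it is a view (OWNDATA=False), the loaded copy's
    # layout is not guaranteed, so skip the strict comparison.
    if original.flags['OWNDATA']:
        for flag in ('C_CONTIGUOUS', 'F_CONTIGUOUS'):
            assert original.flags[flag] == loaded.flags[flag]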

I'd love it if @rgommers could confirm this.

adrinjalali avatar Jan 24 '23 14:01 adrinjalali

That sounds correct to me. You should be able to rely on the data values in pickled arrays, but AFAIK there is no way with pickling to preserve the view relationships between different arrays.

rgommers avatar Jan 24 '23 16:01 rgommers