metric-learn icon indicating copy to clipboard operation
metric-learn copied to clipboard

Warning on RCA with chunk labeling not starting on zero or with gaps

Open grudloff opened this issue 5 years ago • 2 comments

Description

RCA chunks are expected to start at zero and increase one by one, this raises a warning in case it doesn't start at zero or has any gap. Although these warnings are raised this doesn't affect the result of the RCA fit.

Maybe we should be more user friendly and allow the user to specify chunk ids by arbitrary non-negative integers (negative is interpreted as "not in chunk"), even if they do start at 0 and are not contiguous, just like we (and sklearn) do for methods which are fitted on a classic class vector y.

Steps/Code to Reproduce

from metric_learn import RCA
X = [[-0.05,  3.0],[0.05, -3.0],
    [0.1, -3.55],[-0.1, 3.55],
    [-0.95, -0.05],[0.95, 0.05],
    [0.4,  0.05],[-0.4, -0.05]]
chunks = [1, 1, 2, 2, 3, 3, 4, 4]

rca = RCA()
X=rca.fit_transform(X, chunks)

Expected Results

No thrown unexpected warnings.

Actual Results

The following warnings are thrown:

./metric-learn/rca.py:28: RuntimeWarning: Mean of empty slice.
  chunk_data[mask] -= chunk_data[mask].mean(axis=0)
./numpy/core/_methods.py:154: RuntimeWarning: invalid value encountered in true_divide
  ret, rcount, out=ret, casting='unsafe', subok=False)

Versions

Linux-5.0.0-37-generic-x86_64-with-Ubuntu-18.04-bionic Python 3.6.9 (default, Nov 7 2019, 10:44:02) [GCC 8.3.0] NumPy 1.18.1 SciPy 1.4.1 Scikit-Learn 0.22.1 Metric-Learn 0.5.0

grudloff avatar Jan 20 '20 13:01 grudloff

Yes, it would be nicer to do pre-processing on chunk labels with np.unique(), similar to scikit-learn. It's potentially a performance hit, though, so we might want to allow users to skip it.

perimosocordiae avatar Jan 21 '20 14:01 perimosocordiae

To solve this it should be necessary to replace the current max() computation by the unique() computation, I don't think the performance difference should be significant. In any case, this issue arises in the chunk mean centering which is going to be deprecated in the future.

grudloff avatar Jan 22 '20 14:01 grudloff