metric-learn
Warning on RCA with chunk labeling not starting on zero or with gaps
Description
RCA expects chunk labels to start at zero and increase one by one; a warning is raised if they do not start at zero or contain gaps. Although these warnings are raised, they do not affect the result of the RCA fit.
Maybe we should be more user-friendly and allow the user to specify chunk ids as arbitrary non-negative integers (a negative value is interpreted as "not in chunk"), even if they do not start at 0 or are not contiguous, just like we (and sklearn) do for methods that are fitted on a classic class vector y.
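For illustration, here is a minimal sketch of what such preprocessing could look like (remap_chunks is a hypothetical helper, not part of the metric-learn API), assuming negative values mean "not in chunk":

import numpy as np

def remap_chunks(chunks):
    """Map arbitrary non-negative chunk ids to contiguous 0-based ids.

    Negative entries are kept as-is and mean "point is not in any chunk".
    Similar in spirit to how scikit-learn remaps class labels.
    """
    chunks = np.asarray(chunks)
    out = np.full_like(chunks, -1)
    mask = chunks >= 0
    # np.unique returns sorted unique labels; return_inverse gives, for each
    # element, its index into that sorted array, i.e. a 0-based contiguous id.
    _, out[mask] = np.unique(chunks[mask], return_inverse=True)
    return out

# e.g. [1, 1, 2, 2, -1, 3, 3, 4, 4] -> [0, 0, 1, 1, -1, 2, 2, 3, 3]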
Steps/Code to Reproduce
from metric_learn import RCA

X = [[-0.05, 3.0], [0.05, -3.0],
     [0.1, -3.55], [-0.1, 3.55],
     [-0.95, -0.05], [0.95, 0.05],
     [0.4, 0.05], [-0.4, -0.05]]
chunks = [1, 1, 2, 2, 3, 3, 4, 4]
rca = RCA()
X = rca.fit_transform(X, chunks)
Expected Results
No unexpected warnings thrown.
Actual Results
The following warnings are thrown:
./metric-learn/rca.py:28: RuntimeWarning: Mean of empty slice.
chunk_data[mask] -= chunk_data[mask].mean(axis=0)
./numpy/core/_methods.py:154: RuntimeWarning: invalid value encountered in true_divide
ret, rcount, out=ret, casting='unsafe', subok=False)
Versions
Linux-5.0.0-37-generic-x86_64-with-Ubuntu-18.04-bionic
Python 3.6.9 (default, Nov 7 2019, 10:44:02) [GCC 8.3.0]
NumPy 1.18.1
SciPy 1.4.1
Scikit-Learn 0.22.1
Metric-Learn 0.5.0
Yes, it would be nicer to do pre-processing on chunk labels with np.unique(), similar to scikit-learn. It's potentially a performance hit, though, so we might want to allow users to skip it.
To solve this it should suffice to replace the current max() computation with a unique() computation; I don't think the performance difference would be significant. In any case, this issue arises in the chunk mean centering, which is going to be deprecated in the future.
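To illustrate the difference (a minimal sketch; the exact loop in rca.py may differ): counting chunks as max() + 1 makes the centering loop visit a chunk index 0 that has no members in the example above, which is where the "Mean of empty slice" warning comes from, whereas np.unique() only yields labels that actually occur.

import numpy as np

chunks = np.array([1, 1, 2, 2, 3, 3, 4, 4])

# Counting chunks as max() + 1 silently assumes labels 0..max are all present:
num_chunks = chunks.max() + 1                     # 5, but only 4 chunks exist
empty = [c for c in range(num_chunks) if not np.any(chunks == c)]
print(empty)                                      # [0] -> mean over an empty slice

# Counting (and remapping) with np.unique leaves no empty chunk indices:
labels, remapped = np.unique(chunks, return_inverse=True)
print(len(labels), remapped)                      # 4 [0 0 1 1 2 2 3 3]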