Cosine metric - error "Negative values in data passed to precomputed distance matrix"

Open j-adamczyk opened this issue 1 year ago • 2 comments

When PyNNDescentTransformer with the cosine metric is piped into KNeighborsClassifier(metric="precomputed"), calling predict raises an error:

pynndescent_ann = make_pipeline(
    PyNNDescentTransformer(metric="cosine", random_state=0),
    KNeighborsClassifier(metric="precomputed"),
)

pynndescent_ann.fit(X_train, y_train)
y_pred_pynndescent = pynndescent_ann.predict(X_test)  # raises the ValueError below

Error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_62071/159833562.py in <cell line: 35>()
     33 
     34 start_time = time()
---> 35 y_pred_pynndescent = pynndescent_ann.predict(X_test)
     36 end_time = time()
     37 

~/anaconda3/envs/podstawy-uczenia-maszynowego-rozwiazania/lib/python3.10/site-packages/sklearn/pipeline.py in predict(self, X, **predict_params)
    479         for _, name, transform in self._iter(with_final=False):
    480             Xt = transform.transform(Xt)
--> 481         return self.steps[-1][1].predict(Xt, **predict_params)
    482 
    483     @available_if(_final_estimator_has("fit_predict"))

~/anaconda3/envs/podstawy-uczenia-maszynowego-rozwiazania/lib/python3.10/site-packages/sklearn/neighbors/_classification.py in predict(self, X)
    232             # In that case, we do not need the distances to perform
    233             # the weighting so we do not compute them.
--> 234             neigh_ind = self.kneighbors(X, return_distance=False)
    235             neigh_dist = None
    236         else:

~/anaconda3/envs/podstawy-uczenia-maszynowego-rozwiazania/lib/python3.10/site-packages/sklearn/neighbors/_base.py in kneighbors(self, X, n_neighbors, return_distance)
    802         else:
    803             if self.metric == "precomputed":
--> 804                 X = _check_precomputed(X)
    805             else:
    806                 X = self._validate_data(X, accept_sparse="csr", reset=False, order="C")

~/anaconda3/envs/podstawy-uczenia-maszynowego-rozwiazania/lib/python3.10/site-packages/sklearn/neighbors/_base.py in _check_precomputed(X)
    192     copied = graph.format != "csr"
    193     graph = check_array(graph, accept_sparse="csr")
--> 194     check_non_negative(graph, whom="precomputed distance matrix.")
    195     graph = sort_graph_by_row_values(graph, copy=not copied, warn_when_not_sorted=True)
    196 

~/anaconda3/envs/podstawy-uczenia-maszynowego-rozwiazania/lib/python3.10/site-packages/sklearn/utils/validation.py in check_non_negative(X, whom)
   1416 
   1417     if X_min < 0:
-> 1418         raise ValueError("Negative values in data passed to %s" % whom)
   1419 
   1420 

ValueError: Negative values in data passed to precomputed distance matrix.

This makes sense, since according to the docs (https://pynndescent.readthedocs.io/en/latest/pynndescent_metrics.html#Beware-of-bounded-distances):

This means that, for example, internally PyNNDescent uses the negative log of the cosine similarity instead of cosine distance (and converts the distance values when done).

However, those distances are probably not converted when using PyNNDescentTransformer, hence the error.
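For reference, here is a minimal plain-Python sketch of the two quantities that docs sentence relates: the negative log of the cosine similarity and the cosine distance it is converted back to. The base-2 logarithm and all helper names are assumptions made for illustration; none of this is taken from PyNNDescent's code.

```python
import math


def cosine_similarity(x, y):
    """Plain cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def neg_log_similarity(sim):
    # Assumption: base-2 logarithm; the quoted docs sentence does not state the base.
    return -math.log2(sim)


def to_cosine_distance(d):
    # Inverse of the transform above: sim = 2 ** (-d), distance = 1 - sim.
    return 1.0 - 2.0 ** (-d)


# Two nonnegative vectors (made-up values, not from the Codon Usage dataset).
x = [0.2, 0.3, 0.5]
y = [0.1, 0.4, 0.5]

sim = cosine_similarity(x, y)
internal = neg_log_similarity(sim)
converted = to_cosine_distance(internal)
print(sim, internal, converted, 1.0 - sim)

# Hypothetical case: a similarity that rounding has pushed slightly above 1.
rounded_sim = 1.0 + 1e-7
internal_rounded = neg_log_similarity(rounded_sim)
converted_rounded = to_cosine_distance(internal_rounded)
print(internal_rounded, converted_rounded)
```

Under these assumptions, a similarity of at most 1 gives a nonnegative value both before and after conversion. A negative entry would need a similarity that comes out slightly above 1, for example through floating-point rounding. Whether that is what happens in PyNNDescentTransformer is not established here.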

j-adamczyk, Mar 18 '23 16:03

Thanks, I'll try to look into this when I get a chance.

lmcinnes, Mar 18 '23 18:03

The dataset used was Codon Usage. Interestingly enough, it has only nonnegative values (codon percentages), so regular cosine is always nonnegative as well. Exact code:

import pandas as pd
from pynndescent import PyNNDescentTransformer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

df = pd.read_csv("codon_usage.csv")
# Keep only rows where "UUU" parses as a number; copy to avoid SettingWithCopyWarning
df = df[pd.to_numeric(df["UUU"], errors="coerce").notnull()].copy()
df["UUU"] = df.loc[:, "UUU"].astype(float)
df["UUC"] = df.loc[:, "UUC"].astype(float)
df = df.loc[df["Ncodons"] >= 1000, :]
df = df.loc[df["Kingdom"] != "plm", :]
df = df.drop(["DNAtype", "SpeciesID", "Ncodons", "SpeciesName"], axis="columns")
kingdom_mapping = {
    "arc": 0,
    "bct": 1,
    "pln": 2,
    "inv": 2,
    "vrt": 2,
    "mam": 2,
    "rod": 2,
    "pri": 2,
    "phg": 3,
    "vrl": 4,
}
df = df.replace({"Kingdom": kingdom_mapping})
y = df.pop("Kingdom")


X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, random_state=0, stratify=y
)

sklearn_knn = KNeighborsClassifier(metric="cosine")

pynndescent_ann = make_pipeline(
    PyNNDescentTransformer(metric="cosine", random_state=0),
    KNeighborsClassifier(metric="precomputed"),
)

sklearn_knn.fit(X_train, y_train)
pynndescent_ann.fit(X_train, y_train)

j-adamczyk, Mar 18 '23 20:03