pynndescent
Cosine metric - error "Negative values in data passed to precomputed distance matrix"
When PyNNDescentTransformer is used in a pipeline with KNeighborsClassifier and the cosine metric, fitting succeeds but calling predict raises an error:
pynndescent_ann = make_pipeline(
    PyNNDescentTransformer(metric="cosine", random_state=0),
    KNeighborsClassifier(metric="precomputed"),
)
pynndescent_ann.fit(X_train, y_train)
y_pred_pynndescent = pynndescent_ann.predict(X_test)  # raises ValueError
Error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/tmp/ipykernel_62071/159833562.py in <cell line: 35>()
33
34 start_time = time()
---> 35 y_pred_pynndescent = pynndescent_ann.predict(X_test)
36 end_time = time()
37
~/anaconda3/envs/podstawy-uczenia-maszynowego-rozwiazania/lib/python3.10/site-packages/sklearn/pipeline.py in predict(self, X, **predict_params)
479 for _, name, transform in self._iter(with_final=False):
480 Xt = transform.transform(Xt)
--> 481 return self.steps[-1][1].predict(Xt, **predict_params)
482
483 @available_if(_final_estimator_has("fit_predict"))
~/anaconda3/envs/podstawy-uczenia-maszynowego-rozwiazania/lib/python3.10/site-packages/sklearn/neighbors/_classification.py in predict(self, X)
232 # In that case, we do not need the distances to perform
233 # the weighting so we do not compute them.
--> 234 neigh_ind = self.kneighbors(X, return_distance=False)
235 neigh_dist = None
236 else:
~/anaconda3/envs/podstawy-uczenia-maszynowego-rozwiazania/lib/python3.10/site-packages/sklearn/neighbors/_base.py in kneighbors(self, X, n_neighbors, return_distance)
802 else:
803 if self.metric == "precomputed":
--> 804 X = _check_precomputed(X)
805 else:
806 X = self._validate_data(X, accept_sparse="csr", reset=False, order="C")
~/anaconda3/envs/podstawy-uczenia-maszynowego-rozwiazania/lib/python3.10/site-packages/sklearn/neighbors/_base.py in _check_precomputed(X)
192 copied = graph.format != "csr"
193 graph = check_array(graph, accept_sparse="csr")
--> 194 check_non_negative(graph, whom="precomputed distance matrix.")
195 graph = sort_graph_by_row_values(graph, copy=not copied, warn_when_not_sorted=True)
196
~/anaconda3/envs/podstawy-uczenia-maszynowego-rozwiazania/lib/python3.10/site-packages/sklearn/utils/validation.py in check_non_negative(X, whom)
1416
1417 if X_min < 0:
-> 1418 raise ValueError("Negative values in data passed to %s" % whom)
1419
1420
ValueError: Negative values in data passed to precomputed distance matrix.
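For reference, the check that fails in the last frame boils down to something like the following. This is a simplified pure-Python stand-in written for this report (the name check_non_negative_sketch and the example distances are hypothetical), not scikit-learn's actual implementation:

```python
def check_non_negative_sketch(values, whom):
    """Simplified stand-in for the validation step shown in the traceback."""
    # Hypothetical helper: look at the smallest value and reject anything below zero.
    smallest = min(values, default=0.0)
    if smallest < 0:
        raise ValueError("Negative values in data passed to %s" % whom)


# Made-up distances: a single slightly negative entry is enough to fail the check.
distances = [0.0, 0.25, -1e-12]

try:
    check_non_negative_sketch(distances, whom="precomputed distance matrix.")
    message = None
except ValueError as err:
    message = str(err)

print(message)
```

So even one tiny negative entry in the neighbor graph returned by the transformer would be enough to trigger the error.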
This makes sense, since according to the docs (https://pynndescent.readthedocs.io/en/latest/pynndescent_metrics.html#Beware-of-bounded-distances):
This means that, for example, internally PyNNDescent uses the negative log of the cosine similarity instead of cosine distance (and converts the distance values when done).
However, those internal distances are probably not converted back to true cosine distances when using PyNNDescentTransformer, which would explain the negative values and hence the error.
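To make the quoted sentence concrete, here is a small pure-Python sketch of the relationship between the negative log of the cosine similarity and the cosine distance. It assumes the natural logarithm (the docs quote does not state the base), the helper names are hypothetical, and it is not PyNNDescent's internal code:

```python
import math


def cosine_similarity(x, y):
    """Plain cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def neg_log_cosine(x, y):
    """Assumed internal form: negative (natural) log of the cosine similarity."""
    return -math.log(cosine_similarity(x, y))


def to_cosine_distance(d):
    """Convert the assumed internal value back to a cosine distance."""
    return 1.0 - math.exp(-d)


# Made-up nonnegative vectors, similar in spirit to codon percentages.
x = [0.2, 0.3, 0.5]
y = [0.1, 0.6, 0.3]

internal = neg_log_cosine(x, y)
converted = to_cosine_distance(internal)
direct = 1.0 - cosine_similarity(x, y)

print(internal, converted, direct)

# A slightly negative internal value maps to a negative "distance".
print(to_cosine_distance(-1e-9))
```

Under these assumptions the converted value matches the directly computed cosine distance, and any internal value that dips below zero, for example through floating-point round-off, comes out negative after conversion. Whether that is what actually happens inside PyNNDescentTransformer is not confirmed here.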
Thanks, I'll try to look into this when I get a chance.
The dataset used was Codon Usage. Interestingly enough, it contains only nonnegative values (codon percentages), so the cosine similarity is always nonnegative and the regular cosine distance stays within [0, 1]. Exact code:
import pandas as pd
from pynndescent import PyNNDescentTransformer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
df = pd.read_csv("codon_usage.csv")
df = df[pd.to_numeric(df["UUU"], errors="coerce").notnull()].copy()
df = df.copy() # to avoid irritating SettingWithCopyWarning
df["UUU"] = df.loc[:, "UUU"].astype(float)
df["UUC"] = df.loc[:, "UUC"].astype(float)
df = df.loc[df["Ncodons"] >= 1000, :]
df = df.loc[df["Kingdom"] != "plm", :]
df = df.drop(["DNAtype", "SpeciesID", "Ncodons", "SpeciesName"], axis="columns")
kingdom_mapping = {
    "arc": 0,
    "bct": 1,
    "pln": 2,
    "inv": 2,
    "vrt": 2,
    "mam": 2,
    "rod": 2,
    "pri": 2,
    "phg": 3,
    "vrl": 4,
}
df = df.replace({"Kingdom": kingdom_mapping})
y = df.pop("Kingdom")
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, random_state=0, stratify=y
)
sklearn_knn = KNeighborsClassifier(metric="cosine")
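pynndescent_ann = make_pipeline(
    PyNNDescentTransformer(metric="cosine", random_state=0),
    KNeighborsClassifier(metric="precomputed"),
)
sklearn_knn.fit(X_train, y_train)
pynndescent_ann.fit(X_train, y_train)
# The call from the traceback; this is where the ValueError is raised.
y_pred_pynndescent = pynndescent_ann.predict(X_test)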