PyNomaly
Passing cluster_labels broken
I think I have found a bug that occurs when passing cluster_labels.
When I completely reverse the order of all input (data and cluster_labels) and then reverse the result (local_outlier_probabilities), I would expect the same numbers. This does happen as long as all cluster_labels values are equal. Once I have two (really separate) clusters, the results change when flipped!
An extra indication that things go wrong (IMHO): the neighbor indices reported for points in the second cluster refer to rows in the first cluster!
A small reproduction example:
import numpy as np
import matplotlib.pyplot as plt
from PyNomaly import loop

np.random.seed(1)
n = 9
# Two well-separated 2-D clusters of n points each, centred at (2, 2) and (8, 8).
data = np.append(np.random.normal(2, 1, [n, 2]), np.random.normal(8, 1, [n, 2]), axis=0)
clus = np.append(np.ones(n), 2 * np.ones(n)).tolist() # 2 cluster numbers!

# Fit on the original ordering.
model = loop.LocalOutlierProbability(data, n_neighbors=5, cluster_labels=clus)
fit = model.fit()
res = fit.local_outlier_probabilities
print(res)
print(fit.neighbor_matrix)

# Fit on the reversed ordering, then flip the results back so they line up with `data`.
data_flipped = np.flipud(data)
clus_flipped = np.flipud(clus).tolist()
model2 = loop.LocalOutlierProbability(data_flipped, n_neighbors=5, cluster_labels=clus_flipped)
fit2 = model2.fit()
res2 = np.flipud(fit2.local_outlier_probabilities)
print(res2)
print(np.flipud(fit2.neighbor_matrix))

# Marker size encodes the outlier probability: '+' for the original run, 'x' for the flipped run.
# If the results were order-independent, the two markers would have the same size at every point.
s = 1 + 100 * res.astype(float)
s2 = 1 + 100 * res2.astype(float)
plt.scatter(data[:, 0], data[:, 1], c=clus, s=s, marker='+')
plt.scatter(data[:, 0], data[:, 1], c=clus, s=s2, marker='x')
plt.show()
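The visual comparison could also be turned into an automated check. The following is only a sketch using the standard library: is_reversal_invariant, centroid_distance and position_dependent are hypothetical names made up for illustration, not part of PyNomaly, and the two toy scorers merely stand in for a real model.

```python
import math

def is_reversal_invariant(score_fn, data, labels, tol=1e-9):
    """Check that reversing the input rows only reverses the scores.

    score_fn(data, labels) must return one score per row.
    """
    forward = score_fn(data, labels)
    # Score the reversed input, then flip the scores back into the original order.
    backward = score_fn(data[::-1], labels[::-1])[::-1]
    return all(math.isclose(a, b, abs_tol=tol) for a, b in zip(forward, backward))

def centroid_distance(data, labels):
    """Toy scorer: distance of each point to the centroid of its own cluster."""
    scores = []
    for point, label in zip(data, labels):
        members = [p for p, lab in zip(data, labels) if lab == label]
        centroid = [sum(coords) / len(members) for coords in zip(*members)]
        scores.append(math.dist(point, centroid))
    return scores

def position_dependent(data, labels):
    """Deliberately broken toy scorer: the result depends on which row comes first."""
    return [math.dist(point, data[0]) for point in data]

data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = [1, 1, 1, 2, 2, 2]

print(is_reversal_invariant(centroid_distance, data, labels))
print(is_reversal_invariant(position_dependent, data, labels))
```

A wrapper that builds the model from the reproduction and returns its local_outlier_probabilities could be passed as score_fn in the same way, which would make the check usable as a regression test for a fix.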
The problem is in the 'definition' of neighbor_matrix: _compute_distance_and_neighbor_matrix returns indexes within the cluster, but _prob_distances_ev treats the numbers as being global.
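To illustrate the local-versus-global mismatch, here is a minimal standalone sketch, assuming a brute-force neighbor search per cluster. It is not PyNomaly's code: knn_within_clusters is a hypothetical helper, and the only point is the final translation from within-cluster positions back to global row numbers.

```python
import math

def knn_within_clusters(points, labels, k):
    """For each point, return the global row indices of its k nearest
    neighbors among the points that share its cluster label."""
    neighbors = [None] * len(points)
    for label in set(labels):
        # Global row numbers of the members of this cluster.
        members = [i for i, lab in enumerate(labels) if lab == label]
        for local_i, global_i in enumerate(members):
            # Distances to the other members, keyed by local (within-cluster) position.
            dists = [
                (math.dist(points[global_i], points[global_j]), local_j)
                for local_j, global_j in enumerate(members)
                if local_j != local_i
            ]
            local_nn = [local_j for _, local_j in sorted(dists)[:k]]
            # Translate local positions back to global rows. Storing local_nn
            # directly would make every cluster appear to point at rows 0..len-1.
            neighbors[global_i] = [members[j] for j in local_nn]
    return neighbors

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = [1, 1, 1, 2, 2, 2]
result = knn_within_clusters(points, labels, k=2)
print(result)
```

If local_nn were stored without the translation step, the rows of the second cluster would list neighbors 0, 1 and 2, which are rows of the first cluster. That matches the symptom described above.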
Hey @mdruiter - thanks for noting the issue and where it is occurring.
Are you able to submit a fix in a pull request?