pyclustering
pyclustering copied to clipboard
`kmeans._kmeans__update_centers` fail to optimize custom `metric`
I've investigated pyclustering
to do time series clustering using custom metric, that is dtw
. See below for the custom metric
and the code. The result is that pyclustering
stuck at local optima, even though able to use the custom metric
. That is because pyclustering
did know how to update based on the custom metric
, since kmeans._kmeans__update_centers
did not fully consider the custom metric
and use centers[index] = cluster_points.mean(axis=0)
.
Possible solution is to give user a custom cluster centers update strategy.
For the mean time, I'm using other package, but I love pyclustering
idea to give user a custom metric
that suitable for research use.
import array
import math
import numpy as np
from numba import jit
@jit(nopython=True)
def jit_dtw_distance(s1, s2):
"""
Dynamic Time Warping.
jitted version of:
https://github.com/wannesm/dtaidistance/blob/master/dtaidistance/dtw.py#L186
@5ebdd25b820a02dde7f25f05b4cf7c0faad9840d
Improved
from 53.7 ms ± 4.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
to 127 µs ± 15.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
This function keeps a compact matrix, not the full warping paths matrix.
Uses dynamic programming to compute:
wps[i, j] = (s1[i]-s2[j])**2 + min(
wps[i-1, j ] + penalty, // vertical / insertion / expansion
wps[i , j-1] + penalty, // horizontal / deletion / compression
wps[i-1, j-1]) // diagonal / match
dtw = sqrt(wps[-1, -1])
:param s1: First sequence
:param s2: Second sequence
:param window: Only allow for maximal shifts from the two diagonals smaller than this number.
It includes the diagonal, meaning that an Euclidean distance is obtained by setting window=1.
:param max_dist: Stop if the returned values will be larger than this value
:param max_step: Do not allow steps larger than this value
:param max_length_diff: Return infinity if length of two series is larger
:param penalty: Penalty to add if compression or expansion is applied
:param psi: Psi relaxation parameter (ignore start and end of matching).
Useful for cyclical series.
:param use_c: Use fast pure c compiled functions
:param use_pruning: Prune values based on Euclidean distance.
This is the same as passing ub_euclidean() to max_dist
:param only_ub: Only compute the upper bound (Euclidean).
Returns: DTW distance
"""
r, c = len(s1), len(s2)
window = max(r, c)
max_step = np.inf
max_dist = np.inf
penalty = 0
psi_1b, psi_1e, psi_2b, psi_2e = (0, 0, 0, 0)
length = min(c + 1, abs(r - c) + 2 * (window - 1) + 1 + 1 + 1)
dtw = np.inf * np.ones(2 * length)
sc = 0
ec = 0
ec_next = 0
smaller_found = False
for i in range(psi_2b + 1):
dtw[i] = 0
skip = 0
i0 = 1
i1 = 0
psi_shortest = np.inf
for i in range(r):
skipp = skip
skip = max(0, i - max(0, r - c) - window + 1)
i0 = 1 - i0
i1 = 1 - i1
for ii in range(i1*length, i1*length+length):
dtw[ii] = np.inf
j_start = max(0, i - max(0, r - c) - window + 1)
j_end = min(c, i + max(0, c - r) + window)
if sc > j_start:
j_start = sc
smaller_found = False
ec_next = i
if length == c + 1:
skip = 0
if psi_1b != 0 and j_start == 0 and i < psi_1b:
dtw[i1 * length] = 0
for j in range(j_start, j_end):
d = (s1[i] - s2[j])**2
if d > max_step:
continue
assert j + 1 - skip >= 0
assert j - skipp >= 0
assert j + 1 - skipp >= 0
assert j - skip >= 0
dtw[i1 * length + j + 1 - skip] = d + min(dtw[i0 * length + j - skipp],
dtw[i0 * length + j + 1 - skipp] + penalty,
dtw[i1 * length + j - skip] + penalty)
if dtw[i1 * length + j + 1 - skip] > max_dist:
if not smaller_found:
sc = j + 1
if j >= ec:
break
else:
smaller_found = True
ec_next = j + 1
ec = ec_next
if psi_1e != 0 and j_end == len(s2) and len(s1) - 1 - i <= psi_1e:
psi_shortest = min(psi_shortest, dtw[i1 * length + j_end - skip])
if psi_1e == 0 and psi_2e == 0:
d = dtw[i1 * length + min(c, c + window - 1) - skip]
else:
ic = min(c, c + window - 1) - skip
if psi_2e != 0:
vc = dtw[i1 * length + ic - psi_2e:i1 * length + ic + 1]
d = min(np.min(vc), psi_shortest)
else:
d = min(dtw[i1 * length + min(c, c + window - 1) - skip], psi_shortest)
if max_dist and d > max_dist:
d = np.inf
return d
from pyclustering.cluster.kmeans import kmeans
from pyclustering.cluster.center_initializer import kmeans_plusplus_initializer
from pyclustering.utils.metric import type_metric, distance_metric
from pyclustering.cluster.kmeans import kmeans
from pyclustering.cluster.center_initializer import kmeans_plusplus_initializer
from pyclustering.utils.metric import type_metric, distance_metric