pyclustering icon indicating copy to clipboard operation
pyclustering copied to clipboard

`kmeans._kmeans__update_centers` fail to optimize custom `metric`

Open yasirroni opened this issue 2 years ago • 0 comments

I've investigated pyclustering to do time series clustering using custom metric, that is dtw. See below for the custom metric and the code. The result is that pyclustering stuck at local optima, even though able to use the custom metric. That is because pyclustering did know how to update based on the custom metric, since kmeans._kmeans__update_centers did not fully consider the custom metric and use centers[index] = cluster_points.mean(axis=0).

Possible solution is to give user a custom cluster centers update strategy.

For the mean time, I'm using other package, but I love pyclustering idea to give user a custom metric that suitable for research use.

import array
import math
import numpy as np
from numba import jit

@jit(nopython=True)
def jit_dtw_distance(s1, s2):
    """
    Dynamic Time Warping.
    
    jitted version of:
    https://github.com/wannesm/dtaidistance/blob/master/dtaidistance/dtw.py#L186
    @5ebdd25b820a02dde7f25f05b4cf7c0faad9840d
    
    Improved 
    from 53.7 ms ± 4.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    to 127 µs ± 15.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

    This function keeps a compact matrix, not the full warping paths matrix.

    Uses dynamic programming to compute:

    wps[i, j] = (s1[i]-s2[j])**2 + min(
                    wps[i-1, j  ] + penalty,  // vertical   / insertion / expansion
                    wps[i  , j-1] + penalty,  // horizontal / deletion  / compression
                    wps[i-1, j-1])            // diagonal   / match
    dtw = sqrt(wps[-1, -1])

    :param s1: First sequence
    :param s2: Second sequence
    :param window: Only allow for maximal shifts from the two diagonals smaller than this number.
        It includes the diagonal, meaning that an Euclidean distance is obtained by setting window=1.
    :param max_dist: Stop if the returned values will be larger than this value
    :param max_step: Do not allow steps larger than this value
    :param max_length_diff: Return infinity if length of two series is larger
    :param penalty: Penalty to add if compression or expansion is applied
    :param psi: Psi relaxation parameter (ignore start and end of matching).
        Useful for cyclical series.
    :param use_c: Use fast pure c compiled functions
    :param use_pruning: Prune values based on Euclidean distance.
        This is the same as passing ub_euclidean() to max_dist
    :param only_ub: Only compute the upper bound (Euclidean).

    Returns: DTW distance
    """
    r, c = len(s1), len(s2)
    window = max(r, c)
    max_step = np.inf
    max_dist = np.inf
    
    penalty = 0
    psi_1b, psi_1e, psi_2b, psi_2e = (0, 0, 0, 0)
    
    length = min(c + 1, abs(r - c) + 2 * (window - 1) + 1 + 1 + 1)

    dtw = np.inf * np.ones(2 * length)
    sc = 0
    ec = 0
    ec_next = 0
    smaller_found = False
    for i in range(psi_2b + 1):
        dtw[i] = 0
    skip = 0
    i0 = 1
    i1 = 0
    psi_shortest = np.inf
    for i in range(r):
        skipp = skip
        skip = max(0, i - max(0, r - c) - window + 1)
        i0 = 1 - i0
        i1 = 1 - i1
        for ii in range(i1*length, i1*length+length):
            dtw[ii] = np.inf
        j_start = max(0, i - max(0, r - c) - window + 1)
        j_end = min(c, i + max(0, c - r) + window)
        if sc > j_start:
            j_start = sc
        smaller_found = False
        ec_next = i
        if length == c + 1:
            skip = 0
        if psi_1b != 0 and j_start == 0 and i < psi_1b:
            dtw[i1 * length] = 0
        for j in range(j_start, j_end):
            d = (s1[i] - s2[j])**2
            if d > max_step:
                continue
            assert j + 1 - skip >= 0
            assert j - skipp >= 0
            assert j + 1 - skipp >= 0
            assert j - skip >= 0
            dtw[i1 * length + j + 1 - skip] = d + min(dtw[i0 * length + j - skipp],
                                                      dtw[i0 * length + j + 1 - skipp] + penalty,
                                                      dtw[i1 * length + j - skip] + penalty)
            if dtw[i1 * length + j + 1 - skip] > max_dist:
                if not smaller_found:
                    sc = j + 1
                if j >= ec:
                    break
            else:
                smaller_found = True
                ec_next = j + 1
        ec = ec_next
        if psi_1e != 0 and j_end == len(s2) and len(s1) - 1 - i <= psi_1e:
            psi_shortest = min(psi_shortest, dtw[i1 * length + j_end - skip])
    if psi_1e == 0 and psi_2e == 0:
        d = dtw[i1 * length + min(c, c + window - 1) - skip]
    else:
        ic = min(c, c + window - 1) - skip
        if psi_2e != 0:
            vc = dtw[i1 * length + ic - psi_2e:i1 * length + ic + 1]
            d = min(np.min(vc), psi_shortest)
        else:
            d = min(dtw[i1 * length + min(c, c + window - 1) - skip], psi_shortest)
    if max_dist and d > max_dist:
        d = np.inf
    return d
from pyclustering.cluster.kmeans import kmeans
from pyclustering.cluster.center_initializer import kmeans_plusplus_initializer
from pyclustering.utils.metric import type_metric, distance_metric

from pyclustering.cluster.kmeans import kmeans
from pyclustering.cluster.center_initializer import kmeans_plusplus_initializer
from pyclustering.utils.metric import type_metric, distance_metric

yasirroni avatar Jan 04 '22 06:01 yasirroni