kmodes
Document requirements for custom dissimilarity functions
I am trying to implement Hamming distance for categorical data, but I am getting an error:
C:\Users\Mukul.Sharma\AppData\Local\Continuum\Anaconda3\lib\site-packages\kmodes\kmodes.py in init_huang(X, n_clusters, dissim)
     39     # so set centroid to closest point in X.
     40     for ik in range(n_clusters):
---> 41         ndx = np.argsort(dissim(X, centroids[ik]))
     42         # We want the centroid to be unique, if possible.
     43         while np.all(X[ndx[0]] == centroids, axis=1).any() and ndx.shape[0] > 1:
My Hamming distance function is:
def hamming_distance(s1, s2):
    """Return the Hamming distance between equal-length sequences"""
    if len(s1) != len(s2):
        raise ValueError("Undefined for sequences of unequal length")
    return sum(el1 != el2 for el1, el2 in zip(s1, s2))
I am having issues with SciPy's Hamming distance (scipy.spatial.distance.hamming) as well. There the error says:
ValueError: Input vector should be 1-D
Can you please help me?
Could you also give me an idea of how to write my own custom distance metric, for example by explaining the internal workings of this algorithm (k-prototypes)?
Have a look here for how the other dissimilarity functions work: https://github.com/nicodv/kmodes/blob/master/kmodes/util/tests/test_dissim.py
Looks like you need to adapt your function to accept a 2D array as its first argument, whereas right now it assumes two 1D vectors.
I should document this better somewhere, so dedicating this ticket to that.
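For anyone landing here in the meantime, here is a minimal sketch of what a 2D-aware Hamming dissimilarity could look like. It is based only on the call visible in the traceback, `dissim(X, centroids[ik])`, whose result is passed to `np.argsort`: the first argument is a 2D array of points, the second is a single centroid, and the return value needs one distance per row. The name `hamming_dissim` and the `**_` catch-all for extra keyword arguments are assumptions for illustration, not documented requirements of the library.

```python
import numpy as np


def hamming_dissim(a, b, **_):
    """Count mismatching attributes between each row of `a` and centroid `b`.

    Hypothetical example function. `a` is expected to have shape
    (n_points, n_attributes) and `b` shape (n_attributes,).
    Extra keyword arguments are accepted and ignored (an assumption).
    """
    a = np.atleast_2d(a)
    b = np.asarray(b)
    if a.shape[-1] != b.shape[-1]:
        raise ValueError("Undefined for sequences of unequal length")
    # Broadcasting compares every row of `a` against `b`;
    # summing over axis=1 yields one distance per row.
    return np.sum(a != b, axis=1)


X = np.array([
    ["a", "x", "p"],
    ["a", "y", "q"],
    ["b", "y", "q"],
])
centroid = np.array(["a", "y", "q"])

distances = hamming_dissim(X, centroid)
# Same kind of call as in the traceback: rank the points by distance.
order = np.argsort(distances)
print(distances, order)
```

The important difference from the 1D version is that the result is an array with one entry per point rather than a single number, which is what `np.argsort` needs in order to rank the points.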
Just want to mention that the Hamming distance is generally the same as the "overlap" measure computed in matching_dissim().
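To illustrate that point, here is a small sketch comparing the original 1D `hamming_distance` from the question, applied row by row, with a vectorized mismatch count. The helper names `rowwise_hamming` and `overlap_count` are made up for this example, and `overlap_count` is only a stand-in for a count-of-mismatches measure, not a copy of the library's code.

```python
import numpy as np


def hamming_distance(s1, s2):
    """Return the Hamming distance between equal-length sequences."""
    if len(s1) != len(s2):
        raise ValueError("Undefined for sequences of unequal length")
    return sum(el1 != el2 for el1, el2 in zip(s1, s2))


def rowwise_hamming(a, b, **_):
    """Hypothetical wrapper: apply the 1D function to each row of a 2D array."""
    return np.array([hamming_distance(row, b) for row in np.atleast_2d(a)])


def overlap_count(a, b, **_):
    """Hypothetical stand-in for an overlap measure: mismatches per row."""
    return np.sum(np.atleast_2d(a) != b, axis=1)


X = np.array([
    ["red", "small", "round"],
    ["red", "large", "round"],
    ["blue", "large", "flat"],
])
mode = np.array(["red", "large", "round"])

rowwise = rowwise_hamming(X, mode)
vectorized = overlap_count(X, mode)
print(rowwise, vectorized)
```

Both give the same per-row counts, so a separate Hamming function may not be needed at all. Note that scipy.spatial.distance.hamming returns the proportion of mismatching components rather than the count, so its values differ from these by a factor of the number of attributes.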