Document requirements for custom dissimilarity functions

Open sharma-ji opened this issue 5 years ago • 2 comments

I am trying to implement Hamming distance for categorical data, but I am getting an error:

    C:\Users\Mukul.Sharma\AppData\Local\Continuum\Anaconda3\lib\site-packages\kmodes\kmodes.py in init_huang(X, n_clusters, dissim)
         39         # so set centroid to closest point in X.
         40         for ik in range(n_clusters):
    ---> 41             ndx = np.argsort(dissim(X, centroids[ik]))
         42             # We want the centroid to be unique, if possible.
         43             while np.all(X[ndx[0]] == centroids, axis=1).any() and ndx.shape[0] > 1:

My Hamming distance is:

    def hamming_distance(s1, s2):
        """Return the Hamming distance between equal-length sequences"""
        if len(s1) != len(s2):
            raise ValueError("Undefined for sequences of unequal length")
        return sum(el1 != el2 for el1, el2 in zip(s1, s2))

I am having issues with SciPy's Hamming distance (scipy.spatial.distance.hamming) as well. Here the error says:

    ValueError: Input vector should be 1-D

Can you please help me? Also, could you give me an idea of how to write my own custom distance metric, for example by explaining the internal workings of this algorithm (K-prototypes)?

sharma-ji, Jan 11 '19 11:01

Have a look here for how the other dissimilarity functions work: https://github.com/nicodv/kmodes/blob/master/kmodes/util/tests/test_dissim.py

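Looks like you need to adapt your function to accept a 2D array as its first argument, whereas right now it assumes 1D vectors. As the traceback shows, it is called as dissim(X, centroids[ik]) and its result is passed to np.argsort, so it should return one dissimilarity value per row of X.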
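To illustrate, here is a minimal sketch of a Hamming-style dissimilarity that takes a 2D array and a single 1D point and returns one value per row. The name hamming_dissim and the sample data are made up for this example, and the **_ catch-all is an assumption: it is there in case the library passes extra keyword arguments, which may depend on the installed version.

```python
import numpy as np


def hamming_dissim(a, b, **_):
    """Count mismatching attributes between each row of `a` and the point `b`.

    `a` is expected to be 2D (n_rows, n_features) and `b` 1D (n_features,).
    Extra keyword arguments are accepted and ignored (assumption, see lead-in).
    """
    a = np.atleast_2d(a)
    b = np.asarray(b)
    if a.shape[1] != b.shape[-1]:
        raise ValueError("Undefined for sequences of unequal length")
    # Broadcasting compares every row of `a` with `b` element-wise;
    # summing along axis 1 gives the number of mismatches per row.
    return np.sum(a != b, axis=1)


# Hypothetical categorical data: 3 rows, 3 attributes each.
X = np.array([["a", "b", "c"],
              ["a", "x", "c"],
              ["z", "y", "w"]])
centroid = np.array(["a", "b", "c"])

d = hamming_dissim(X, centroid)
# Same pattern as line 41 of the traceback: rows sorted by closeness.
ndx = np.argsort(d)
```

A function of this shape could then presumably be passed in place of the default dissimilarity (for example via the cat_dissim argument of KModes, if your version exposes it).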

I should document this better somewhere, so dedicating this ticket to that.

nicodv, Jan 11 '19 19:01

Just want to mention that the Hamming distance is generally the same as the "overlap" measure computed in matching_dissim().
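A quick way to check this on your own data is to compare the row-by-row hamming_distance from the original post against a vectorized mismatch count. The overlap_count helper below is only a stand-in written for this example (it is not the library's code), and the sample data is made up.

```python
import numpy as np


def hamming_distance(s1, s2):
    """Row-wise Hamming distance, as posted in the question."""
    if len(s1) != len(s2):
        raise ValueError("Undefined for sequences of unequal length")
    return sum(el1 != el2 for el1, el2 in zip(s1, s2))


def overlap_count(a, b):
    """Stand-in for an overlap measure: mismatches per row of `a` against `b`."""
    return np.sum(a != b, axis=1)


# Hypothetical categorical data.
X = np.array([["red", "s", "yes"],
              ["red", "m", "no"],
              ["blue", "l", "no"]])
point = np.array(["red", "s", "no"])

# Apply the 1D function to each row, then compare with the vectorized count.
row_wise = np.array([hamming_distance(row, point) for row in X])
vectorized = overlap_count(X, point)
same = np.array_equal(row_wise, vectorized)
```

Note that scipy.spatial.distance.hamming returns the fraction of mismatching positions rather than the count, so its values differ from these by a factor of the number of attributes.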

ghost, Apr 03 '19 14:04