
How to deal with duplicated samples?

katzurik opened this issue on Jun 08 '22 · 7 comments

We have a large list of inputs, some of which are duplicates (embeddings of short texts). We found that if we do not remove the duplicates, some of the duplicated samples end up in different clusters.

  1. Is this supposed to happen?
  2. Does including the duplicates improve the clustering at all? (Under the assumption that common inputs should weight the cluster "around" them.)

Thank you

katzurik · Jun 08 '22

Duplicates ending up in different clusters shouldn't happen; it isn't intended behaviour, so something is a little odd there. It is possible if everything lines up just so (core distances, cluster selection methods, etc.), but perhaps there is a bug? Can you run with algorithm="generic" and see if the problem persists?
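For reference, that check would look something like this (the random data here is just a stand-in for your embedded texts; min_cluster_size is an illustrative choice):

import numpy as np
import hdbscan

X = np.random.default_rng(0).normal(size=(200, 5))  # stand-in for your vectors
clusterer = hdbscan.HDBSCAN(min_cluster_size=5, algorithm="generic").fit(X)
clusterer.labels_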

Including duplicates can "improve" the clustering, but only if you really want to give more weight to the actual duplicates.

As a potential workaround it may be useful to add a very small amount of noise to duplicates to separate them slightly from each other.
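A minimal sketch of that jitter workaround, assuming numpy arrays (the noise scale eps is an illustrative choice, not a recommendation; keep it far below real inter-point distances):

import numpy as np

rng = np.random.default_rng(0)
X = np.array([[1.0, 1.0]] * 10 + [[2.0, 2.0]] * 10)  # toy data with exact duplicates
eps = 1e-6  # tiny relative to the spacing between distinct points
X_jittered = X + rng.normal(scale=eps, size=X.shape)  # every duplicate is now slightly separated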

lmcinnes · Jun 08 '22

I agree with Leland that duplicates ending up in different clusters shouldn't happen. It might be worth double-checking that your duplicate texts are being vectorized to the same point. A common text processing chain is text -> vector representation -> hdbscan, and there are a couple of ways things can go wrong in the vectorization step that might affect the hdbscan clustering.

It could certainly be a bug in hdbscan, but since it's a quick check it is probably worth a look.
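A quick sanity check along those lines, using scikit-learn's TfidfVectorizer as a stand-in for whatever vectorizer you actually use:

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

texts = ["the cat sat", "the cat sat", "a dog ran"]  # first two are exact duplicates
X = TfidfVectorizer().fit_transform(texts).toarray()
print(np.array_equal(X[0], X[1]))  # True: duplicate texts map to the same point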

jc-healy · Jun 13 '22

We double-checked by copying the vector representation and inserting it for each duplicate.

Running with the plain "generic" algorithm doesn't change the results regarding the duplicates.

katzurik · Jun 13 '22

Awesome, I'm happy to hear that you'd already checked that potential problem. Of course, that does mean we've got some particularly odd behaviour going on. Any chance you could post a small reproducible data set? Feel free to strip the text, but the set of numeric vectors and a few lines of code demonstrating the problem would be handy in figuring out what is going on.

jc-healy · Jun 13 '22

Here is a simple example:

import numpy as np
import hdbscan

data = np.array([[1, 1]] * 500)
clusterer = hdbscan.HDBSCAN(min_cluster_size=5, cluster_selection_method='eom').fit(data)
clusterer.labels_

result:

array([ 1, -1, -1,  0,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  1, -1,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0, -1])

Pavelrst · Jun 20 '22

Thanks for the code example; examples always make things easier.

That particular example looks like a problem with passing hdbscan a single cluster of data, rather than with the duplicates themselves. The default behaviour of hdbscan requires more than one cluster, so for this example the best thing to do is to use allow_single_cluster=True. With the following cluster parameters you get the expected labels of all zeros back from hdbscan:

clusterer = hdbscan.HDBSCAN(min_cluster_size=5, cluster_selection_method='eom', allow_single_cluster=True).fit(data)
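A quick way to confirm that, reusing the numpy import from the example above:

np.all(clusterer.labels_ == 0)  # True: every point now lands in the single cluster 0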

That said, I feel like we should at least be trying to detect when this occurs and warn the end user, perhaps suggesting that they try setting allow_single_cluster to True.
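As a rough illustration of what such a check might look like (fit_with_single_cluster_hint and the 50% noise threshold are hypothetical, not part of the hdbscan API):

import warnings
import numpy as np
import hdbscan

def fit_with_single_cluster_hint(X, **kwargs):
    # hypothetical wrapper: warn when the result looks like a single-cluster case
    clusterer = hdbscan.HDBSCAN(**kwargs).fit(X)
    noise_fraction = np.mean(clusterer.labels_ == -1)
    if noise_fraction > 0.5 and not kwargs.get("allow_single_cluster", False):
        warnings.warn(
            "Most points were labelled as noise; if you expect a single "
            "cluster, try allow_single_cluster=True."
        )
    return clusterer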

I tried altering your example to have two piles of exact duplicates and wasn't able to produce any unexpected fragmentation.

data = np.vstack((np.array([[1, 1]] * 500), np.array([[2, 2]] * 500)))
clusterer = hdbscan.HDBSCAN(min_cluster_size=5, cluster_selection_method='eom').fit(data)
clusterer.labels_

This generates 500 zeros followed by 500 ones, as expected.

jc-healy · Jun 20 '22