spopt icon indicating copy to clipboard operation
spopt copied to clipboard

Throws an error when silhoutte coefficients for two clusters are the same

Open vinkolar opened this issue 7 years ago • 2 comments

https://github.com/pysal/pysal/blob/8612af9f83cef755baa9e040222a7c200073690a/pysal/cg/rtree.py#L461

We are doing a max of a tuple with first element as a float (silhoutte coefft) and second element as a _NodeCursor. Generally, this works fine if the first element is different. However, there are corner cases when silhoutte for two clusters are exactly the same. At this point the second element of the tuple (_NodeCursor) is compared. And, ofcourse, comparison operators (>,<) are not valid for the class _NodeCursor. Hence it throws an error:

...
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pysal/cg/rtree.py", line 200, in insert
    self.cursor.insert(o, orect)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pysal/cg/rtree.py", line 417, in insert
    self._balance()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pysal/cg/rtree.py", line 462, in _balance
    [(silhouette_coeff(c, memo), c) for c in clusterings])
TypeError: '>' not supported between instances of '_NodeCursor' and '_NodeCursor'

Fix: Take the argmax of silhoutte score, and choose one cluster based on the index: Line number: 462: /Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pysal/cg/rtree.py

Replace below original code:

score, bestcluster = max(
  [(silhouette_coeff(c, memo), c) for c in clusterings])

by the below code

import numpy as np
#print([(silhouette_coeff(c, memo), c) for c in clusterings])
sil_scores = [silhouette_coeff(c, memo) for c in clusterings]
best_sil_index = np.argmax(sil_scores)
bestcluster = clusterings[best_sil_index]
score = sil_scores[best_sil_index]

vinkolar avatar Jan 31 '18 19:01 vinkolar

Hi, sorry for the delay on addressing this, but we're working to change implementations to pysal/region, a separate package released as region. We hope to have this done by the start of the summer, and I'll be sure to ping you when it occurs. Beyond this, I'll keep this open.

ljwolf avatar May 08 '18 12:05 ljwolf

A simple fix would be to pass key=lambda x: x[0] as a keyword argument to max().

In /cg/rtree.py, replace lines 461-462

score, bestcluster = max(
    [(silhouette_coeff(c, memo), c) for c in clusterings])

with this

score, bestcluster = max(
    [(silhouette_coeff(c, memo), c) for c in clusterings], key=lambda x: x[0])

The function then works as expected even when there are two or more equal silhouette coefficients.

badwolf259 avatar Jan 27 '19 09:01 badwolf259

Note to self: Try out suggested fix here in /libpysal/cg/rtree.py.

jGaboardi avatar Oct 22 '22 02:10 jGaboardi

@vinkolar @badwolf259 Do you have a reproducible example that generates silhouettes for two clusters are exactly the same?

jGaboardi avatar Oct 26 '22 17:10 jGaboardi

It actually seems that rtree isn't even used in spopt. Closing for now. We can reopen if I am missing something.

jGaboardi avatar Oct 26 '22 18:10 jGaboardi