spopt
spopt copied to clipboard
Throws an error when silhoutte coefficients for two clusters are the same
https://github.com/pysal/pysal/blob/8612af9f83cef755baa9e040222a7c200073690a/pysal/cg/rtree.py#L461
We are doing a max of a tuple with first element as a float (silhoutte coefft) and second element as a _NodeCursor. Generally, this works fine if the first element is different. However, there are corner cases when silhoutte for two clusters are exactly the same. At this point the second element of the tuple (_NodeCursor) is compared. And, ofcourse, comparison operators (>,<) are not valid for the class _NodeCursor. Hence it throws an error:
...
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pysal/cg/rtree.py", line 200, in insert
self.cursor.insert(o, orect)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pysal/cg/rtree.py", line 417, in insert
self._balance()
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pysal/cg/rtree.py", line 462, in _balance
[(silhouette_coeff(c, memo), c) for c in clusterings])
TypeError: '>' not supported between instances of '_NodeCursor' and '_NodeCursor'
Fix: Take the argmax of silhoutte score, and choose one cluster based on the index: Line number: 462: /Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pysal/cg/rtree.py
Replace below original code:
score, bestcluster = max(
[(silhouette_coeff(c, memo), c) for c in clusterings])
by the below code
import numpy as np
#print([(silhouette_coeff(c, memo), c) for c in clusterings])
sil_scores = [silhouette_coeff(c, memo) for c in clusterings]
best_sil_index = np.argmax(sil_scores)
bestcluster = clusterings[best_sil_index]
score = sil_scores[best_sil_index]
Hi, sorry for the delay on addressing this, but we're working to change implementations to pysal/region, a separate package released as region. We hope to have this done by the start of the summer, and I'll be sure to ping you when it occurs. Beyond this, I'll keep this open.
A simple fix would be to pass key=lambda x: x[0] as a keyword argument to max().
In /cg/rtree.py, replace lines 461-462
score, bestcluster = max(
[(silhouette_coeff(c, memo), c) for c in clusterings])
with this
score, bestcluster = max(
[(silhouette_coeff(c, memo), c) for c in clusterings], key=lambda x: x[0])
The function then works as expected even when there are two or more equal silhouette coefficients.
Note to self: Try out suggested fix here in /libpysal/cg/rtree.py.
@vinkolar @badwolf259 Do you have a reproducible example that generates silhouettes for two clusters are exactly the same?
It actually seems that rtree isn't even used in spopt. Closing for now. We can reopen if I am missing something.