spopt
[WIP] allow for precomputed distance matrices
This allows for precomputed distance matrices in the SKATER base class, `skater.SpanningForest`. For a usage example:

```python
import numpy
from libpysal import weights
from scipy.spatial import distance_matrix
from spopt.region.skater import SpanningForest

# ten observations with two features each
r = numpy.random.normal(size=(10, 2))
D = distance_matrix(r, r, p=1)  # l1 (Manhattan) metric
w = weights.lat2W(5, 2)  # contiguity weights for a 5-by-2 lattice (10 areas)
SpanningForest(dissimilarity='precomputed').fit(5, w, D)  # 5 clusters
```
Now, caveat emptor: the semantics are a little different here @Shruti-Patil, since this converts the score into minimizing the sum of dissimilarities within the clusters, rather than minimizing the distance between features and the feature centroid of the cluster.
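To make that difference concrete, here is a hypothetical sketch (not spopt's internal code) contrasting the two scores for a fixed cluster labelling:

```python
import numpy
from scipy.spatial import distance_matrix

numpy.random.seed(0)
X = numpy.random.normal(size=(10, 2))
labels = numpy.array([0] * 5 + [1] * 5)  # an illustrative 2-cluster assignment
D = distance_matrix(X, X, p=1)

# Centroid-style score: total L1 deviation of each observation
# from its cluster's feature mean.
centroid_score = sum(
    numpy.abs(X[labels == k] - X[labels == k].mean(axis=0)).sum()
    for k in (0, 1)
)

# Precomputed-style score: sum of pairwise dissimilarities
# within each cluster (each pair counted once).
within_score = sum(
    D[numpy.ix_(labels == k, labels == k)].sum() / 2
    for k in (0, 1)
)
```

The two scores generally rank candidate partitions differently, which is the semantic change being flagged here.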
If you'd like to try it on your data, use pip install git+https://github.com/ljwolf/spopt and follow the example above.
Codecov Report
Merging #188 (3aca9a9) into main (42520cc) will decrease coverage by 0.4%. The diff coverage is 33.3%.
```diff
@@            Coverage Diff            @@
##             main     #188     +/-  ##
=========================================
- Coverage    64.5%    64.1%    -0.4%
=========================================
  Files          17       17
  Lines        1771     1785      +14
  Branches      343      350       +7
=========================================
+ Hits         1143     1145       +2
- Misses        576      583       +7
- Partials       52       57       +5
```

| Impacted Files | Coverage Δ | |
|---|---|---|
| spopt/region/skater.py | 76.4% <33.3%> (-5.9%) | :arrow_down: |
Should we add a test for this or is it good without?
solution for #187
I am not sure I want to merge this unless more people beyond @Shruti-Patil find it useful. In the abstract, it seems like a good idea, and I'm all for user power. But, empirically, I haven't seen good performance when minimizing pre-computed dissimilarities.
I suppose the trick is that:
- For the current implementation, decisions directly minimize the deviation of the data relative to its center (median, mean, or any other user-supplied reduction).
- In the precomputed case implemented here, we can only minimize the dissimilarity within the cluster. Moreover, there's no clear guidance on whether this should be the total feature dissimilarity matrix, or the dissimilarity matrix after filtering by the possible joins (so that we only consider the dissimilarity of "connected" observations).
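The two candidate scores in the second point can be sketched as follows (hypothetical helper functions for illustration, not part of spopt):

```python
import numpy

def total_within_dissimilarity(D, members):
    """Sum of all pairwise dissimilarities among cluster members."""
    members = numpy.asarray(members)
    return D[numpy.ix_(members, members)].sum() / 2  # each pair counted once

def join_filtered_dissimilarity(D, members, adjacency):
    """Sum of dissimilarities over 'connected' pairs of members only.

    `adjacency` maps each observation to its neighbours, in the style
    of libpysal's `W.neighbors` dictionary.
    """
    members = set(int(m) for m in members)
    total = 0.0
    for i in members:
        for j in adjacency.get(i, ()):
            if j in members and i < j:  # count each edge once
                total += D[i, j]
    return total

# toy data: four observations on a line at 0, 1, 2, and 10
points = numpy.array([0.0, 1.0, 2.0, 10.0])
D = numpy.abs(points[:, None] - points[None, :])  # L1 distances
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}  # a path graph

total = total_within_dissimilarity(D, [0, 1, 2])                 # 1 + 2 + 1 = 4
filtered = join_filtered_dissimilarity(D, [0, 1, 2], adjacency)  # 1 + 1 = 2
```

The gap between the two (here, 4 versus 2) grows with cluster size, since the total score counts every pair while the filtered score only counts contiguous ones.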
So, without further empirical work (on our end) to verify which is the right score, I don't want this to land.
> without further empirical work (on our end) to verify which is the right score, I don't want this to land.
This is reasonable.
@ljwolf Shall we go ahead and close this out as stale?
The OP went in another direction for a solution, so let's close.