
[WIP] allow for precomputed distance matrices

Open ljwolf opened this issue 4 years ago • 6 comments

This PR allows precomputed distance matrices in the SKATER base class, skater.SpanningForest. A usage example:

import numpy
from libpysal import weights
from scipy.spatial import distance_matrix
from spopt.region.skater import SpanningForest

r = numpy.random.normal(size=(10, 2))  # 10 observations, 2 features
D = distance_matrix(r, r, p=1)  # pairwise L1 (Manhattan) distances
w = weights.lat2W(5, 2)  # 5-by-2 lattice, matching the 10 observations

SpanningForest(dissimilarity='precomputed').fit(5, w, D)

Now, caveat emptor: the semantics are a little different here, @Shruti-Patil. This converts the score into minimizing the sum of dissimilarities within each cluster, rather than minimizing the distance between features and the cluster's feature centroid.
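For concreteness, here is a minimal sketch of the two objectives for a single candidate cluster, in plain NumPy rather than spopt's internal scoring code (variable names here are illustrative, not spopt API):

```python
import numpy

rng = numpy.random.default_rng(0)
X = rng.normal(size=(6, 2))  # features of one candidate cluster
# Pairwise L1 dissimilarities between all members.
D = numpy.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)

# Default (feature-based) objective: total deviation of each
# observation from the cluster's feature centroid.
centroid = X.mean(axis=0)
centroid_score = numpy.abs(X - centroid).sum()

# Precomputed objective: sum of pairwise dissimilarities within
# the cluster, counting each unordered pair once.
precomputed_score = D[numpy.triu_indices_from(D, k=1)].sum()

print(centroid_score, precomputed_score)
```

The two scores generally rank candidate partitions differently, which is why the precomputed path changes the algorithm's semantics rather than just its input format.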

ljwolf avatar Aug 12 '21 20:08 ljwolf

If you'd like to try it on your data, use pip install git+https://github.com/ljwolf/spopt and follow the example above.

ljwolf avatar Aug 12 '21 20:08 ljwolf

Codecov Report

Merging #188 (3aca9a9) into main (42520cc) will decrease coverage by 0.4%. The diff coverage is 33.3%.

Impacted file tree graph

@@           Coverage Diff           @@
##            main    #188     +/-   ##
=======================================
- Coverage   64.5%   64.1%   -0.4%     
=======================================
  Files         17      17             
  Lines       1771    1785     +14     
  Branches     343     350      +7     
=======================================
+ Hits        1143    1145      +2     
- Misses       576     583      +7     
- Partials      52      57      +5     
Impacted Files           Coverage Δ
spopt/region/skater.py   76.4% <33.3%> (-5.9%) ↓

codecov[bot] avatar Aug 12 '21 20:08 codecov[bot]

Should we add a test for this or is it good without?

jGaboardi avatar Aug 13 '21 00:08 jGaboardi

A solution for #187.

jGaboardi avatar Aug 13 '21 01:08 jGaboardi

I am not sure I want to merge this unless more people beyond @Shruti-Patil find it useful. In the abstract, it seems like a good idea, and I'm all for user power. But, empirically, I haven't seen good performance when minimizing the precomputed dissimilarities.

I suppose the trick is that

  • In the current implementation, decisions directly minimize the deviation of the data from its center (the median, mean, or any other user-supplied reduction).
  • In the precomputed case implemented here, we can only minimize the dissimilarity within the cluster. And there's no clear direction elsewhere on whether this should use the total feature dissimilarity matrix, or the dissimilarity matrix after filtering by the possible joins (so that we only consider the dissimilarity of "connected" observations).
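As a sketch of the distinction in the second bullet (again plain NumPy, with a hypothetical path-graph adjacency rather than anything from spopt or libpysal):

```python
import numpy

rng = numpy.random.default_rng(1)
X = rng.normal(size=(5, 2))
# Pairwise L1 dissimilarities among all 5 observations.
D = numpy.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)

# Hypothetical spatial graph: a 5-node path, so only consecutive
# observations are "connected".
A = numpy.zeros((5, 5), dtype=bool)
idx = numpy.arange(4)
A[idx, idx + 1] = A[idx + 1, idx] = True

members = numpy.array([0, 1, 2])  # one candidate cluster
sub = D[numpy.ix_(members, members)]

# Option 1: total feature dissimilarity over all member pairs.
total_score = sub[numpy.triu_indices_from(sub, k=1)].sum()

# Option 2: dissimilarity only over pairs joined by an edge,
# counting each undirected edge once.
edge_mask = A[numpy.ix_(members, members)]
edge_score = (sub * edge_mask).sum() / 2

print(total_score, edge_score)
```

Since the edge-filtered pairs are a subset of all member pairs, the filtered score is never larger, and the two scores can prefer different partitions.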

So, without further empirical work (on our end) to verify which is the right score, I don't want this to land.

ljwolf avatar Aug 17 '21 13:08 ljwolf

without further empirical work (on our end) to verify which is the right score, I don't want this to land.

This is reasonable.

jGaboardi avatar Aug 17 '21 13:08 jGaboardi

@ljwolf Shall we go ahead and close this out as stale?

jGaboardi avatar Oct 22 '22 02:10 jGaboardi

The OP went in another direction for a solution, so let's close.

jGaboardi avatar Oct 24 '22 16:10 jGaboardi