[WIP] add spatial correlogram function
add first draft of correlogram
supersedes #253
there's nothing harder than pulling off a successful rebase. I'll die on this hill.
This should be just about good to go, but it's still failing for stuff like join counts, because the class stashes all sorts of things as attributes (like the W and its adjacency list) that can't get serialized by joblib, and for LOSH/Spatial_Pearson, which have a different signature that needs w.sparse.
Smart. The issue I was running into was that w.to_adjlist sorts the neighbors, so once you convert and try to subset the W, it and the df are no longer aligned, which is a mess.
--
Elijah Knaap, Ph.D.
Senior Research Scientist & Associate Director
Center for Open Geographical Science, San Diego State University
knaaptime.com | @knaaptime

On Aug 6, 2023 at 1:47 AM -0700, Martin Fleischmann wrote:
@martinfleis commented on this pull request. In esda/correlogram.py:
```python
with NotImplementedError as e:
    raise e("Only I, G, and C statisticsics are currently implemented")
if distance_type == "band":
    W = DistanceBand
elif distance_type == "knn":
    if max(distances) > gdf.shape[0] - 1:
        with ValueError as e:
            raise e("max number of neighbors must be less than or equal to n-1")
    W = KNN
else:
    with NotImplementedError as e:
        raise e("distance_type must be either `band` or `knn`")
```

should be able to build the tree once and reuse it?
But in practice, I'm not seeing any real difference from starting a new W from scratch each time.
To keep the structure, you can just assign weight=0 to everything above the set distance threshold.
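A minimal sketch of that suggestion, outside of libpysal: compute the pairwise distance matrix once, then derive each distance-band W by zeroing entries above the threshold, so the (n, n) structure is identical at every threshold. All names here are illustrative, not esda's API.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# toy coordinates; in the correlogram, this would be the gdf's geometry
coords = np.random.default_rng(0).random((30, 2))

d = squareform(pdist(coords))   # full pairwise distance matrix, built once
np.fill_diagonal(d, np.inf)     # exclude self-neighbors

def band_weights(threshold):
    # same (n, n) structure at every threshold; weights beyond it are just 0
    return (d <= threshold).astype(float)

w_small = band_weights(0.2)
w_large = band_weights(0.4)
```

Because the structure never changes, the set of nonzero entries can only grow with the threshold, which is what keeps subsetting aligned with the dataframe.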
Codecov Report
:x: Patch coverage is 77.89474% with 21 lines in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 82.7%. Comparing base (051c715) to head (45d19e1).
:warning: Report is 2 commits behind head on main.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| esda/correlogram.py | 77.7% | 21 Missing :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## main #259 +/- ##
=======================================
+ Coverage 82.0% 82.7% +0.6%
=======================================
Files 25 27 +2
Lines 3538 3825 +287
=======================================
+ Hits 2902 3162 +260
- Misses 636 663 +27
| Files with missing lines | Coverage Δ | |
|---|---|---|
| esda/__init__.py | 100.0% <100.0%> (ø) | |
| esda/correlogram.py | 77.7% <77.7%> (ø) | |
what other tests do folks think are necessary for this?
I think at least:
- verify that the documented association measures work as expected
- verify that a simple callable completes its computation
Then I think it's done!
Cool. I think 1 is covered by the existing test using I? Just need to add a generic callable test for 2?
> covered by the test using I
Ah ok, in light of the previous discussion, fine! I meant that we should test all of the estimators we specify to ensure correctness.
Yes on the generic callable test!
Then, I think the last bit is just the test failures?
I'm trying to start a re-run of these tests, but I think you've protected your remote @knaaptime? I made @jGaboardi's whitespace edit, and tried the following:
I think there's something weird with my GitHub at the moment. All my review comments say 'pending' too. Let me see.
I've refreshed this
OK, tests pass. It would be good for @martinfleis and/or @jGaboardi to confirm whether their comments have been addressed. Please mark your reviews as resolved if so. This would be good to ship.
I think safe to merge.
I suppose this can be merged, though there are two comments I have.
- I am not excited about new functionality being built on top of W instead of Graph, especially when all of the esda statistics this consumes are Graph-native.
- The notebook could use some care. At least the empty cells could be removed.
@knaaptime I've added the nonparametric correlogram. As I understand it, it's lowess() predicting z*z.T using d. I think a similar insight powers something else (Sec. 2.2) landing in pysal soon from @weikang9009 and me. I'm using statsmodels to do this.
Looking into the W vs. Graph choice here, one challenge in moving this to the Graph object is that W allowed users to pass a pre-constructed KDTree and simply request a new W from this prebuilt tree, but Graph does not. This is what makes the correlogram() function performant. If we adjust Graph to allow build_distance_band/build_knn to accept a KDTree input, then the migration path is clearer.
For now, I can build the slower wrapper, but I think this should be discussed.
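To illustrate why the prebuilt-tree pattern matters, here is a sketch using scipy's KDTree directly: the tree is built once and then queried per support value, for both knn and distance-band neighbor structures. The function names are hypothetical, not esda's or libpysal's API.

```python
import numpy as np
from scipy.spatial import KDTree

# toy point set; in the correlogram this comes from the geometries
coords = np.random.default_rng(0).random((100, 2))

tree = KDTree(coords)  # built once, queried for every support value

def knn_neighbors(k):
    # nearest neighbors from the shared tree; column 0 is the point itself
    _, idx = tree.query(coords, k=k + 1)
    return idx[:, 1:]

def band_neighbors(threshold):
    # neighbors within `threshold`, again from the same tree
    return tree.query_ball_point(coords, r=threshold)

knn5 = knn_neighbors(5)
band = band_neighbors(0.15)
```

Rebuilding the tree at every support value repeats the O(n log n) construction each time; with a shared tree, each step is only the query.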
Okay, let's keep it as is and make a required change in Graph to allow this.
Having my hands in this code now, I'd suggest the following argument renames:
- `gdf` should become `geometry`. We can always accept a `GeoDataFrame` and get `geometry.geometry` when needed, but users should also be able to pass `gdf.geometry` for explicitness.
- `variable` should be a series or array, not a string. This matches the API elsewhere in `esda`. This came from `segregation`, which is mostly `gdf`/`str` based, right?
- `distances` should be `support`. This will mimic other places (like in `pointpats`) where we use `distances` to refer to a pairwise distance matrix and `support` for the values at which a function is evaluated. Further, when `distance_type=='knn'`, `distances` are not distances. Finally, this will also make it clearer for users to estimate a lowess on precomputed distances:
```python
correlogram(
    gdf.geometry,
    gdf.variable,
    support=[0, 1, 2, 3],
    statistic='lowess',
    stat_kwargs=dict(
        metric='precomputed',
        coordinates=my_distance_matrix,
    ),
)
```
If support is instead distances, things look confusing to me:
```python
correlogram(
    gdf.geometry,
    gdf.variable,
    distances=[0, 1, 2, 3],  # why is this referring to specific values of...
    statistic='lowess',
    stat_kwargs=dict(
        metric='precomputed',
        coordinates=my_distance_matrix,  # ...this distance matrix, which is "coordinates"?
    ),
)
```
this all makes sense to me. Thanks @ljwolf !
@ljwolf thanks for the lowess version, looks great. Should we also implement/document that this version works a little differently from GeoDa, which, apart from using lowess, also seems to work on an expanding donut (i.e. it trims lower-bound distances as it moves outward)?
This does not use a donut. It's using the same lowess regression as in that section, estimating using only data near each support value.
But I think we must set the lowess frac parameter in a data-dependent fashion to match their behavior: it should depend on the data rather than being constant. Setting it to 1/n_bins/2 would estimate using only the data closest to each bin center.
To explain more fully:
the _lowess_correlogram() part estimates a lowess regression on the correlation and distance:
z_i*z_j = f(d_{ij}) + u_{ij}
This is the equation listed in the GeoDa documentation. Unfortunately, we have to parameterise this a little differently from GeoDa no matter what we do, unless we want to implement our own lowess regressor or simply use the average (unregularized) correlation in each bin.
statsmodels lowess lets us specify the points at which the lowess is calculated (xvals) and what fraction of the data should be used around each point (frac). This second part is because their approach uses a (novel to me!) interval knn-based method to track points near xvals.
Hence, we cannot quite do lowess over bins exactly like GeoDa; the bins must have dynamic width (in terms of distance) and represent a fixed fraction of the data. This has tradeoffs. So, I'm suggesting we deviate from GeoDa's parameterization and make the following tradeoffs:
- set the end of the support to half the longest bounding box diagonal by default. This allows us to ensure coverage for both parametric and nonparametric calculations without requiring us to build the full distance matrix up front. But, it's often longer than we want to visualize, since correlations often quickly go to zero.
- use only the upper triangle of the covariance and distance matrices if the distance matrix is symmetric. GeoDa docs suggest using the full matrix is not always the best choice, and that they use it to avoid having bins with too few observations. Given `statsmodels` lowess uses a `knn` local regression, we don't need to worry about having too few observations at each `xval`, and the duplication just slows the routine down. If the distance matrix is not symmetric (as might happen if `metric='precomputed'`), we use the full matrix anyway.
- set the lowess `frac` according to the typical fraction of data in each bin. Here, note that `support` won't always span all distances. This minimizes the re-use of points (with no reuse when bins are constant width and span a subset of the data). (a) calculate the implied bin widths from the user-supplied `support`, (b) calculate the width of the first and last bin, (c) look down/up half the first/last bin width from the lowest/highest support value (bounded by zero on the left), and (d) calculate the fraction of distances that fall within this span: `frac_span`, the fraction of point pairs spanned by the bins implied by the support values. (e) Then, set the lowess `frac` to `frac_span/n_bins`.

For evenly spaced bins, this will assign each observation to its nearest support value, even if `support.max() + 1/2*bin_width` leaves off some pairs. For unevenly spaced support, this will recycle observations across bins. I don't think we can get around this with statsmodels' lowess.
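Steps (a)-(e) can be sketched as a small helper; `lowess_frac` is a hypothetical name, and the toy distances below are only there to exercise it:

```python
import numpy as np

def lowess_frac(support, pair_distances):
    support = np.sort(np.asarray(support, dtype=float))
    widths = np.diff(support)                 # (a) implied bin widths
    first_w, last_w = widths[0], widths[-1]   # (b) first/last bin width
    lo = max(support[0] - first_w / 2, 0.0)   # (c) look down, bounded at zero,
    hi = support[-1] + last_w / 2             #     and up from the extremes
    in_span = (pair_distances >= lo) & (pair_distances <= hi)
    frac_span = in_span.mean()                # (d) fraction of pairs spanned
    return frac_span / len(support)           # (e) frac = frac_span / n_bins

d = np.linspace(0, 10, 1001)                  # toy pairwise distances
frac = lowess_frac([2, 4, 6, 8], d)           # span is [1, 9], so ~80% of pairs
```

For this evenly spaced support, roughly 80% of the pairs fall in the span, split across 4 bins, so `frac` comes out near 0.2, i.e. each local fit uses about one bin's worth of data.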
This now has full tests, and the nonparametric version is in the notebook. If we're fine with the defaults, I think this is ready to ship.
~~Test failure looks to be a network error. I'm not sure how to re-run this?~~ nvm, rerunning now.
teamwork ftw