[WIP] add spatial correlogram function
add first draft of correlogram
supersedes #253
there's nothing harder than pulling off a successful rebase. I'll die on this hill.
This should be just about good to go, but it's still failing for stuff like join counts, because the class stashes all sorts of things as attributes (like the W and its adjacency list) that can't get serialized by joblib, and for LOSH/Spatial_Pearson, which have a different signature that needs w.sparse.
Smart. The issue I was running into was that w.to_adjlist sorts the neighbors, so once you convert and try to subset the W, it and the df are no longer aligned, which is a mess.
--
Elijah Knaap, Ph.D.
Senior Research Scientist & Associate Director
Center for Open Geographical Science, San Diego State University
knaaptime.com | @knaaptime

On Aug 6, 2023 at 1:47 AM -0700, Martin Fleischmann wrote:
@martinfleis commented on this pull request. In esda/correlogram.py:
```python
with NotImplementedError as e:
    raise e("Only I, G, and C statisticsics are currently implemented")
if distance_type == "band":
    W = DistanceBand
elif distance_type == "knn":
    if max(distances) > gdf.shape[0] - 1:
        with ValueError as e:
            raise e("max number of neighbors must be less than or equal to n-1")
    W = KNN
else:
    with NotImplementedError as e:
        raise e("distance_type must be either `band` or `knn`")
```

should be able to build the tree once and reuse it?
But in practice, I'm not seeing any real difference from starting a new W from scratch each time.
To keep the structure, you can just assign weight=0 to everything above the set distance threshold.
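A minimal sketch of that suggestion, outside of libpysal: compute the pairwise distance matrix once, then derive each distance-band W by zeroing entries above the threshold, so the (n, n) structure is identical at every threshold. All names here are illustrative, not esda's API.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# toy coordinates; in the correlogram, this would be the gdf's geometry
coords = np.random.default_rng(0).random((30, 2))

d = squareform(pdist(coords))   # full pairwise distance matrix, built once
np.fill_diagonal(d, np.inf)     # exclude self-neighbors

def band_weights(threshold):
    # same (n, n) structure at every threshold; weights beyond it are just 0
    return (d <= threshold).astype(float)

w_small = band_weights(0.2)
w_large = band_weights(0.4)
```

Because the structure never changes, the set of nonzero entries can only grow with the threshold, which is what keeps subsetting aligned with the dataframe.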
Codecov Report
:x: Patch coverage is 77.89474% with 21 lines in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 82.7%. Comparing base (051c715) to head (45d19e1).
:warning: Report is 2 commits behind head on main.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| esda/correlogram.py | 77.7% | 21 Missing :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## main #259 +/- ##
=======================================
+ Coverage 82.0% 82.7% +0.6%
=======================================
Files 25 27 +2
Lines 3538 3825 +287
=======================================
+ Hits 2902 3162 +260
- Misses 636 663 +27
| Files with missing lines | Coverage Δ | |
|---|---|---|
| esda/__init__.py | 100.0% <100.0%> (ø) | |
| esda/correlogram.py | 77.7% <77.7%> (ø) | |
what other tests do folks think are necessary for this?
I think at least:
- verify that the documented association measures work as expected
- verify that a simple callable completes its computation
Then I think it's done!
Cool. I think 1 is covered by the existing test using I? Just need to add a generic callable test for 2?
> covered by the test using I
Ah ok, in light of the previous discussion, fine! I meant that we should test all of the estimators we specify to ensure correctness.
Yes on the generic callable test!
Then, I think the last bit is just the test failures?
I'm trying to start a re-run of these tests, but I think you've protected your remote @knaaptime? I made @jGaboardi's whitespace edit, and tried the following:
I think there's something weird with my GitHub at the moment. All my review comments say 'pending' too. Let me see.
I've refreshed this
OK, tests pass. It would be good for @martinfleis and/or @jGaboardi to confirm whether their comments have been addressed. Please mark your reviews as resolved if so. This would be good to ship.
I think safe to merge.
I suppose this can be merged, though there are two comments I have.
- I am not excited about new functionality being built on top of W instead of Graph, especially when all of the esda statistics this consumes are Graph-native.
- The notebook could use some care. At least the empty cells could be removed.
@knaaptime I've added the nonparametric correlogram. As I understand it, it's lowess() predicting z*z.T using d. I think a similar insight powers something else (Sec. 2.2) landing in pysal soon from @weikang9009 and me. I'm using statsmodels to do this.
Looking into the W vs. Graph choice here, one challenge in moving this to the Graph object is that W allowed users to pass a pre-constructed KDTree and simply request a new W from this prebuilt tree, but Graph does not. This is what makes the correlogram() function performant. If we adjust Graph to allow build_distance_band/build_knn to accept a KDTree input, then the migration path is clearer.
For now, I can build the slower wrapper, but I think this should be discussed.
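To illustrate why the prebuilt-tree pattern matters, here is a sketch using scipy's KDTree directly: the tree is built once and then queried per support value, for both knn and distance-band neighbor structures. The function names are hypothetical, not esda's or libpysal's API.

```python
import numpy as np
from scipy.spatial import KDTree

# toy point set; in the correlogram this comes from the geometries
coords = np.random.default_rng(0).random((100, 2))

tree = KDTree(coords)  # built once, queried for every support value

def knn_neighbors(k):
    # nearest neighbors from the shared tree; column 0 is the point itself
    _, idx = tree.query(coords, k=k + 1)
    return idx[:, 1:]

def band_neighbors(threshold):
    # neighbors within `threshold`, again from the same tree
    return tree.query_ball_point(coords, r=threshold)

knn5 = knn_neighbors(5)
band = band_neighbors(0.15)
```

Rebuilding the tree at every support value repeats the O(n log n) construction each time; with a shared tree, each step is only the query.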
Okay, let's keep it as is and make a required change in Graph to allow this.
Having my hands in this code now, I'd suggest the following argument renames:
- `gdf` should become `geometry`. We can always accept a `GeoDataFrame` and get `geometry.geometry` when needed, but users should also be able to pass `gdf.geometry` for explicitness.
- `variable` should be a series or array, not a string. This matches the API elsewhere in `esda`. This came from `segregation`, which is mostly `gdf`/`str` based, right?
- `distances` should be `support`. This will mimic other places (like in `pointpats`) where we use `distances` to refer to a pairwise distance matrix and `support` for the values at which a function is evaluated. Further, when `distance_type=='knn'`, `distances` are not distances. Finally, this will also make it clearer for users to estimate a lowess on precomputed distances:
```python
correlogram(
    gdf.geometry,
    gdf.variable,
    support=[0, 1, 2, 3],
    statistic='lowess',
    stat_kwargs=dict(
        metric='precomputed',
        coordinates=my_distance_matrix,
    ),
)
```
If support is instead distances, things look confusing to me:
```python
correlogram(
    gdf.geometry,
    gdf.variable,
    distances=[0, 1, 2, 3],  # why is this referring to specific values of...
    statistic='lowess',
    stat_kwargs=dict(
        metric='precomputed',
        coordinates=my_distance_matrix,  # ...this distance matrix, which is "coordinates"?
    ),
)
```
this all makes sense to me. Thanks @ljwolf !
@ljwolf thanks for the lowess version, looks great. Should we also implement/document that this version works a little differently from GeoDa, which, apart from using lowess, also seems to work on an expanding donut (i.e. it trims lower-bound distances as it moves outward)?
This does not use a donut. It's using the same lowess regression as in that section, estimating using only data near each support value.
But I think we must set the lowess frac parameter in a data-dependent fashion to match their behavior: it should depend on the data rather than being constant. Setting it to 1/n_bins/2 would estimate using only the data closest to each bin center.
To explain more fully:
the _lowess_correlogram() part estimates a lowess regression on the correlation and distance:
z_i*z_j = f(d_{ij}) + u_{ij}
This is the equation listed in the GeoDa documentation. Unfortunately, we have to parameterise this a little differently from GeoDa no matter what we do, unless we want to implement our own lowess regressor or simply use the average (unregularized) correlation in each bin.
statsmodels lowess lets us specify the points at which the lowess is calculated (xvals) and what fraction of the data should be used around each point (frac). This second part is because their approach uses a (novel to me!) interval knn-based method to track points near xvals.
Hence, we cannot quite do lowess over bins exactly like GeoDa; the bins must have dynamic width (in terms of distance) and represent a fixed fraction of the data. This has tradeoffs. So, I'm suggesting we deviate from GeoDa's parameterization and make the following tradeoffs:
- set the end of the support to half the longest bounding box diagonal by default. This allows us to ensure coverage for both parametric and nonparametric calculations without requiring us to build the full distance matrix up front. But, it's often longer than we want to visualize, since correlations often quickly go to zero.
- use only the upper triangle of the covariance and distance matrices if the distance matrix is symmetric. GeoDa docs suggest using the full matrix is not always the best choice, and that they use it to avoid having bins with too few observations. Given `statsmodels` lowess uses a `knn` local regression, we don't need to worry about having too few observations at each `xval`, and the duplication just slows the routine down. If the distance matrix is not symmetric (as might happen if `metric='precomputed'`), we use the full matrix anyway.
- set the lowess `frac` according to the typical fraction of data in each bin. Here, note that `support` won't always span all distances. This minimizes the re-use of points (with no reuse when bins are constant width and span a subset of the data). (a) calculate the implied bin widths from the user-supplied `support`, (b) calculate the width of the first and last bin, (c) look down/up half the first/last bin width from the lowest/highest support value (bounded by zero on the left), and (d) calculate the fraction of distances that fall within this span: `frac_span`, the fraction of point pairs spanned by the bins implied by the support values. (e) Then, set the lowess `frac` to `frac_span/n_bins`.

For evenly spaced bins, this will assign each observation to its nearest support value, even if `support.max() + 1/2*bin_width` leaves off some pairs. For unevenly spaced support, this will recycle observations across bins. I don't think we can get around this with statsmodels' lowess.
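Steps (a)-(e) can be sketched as a small helper; `lowess_frac` is a hypothetical name, and the toy distances below are only there to exercise it:

```python
import numpy as np

def lowess_frac(support, pair_distances):
    support = np.sort(np.asarray(support, dtype=float))
    widths = np.diff(support)                 # (a) implied bin widths
    first_w, last_w = widths[0], widths[-1]   # (b) first/last bin width
    lo = max(support[0] - first_w / 2, 0.0)   # (c) look down, bounded at zero,
    hi = support[-1] + last_w / 2             #     and up from the extremes
    in_span = (pair_distances >= lo) & (pair_distances <= hi)
    frac_span = in_span.mean()                # (d) fraction of pairs spanned
    return frac_span / len(support)           # (e) frac = frac_span / n_bins

d = np.linspace(0, 10, 1001)                  # toy pairwise distances
frac = lowess_frac([2, 4, 6, 8], d)           # span is [1, 9], so ~80% of pairs
```

For this evenly spaced support, roughly 80% of the pairs fall in the span, split across 4 bins, so `frac` comes out near 0.2, i.e. each local fit uses about one bin's worth of data.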
This now has full tests, and the nonparametric version is in the notebook. If we're fine with the defaults, I think this is ready to ship.
~~Test failure looks to be a network error. I'm not sure how to re-run this?~~ nvm, rerunning now.
teamwork ftw