
add start of local partial moran statistics

Open ljwolf opened this issue 1 year ago • 14 comments

This adds my partial and auxiliary-regressive local Moran statistics, forthcoming in the Annals. This needs:

  • [x] tests
  • [x] example notebook

ljwolf avatar Jan 26 '24 16:01 ljwolf

Codecov Report

:x: Patch coverage is 96.90722% with 6 lines in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 82.8%. Comparing base (051c715) to head (4f13851).
:warning: Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
esda/moran_local_mv.py 97.3% 5 Missing :warning:
esda/significance.py 85.7% 1 Missing :warning:
Additional details and impacted files

Impacted file tree graph

@@           Coverage Diff           @@
##            main    #279     +/-   ##
=======================================
+ Coverage   82.0%   82.8%   +0.8%     
=======================================
  Files         25      26      +1     
  Lines       3538    3730    +192     
=======================================
+ Hits        2902    3088    +186     
- Misses       636     642      +6     
Files with missing lines Coverage Δ
esda/__init__.py 100.0% <100.0%> (ø)
esda/significance.py 94.1% <85.7%> (-1.5%) :arrow_down:
esda/moran_local_mv.py 97.3% <97.3%> (ø)

codecov[bot] avatar Jan 26 '24 16:01 codecov[bot]

Now that the paper has hit the Annals, I need to get this merged. I've resumed stubbing out the tests, and they should be simple to finalize. I'm hopeful that we can get this released in the next esda release.

I think that, after writing another two classes in this style, it would help to do a similar RFC on an API for esda-style interfaces that we can agree upon for the future. We need to start developing a more consistent interface, with attributes that are less tied to the implementation of each class (p_sim vs. p_norm; I vs. associations_ vs. Gi, etc.).

I hope to have this RFC written for the development meeting next week.

ljwolf avatar Jul 04 '24 11:07 ljwolf

Not sure why the CI is failing on the pyogrio + geodatasets read?

ljwolf avatar Aug 15 '24 16:08 ljwolf

OK, it looks like the geodatasets download is malformed, so the pyogrio or fiona read fails for one or more workers. When that happens, the process fails for test_moran_local_mv.py, and we get the discrepancy between the runners. Maybe a fixture will solve it?

ljwolf avatar Aug 16 '24 11:08 ljwolf

OK, it looks like the geodatasets download is malformed

The issue is that multiple workers are trying to download and save the same file at the same time. We need to pull the data ahead of time, as we do in libpysal: https://github.com/pysal/libpysal/blob/bef1e81aa35c9372e91a142e9fa5c8c0c44c55ee/.github/workflows/unittests.yml#L64-L76

martinfleis avatar Aug 16 '24 11:08 martinfleis
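For reference, a pre-fetch step along the lines of the linked libpysal workflow might look like the sketch below (the dataset name and step layout are illustrative, not the exact ones used in CI):

```yaml
# Hypothetical CI step: download the test data once, before pytest spawns
# parallel workers, so no two workers race to write the same cache file.
- name: Download data files used in tests
  run: |
    python -c "import geodatasets; geodatasets.fetch('geoda guerry')"
```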

I think that's one useful solution! I think I would rather do this as a pytest.fixture, however, since that allows us to specify the scope of the fixture, and for pytest then to understand that scope.

:param scope:
    The scope for which this fixture is shared; one of ``"function"``
    (default), ``"class"``, ``"module"``, ``"package"`` or ``"session"``.

    This parameter may also be a callable which receives ``(fixture_name, config)``
    as parameters, and must return a ``str`` with one of the values mentioned above.

    See :ref:`dynamic scope` in the docs for more information.

Using the module-level fixture, this passes now. We should probably define the datasets you're linking to in libpysal/tests/__init__.py as pytest.fixtures at the package level.

ljwolf avatar Aug 16 '24 11:08 ljwolf

Yeah, that is one option. I am not a big fan of it, as tests with a ton of fixtures are a pain to debug if you want to re-run them in IPython or a notebook, but it is one way of solving this.

martinfleis avatar Aug 16 '24 13:08 martinfleis

OK! I'm fine with that, too. I like fixtures, since they can encapsulate both the parametrize step and can compose/depend correctly across parallel jobs. So, fine to keep what's in libpysal anyway, if we're cool using fixtures here.

ljwolf avatar Aug 16 '24 14:08 ljwolf

So, fine to keep what's in libpysal anyway, if we're cool using fixtures here.

I'm fine with either; I just find working with fixtures a bit opaque occasionally, and a long sequence of them is a real pain to replicate in a notebook.

martinfleis avatar Aug 16 '24 17:08 martinfleis

OK, this should be ready to review in full/merge. The example notebook is complete.

This commits the estimators in the same convention as the existing Moran estimators, but uses a slightly different convention for describing the statistic outputs. I would like to use that as the basis for new classes in esda, and hope to have that by the next dev meeting (again 😄 )

ljwolf avatar Aug 19 '24 12:08 ljwolf

Tests were all passing in 29b231, and the new failures are in unrelated code in topo. Not sure of the origin yet.

ljwolf avatar Aug 19 '24 12:08 ljwolf

Tests were all passing in 29b231, and the new failures are in unrelated code in topo. Not sure of the origin yet.

Incompatibility of shapely with numpy 2.1 released yesterday.

martinfleis avatar Aug 19 '24 12:08 martinfleis

OK, this should be ready to review in full/merge. The example notebook is complete.

This commits the estimators in the same convention as the existing Moran estimators, but uses a slightly different convention for describing the statistic outputs. I would like to use that as the basis for new classes in esda, and hope to have that by the next dev meeting (again 😄 )

I'm looking forward to the discussion. One question is how closely the plan is to track scikit-learn conventions, as the implementation in moran_local_mv.py does not follow scikit-learn's rule that estimated attributes with a trailing underscore (_) are not to be set inside the __init__ method.

A point for discussion at the dev meeting.

This is looking really nice!

sjsrey avatar Aug 19 '24 18:08 sjsrey
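The scikit-learn rule mentioned above can be shown with a minimal sketch (the class and statistic here are hypothetical, not the actual moran_local_mv implementation):

```python
import numpy as np


class SketchEstimator:
    """Minimal sketch of the scikit-learn convention: __init__ stores only
    configuration, and estimated attributes (trailing underscore) are set
    in fit(), never in __init__."""

    def __init__(self, scale=1.0):
        # configuration only: no data, no estimated quantities
        self.scale = scale

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        # estimated attribute: created here, named with a trailing underscore
        self.association_ = self.scale * X.sum()
        return self
```

Under this rule, an unfitted instance simply has no trailing-underscore attributes, which is how scikit-learn tooling distinguishes fitted from unfitted estimators.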

I've updated the API to follow that resource (I think?). Only configuration options are set in the init.

When I mentioned using this PR as a prototype for a new spatial statistics API, I meant exactly this!

One thing I'd like feedback on here is how much data to cache. Here, for example, in the Partial_Moran_Local() class, I'm caching three (n_samples, n_features) matrices (self.D, self.R, self._partials_), plus the (n_samples, n_permutations, n_variables) matrix self._rlmos_.

Generally, sklearn classes avoid caching the input data if possible, but this means I'd need to remove the properties and just directly assign the attributes we want to keep. For naming, I'm thinking:

  • association_/associations_: the global/local measure of spatial association that the statistic expresses.
  • labels_: the "label", if any, for common local statistics. Here, I'm thinking just the quadrant for Moran-style local covariance statistics, or hotspot/coldspot for the Getis-Ord statistics.
  • significance_/significances_: local p-values that are generated according to an option passed in __init__(). These will generally be calculated in the fit() method, if we want to avoid caching data on the object itself.
  • reference_distribution_: if permutation inference is used and the random values are cached, this should be where the simulations are stored.
  • partials_: for any cross-product style statistic, this should record the two vectors that are crossed before calculating a global statistic, or elementwise-multiplied before calculating a local statistic. This is to facilitate visualization later, as well as allow for new labelling in a .transform() method.

Also, I'm not quite sure how to handle the fact that you have one statistic for the global but many for the local. I would like to have one name for both, but it feels wrong to use the plural for the global. I could see using the singular (association_, significance_) for everybody, but then labels_ sticks out as odd to me.

ljwolf avatar Nov 12 '24 10:11 ljwolf
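A sketch of a local statistic using the proposed names might look like the following (everything here is hypothetical, including the placeholder statistic; it only illustrates the attribute convention, not any esda estimator):

```python
import numpy as np


class SketchLocalStatistic:
    """Hypothetical local statistic illustrating the proposed attribute
    names: associations_, labels_, significances_, reference_distribution_."""

    def __init__(self, permutations=99, keep_simulations=False, seed=None):
        # configuration only, per the sklearn convention
        self.permutations = permutations
        self.keep_simulations = keep_simulations
        self.seed = seed

    def fit(self, x):
        x = np.asarray(x, dtype=float)
        rng = np.random.default_rng(self.seed)
        z = x - x.mean()
        # placeholder local measure; a real class would compute e.g. local Moran
        self.associations_ = z * z
        self.labels_ = np.where(z >= 0, "high", "low")
        # permutation-style null: shuffle, recompute, compare elementwise
        sims = np.empty((self.permutations, x.size))
        for i in range(self.permutations):
            zp = rng.permutation(z)
            sims[i] = zp * zp
        larger = (sims >= self.associations_).sum(axis=0)
        self.significances_ = (larger + 1) / (self.permutations + 1)
        # the simulated reference values are only cached on request
        if self.keep_simulations:
            self.reference_distribution_ = sims
        return self
```

Note that under this convention, whether `reference_distribution_` exists at all is a configuration choice, which answers part of the caching question: expensive intermediates become opt-in attributes rather than always-cached properties.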

Hmm... there are YAML parsing errors here. Did I fail to resolve a merge conflict correctly?

ljwolf avatar Sep 16 '25 16:09 ljwolf

Yes, I think I did. Fixing...

ljwolf avatar Sep 16 '25 16:09 ljwolf

This is also now passing, and I've given the notebook a proofread.

Could @knaaptime @jGaboardi @martinfleis re-review?

ljwolf avatar Sep 16 '25 16:09 ljwolf

@ljwolf Did you verify the docs build locally?

jGaboardi avatar Sep 16 '25 17:09 jGaboardi

@jGaboardi I've done so now. We talked about the following in the call:

  • [x] switch to CamelCase
  • [x] proofread/edit notebook
  • [x] move to singular nouns (association_, significance_, reference_distribution_) when these would look strange for univariate statistics, but keep plurals (partials_, labels_) for others
  • [x] add to __init__.py
  • [x] render docs and update gallery

But, I've now started moving the implementation to new significance engine from #281, so I will need to update the tests. Then, this is ready to ship!

ljwolf avatar Sep 18 '25 08:09 ljwolf

@ljwolf ping me after updating tests and I will give a final look as soon as I am able.

jGaboardi avatar Sep 18 '25 12:09 jGaboardi

For some reason, the W and the Graph are not the same! Most tests pass. I'm investigating...

ljwolf avatar Sep 23 '25 15:09 ljwolf

The graphs are not the same because the default in Graph.build_contiguity() switched from Rook to Queen contiguity over the time this PR was developed. Fixed now.

ljwolf avatar Sep 23 '25 19:09 ljwolf

awesome work @ljwolf !

jGaboardi avatar Sep 23 '25 19:09 jGaboardi

OK! should be fine to review now. Thanks for your patience!

ljwolf avatar Sep 23 '25 19:09 ljwolf

Time to merge?

jGaboardi avatar Sep 24 '25 15:09 jGaboardi

Yes please!

ljwolf avatar Sep 24 '25 15:09 ljwolf