libpysal icon indicating copy to clipboard operation
libpysal copied to clipboard

[WIP] Extend KNN neighbor search beyond coincident sites

Open sjsrey opened this issue 4 years ago • 15 comments

This is aimed at https://github.com/pysal/pysal/pull/1060.

If the approach makes sense, it can be extended to Kernel weights and DistanceBand weights. The latter will need slightly different handling for symmetry.

sjsrey avatar May 03 '20 18:05 sjsrey

Codecov Report

Merging #287 into master will increase coverage by 0.19%. The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #287      +/-   ##
==========================================
+ Coverage   80.99%   81.19%   +0.19%     
==========================================
  Files         115      115              
  Lines       11580    11713     +133     
==========================================
+ Hits         9379     9510     +131     
- Misses       2201     2203       +2     
Impacted Files Coverage Δ
libpysal/weights/distance.py 86.71% <100.00%> (+1.67%) :arrow_up:
libpysal/weights/tests/test_distance.py 97.18% <100.00%> (+0.09%) :arrow_up:
libpysal/cg/alpha_shapes.py 85.00% <0.00%> (-1.67%) :arrow_down:
libpysal/weights/tests/test_weights.py 99.65% <0.00%> (+<0.01%) :arrow_up:
libpysal/weights/tests/test_util.py 98.15% <0.00%> (+0.01%) :arrow_up:
libpysal/weights/weights.py 78.92% <0.00%> (+0.90%) :arrow_up:
libpysal/cg/tests/test_ashapes.py 96.49% <0.00%> (+1.03%) :arrow_up:
libpysal/weights/util.py 77.13% <0.00%> (+1.54%) :arrow_up:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 14c7509...523b07f. Read the comment docs.

codecov[bot] avatar May 03 '20 18:05 codecov[bot]

I would super prefer #285... is there something wrong with #285?

EDIT: Yes, I (a) forgot to deal with ids and (b) tested on a scenario where th number of coincident points was usually quite small relative to k, so you always got the self-index in the output. This is now resolved.

ljwolf avatar May 04 '20 11:05 ljwolf

OK, to be clear, this does more than simply ensure that co-incident points are assigned valid (co-incident) neighbors, it forces the neighbors of co-incident points to be outside of the coincident set, correct?

ljwolf avatar May 04 '20 12:05 ljwolf

This method either ignores or fails on ids.

import libpysal, numpy, uuid

coordinates = numpy.random.random(size=(5,2))
coordinates = numpy.row_stack([coordinates]*5)

ids = [uuid.uuid4().hex for _ in range(100)] 

#fails, since we pass ids as id_order to resolve sorting issues from the dataframe
w = libpysal.weights.KNN(coordinates, id_order=ids) 
# ignores the ids because of libpysal#284
w = libpysal.weights.KNN(coordinates, ids=ids) 

tbf I think that's not really this issue's fault, but we'd need to address it.

ljwolf avatar May 04 '20 13:05 ljwolf

I wanted to add a remove_coincident=True argument that allows users to keep the coincident points but safely remove the self-neighbor. But, I can't really figure out how to get the ids/id_order stuff fixed without adding further changes.

ljwolf avatar May 04 '20 13:05 ljwolf

Just so I can understand, what happens now if you have different apartment units in a building (same coordinates)? Would they be considered neighbors of each other or will they be assigned neighbors in other buildings?

pedrovma avatar May 04 '20 13:05 pedrovma

I would super prefer #285... is there something wrong with #285?

EDIT: Yes, I (a) forgot to deal with ids and (b) tested on a scenario where th number of coincident points was usually quite small relative to k, so you always got the self-index in the output.

My bad was I didn't see #285 until after I did this.

sjsrey avatar May 04 '20 13:05 sjsrey

OK, to be clear, this does more than simply ensure that co-incident points are assigned valid (co-incident) neighbors, it forces the neighbors of co-incident points to be outside of the coincident set, correct?

Yes, that was the intention.

sjsrey avatar May 04 '20 13:05 sjsrey

Just so I can understand, what happens now if you have different apartment units in a building (same coordinates)? Would they be considered neighbors of each other or will they be assigned neighbors in other buildings?

This would have them assigned neighbors with non-zero distances.

There are two (at least) cases where this happens - the one you point out where the issue is the data lacks sufficient spatial information to spatially disambiguate units in the same apartment building, and the other is with repeat sales data. In the latter, the data may have a temporal attribute that can be used to differentiate the records.

sjsrey avatar May 04 '20 13:05 sjsrey

the data may have a temporal attribute that can be used to differentiate the records

Ah, brilliant! never even thought that you could just send (n,3) arrays into the constructor!

That'd be very strongly sensitive to scaling, right? For data measured in UTM at a monthly timeframe, your inter-temporal neighbors are much "closer" than your spatial neighbors, right? UTM coordinates are typically in thousands of meters, whereas months will be 0...11.

ljwolf avatar May 04 '20 14:05 ljwolf

I think I'd want to see

  • [ ] an option to force neighbors back to being drawn from the coincident set
  • [ ] ids to be handled
  • [ ] use the numpy-oriented fix in #285 instead of the for looping.

I also don't see how this needs to be ported to kernel/distance band weights, save for estimating the bandwidth of a kernel?

ljwolf avatar May 04 '20 14:05 ljwolf

the data may have a temporal attribute that can be used to differentiate the records

Ah, brilliant! never even thought that you could just send (n,3) arrays into the constructor!

That'd be very strongly sensitive to scaling, right? For data measured in UTM at a monthly timeframe, your inter-temporal neighbors are much "closer" than your spatial neighbors, right? UTM coordinates are typically in thousands of meters, whereas months will be 0...11.

Just to be clear, I wasn't thinking that they would be passing in (n,3) as the data for the knn search, with one of the three being the temporal coordinate. That would raise all kinds of scaling issues that you point out.

The motivation for the approach in the PR was that coincident spatial points violate the law (I think it was Keith Clarke's) that no two events can occupy the same point at the same time. In data sets where at temporal attribute is missing, the coincident points imply this happens.

Maybe the suggested approach should be an option, rather than the default. Another option would be to jigger the coordinates of the coincident points prior to the search - but that also comes with some possible side-effects.

sjsrey avatar May 04 '20 14:05 sjsrey

Maybe the suggested approach should be an option, rather than the default.

Yes, I think that's appropriate. I was trying to add a remove_coincident=False above and was running into the ids/id_order issues.

Another option would be to jigger the coordinates of the coincident points prior to the search

Sure, that'd be reasonable. I think the solution implemented in #285 though is just fine, and better than reducing the precision of the data through jittering.

ljwolf avatar May 04 '20 14:05 ljwolf

I'd like to merge #285 to fix the problem, and then return to this as a "ignore_coincident" option for the constructor...

ljwolf avatar Dec 10 '21 13:12 ljwolf