[WIP] Extend KNN neighbor search beyond coincident sites
This is aimed at https://github.com/pysal/pysal/pull/1060.
If the approach makes sense, it can be extended to Kernel weights and DistanceBand weights. The latter will need slightly different handling for symmetry.
Codecov Report
Merging #287 into master will increase coverage by 0.19%. The diff coverage is 100.00%.
```diff
@@            Coverage Diff             @@
##           master     #287      +/-   ##
==========================================
+ Coverage   80.99%   81.19%    +0.19%
==========================================
  Files         115      115
  Lines       11580    11713      +133
==========================================
+ Hits         9379     9510      +131
- Misses       2201     2203        +2
```
| Impacted Files | Coverage Δ |
|---|---|
| libpysal/weights/distance.py | 86.71% <100.00%> (+1.67%) :arrow_up: |
| libpysal/weights/tests/test_distance.py | 97.18% <100.00%> (+0.09%) :arrow_up: |
| libpysal/cg/alpha_shapes.py | 85.00% <0.00%> (-1.67%) :arrow_down: |
| libpysal/weights/tests/test_weights.py | 99.65% <0.00%> (+<0.01%) :arrow_up: |
| libpysal/weights/tests/test_util.py | 98.15% <0.00%> (+0.01%) :arrow_up: |
| libpysal/weights/weights.py | 78.92% <0.00%> (+0.90%) :arrow_up: |
| libpysal/cg/tests/test_ashapes.py | 96.49% <0.00%> (+1.03%) :arrow_up: |
| libpysal/weights/util.py | 77.13% <0.00%> (+1.54%) :arrow_up: |
I would super prefer #285... is there something wrong with #285?

EDIT: Yes, I (a) forgot to deal with ids and (b) tested on a scenario where the number of coincident points was usually quite small relative to `k`, so you always got the self-index in the output. This is now resolved.
OK, to be clear: this does more than simply ensure that coincident points are assigned valid (coincident) neighbors; it forces the neighbors of coincident points to be outside of the coincident set, correct?
This method either ignores or fails on ids:

```python
import libpysal, numpy, uuid

coordinates = numpy.random.random(size=(5, 2))
coordinates = numpy.row_stack([coordinates] * 5)  # 25 points, 5 per site
ids = [uuid.uuid4().hex for _ in range(len(coordinates))]

# fails, since we pass ids as id_order to resolve sorting issues from the dataframe
w = libpysal.weights.KNN(coordinates, id_order=ids)

# ignores the ids because of libpysal#284
w = libpysal.weights.KNN(coordinates, ids=ids)
```
To be fair, I think that's not really this issue's fault, but we'd need to address it.
I wanted to add a `remove_coincident=True` argument that allows users to keep the coincident points but safely remove the self-neighbor. But I can't really figure out how to get the `ids`/`id_order` stuff fixed without adding further changes.
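For what it's worth, a minimal sketch of what such a `remove_coincident` behavior could look like, written directly against scipy's `cKDTree` rather than libpysal internals (the function name and the post-processing are my illustration, not this PR's implementation):

```python
import numpy
from scipy.spatial import cKDTree

def knn_drop_self(coordinates, k=3):
    """Keep coincident points as valid neighbors, but never return
    a point as its own neighbor."""
    tree = cKDTree(coordinates)
    # ask for one extra candidate so the query point itself can be discarded
    _, idx = tree.query(coordinates, k=k + 1)
    neighbors = {}
    for i, row in enumerate(idx):
        # among coincident points the self index is not always first,
        # so filter it explicitly
        others = [int(j) for j in row if j != i]
        neighbors[i] = others[:k]
    return neighbors

coordinates = numpy.random.random(size=(5, 2))
coordinates = numpy.vstack([coordinates] * 5)  # 25 points, 5 per site
print(knn_drop_self(coordinates, k=3))
```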
Just so I can understand, what happens now if you have different apartment units in a building (same coordinates)? Would they be considered neighbors of each other or will they be assigned neighbors in other buildings?
> I would super prefer #285... is there something wrong with #285?
>
> EDIT: Yes, I (a) forgot to deal with ids and (b) tested on a scenario where the number of coincident points was usually quite small relative to `k`, so you always got the self-index in the output.

My bad: I didn't see #285 until after I did this.
> OK, to be clear: this does more than simply ensure that coincident points are assigned valid (coincident) neighbors; it forces the neighbors of coincident points to be outside of the coincident set, correct?
Yes, that was the intention.
> Just so I can understand, what happens now if you have different apartment units in a building (same coordinates)? Would they be considered neighbors of each other or will they be assigned neighbors in other buildings?
This would have them assigned neighbors with non-zero distances.
There are (at least) two cases where this happens: the one you point out, where the data lacks sufficient spatial information to disambiguate units in the same apartment building, and repeat sales data. In the latter, the data may have a temporal attribute that can be used to differentiate the records.
> the data may have a temporal attribute that can be used to differentiate the records
Ah, brilliant! I never even thought that you could just send `(n, 3)` arrays into the constructor!

That'd be very strongly sensitive to scaling, right? For data measured in UTM at a monthly timeframe, your inter-temporal neighbors are much "closer" than your spatial neighbors, right? UTM coordinates are typically in thousands of meters, whereas months will be 0...11.
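To make the scaling point concrete, a small sketch with invented numbers: the month axis is negligible next to metre-scale UTM coordinates until the columns are rescaled (standardization here is just one possible choice):

```python
import numpy

rng = numpy.random.default_rng(0)
n = 100
easting = rng.uniform(500_000, 510_000, size=n)      # UTM metres
northing = rng.uniform(4_000_000, 4_010_000, size=n)
month = rng.integers(0, 12, size=n)                  # 0..11

xyt = numpy.column_stack([easting, northing, month])

# raw spreads: the temporal axis barely contributes to Euclidean distance
print(xyt.std(axis=0))

# standardize each column before the KNN search so all axes are comparable
z = (xyt - xyt.mean(axis=0)) / xyt.std(axis=0)
print(z.std(axis=0))
```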
I think I'd want to see:

- [ ] an option to force neighbors back to being drawn from the coincident set
- [ ] ids to be handled
- [ ] use of the numpy-oriented fix in #285 instead of the for-looping (a sketch of that style follows below)
I also don't see how this needs to be ported to kernel/distance band weights, save for estimating the bandwidth of a kernel?
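For reference, the numpy-oriented style mentioned in the checklist could look something like the following sketch (my illustration of the idea, not the code in #285 or in this PR): collapse coincident sites with `numpy.unique`, search among the unique locations, then map each observation to the members of its neighboring sites:

```python
import numpy
from scipy.spatial import cKDTree

def knn_outside_coincident(coordinates, k=3):
    """Neighbors are drawn from sites other than a point's own,
    so coincident points are never neighbors of each other."""
    unique, inverse = numpy.unique(coordinates, axis=0, return_inverse=True)
    inverse = inverse.ravel()  # guard against numpy versions returning 2-D
    tree = cKDTree(unique)
    # unique sites are distinct, so the nearest hit is always the site itself
    _, site_idx = tree.query(unique, k=k + 1)
    members = [numpy.flatnonzero(inverse == s) for s in range(len(unique))]
    neighbors = {}
    for i, s in enumerate(inverse):
        candidates = []
        for t in site_idx[s]:
            if t == s:
                continue  # skip the point's own (coincident) site
            candidates.extend(int(m) for m in members[t])
        neighbors[i] = candidates[:k]
    return neighbors
```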
> > the data may have a temporal attribute that can be used to differentiate the records
>
> Ah, brilliant! I never even thought that you could just send `(n, 3)` arrays into the constructor! That'd be very strongly sensitive to scaling, right? For data measured in UTM at a monthly timeframe, your inter-temporal neighbors are much "closer" than your spatial neighbors, right? UTM coordinates are typically in thousands of meters, whereas months will be 0...11.
Just to be clear, I wasn't thinking that they would be passing in `(n, 3)` as the data for the KNN search, with one of the three being the temporal coordinate. That would raise all of the scaling issues that you point out.
The motivation for the approach in the PR was that coincident spatial points violate the law (I think it was Keith Clarke's) that no two events can occupy the same point at the same time. In data sets where a temporal attribute is missing, the coincident points imply this happens.
Maybe the suggested approach should be an option, rather than the default. Another option would be to jitter the coordinates of the coincident points prior to the search, but that also comes with some possible side effects.
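A minimal sketch of that jittering idea, assuming a displacement far below the data's measurement precision (the `scale` value here is an arbitrary placeholder):

```python
import numpy

def jitter_coincident(coordinates, scale=1e-9, seed=0):
    """Displace only the rows that share coordinates with another row,
    making every point unique before the KNN search."""
    coordinates = numpy.array(coordinates, dtype=float)
    rng = numpy.random.default_rng(seed)
    _, inverse, counts = numpy.unique(
        coordinates, axis=0, return_inverse=True, return_counts=True
    )
    duplicated = counts[inverse.ravel()] > 1
    coordinates[duplicated] += rng.normal(
        0.0, scale, size=(duplicated.sum(), coordinates.shape[1])
    )
    return coordinates
```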
> Maybe the suggested approach should be an option, rather than the default.

Yes, I think that's appropriate. I was trying to add a `remove_coincident=False` above and was running into the `ids`/`id_order` issues.
> Another option would be to jitter the coordinates of the coincident points prior to the search

Sure, that'd be reasonable. I think the solution implemented in #285, though, is just fine, and better than reducing the precision of the data through jittering.
I'd like to merge #285 to fix the problem, and then return to this as an `ignore_coincident` option for the constructor...