API: Change the handling of IDs in from_dataframe constructors
As a follow-up to the discussion we had yesterday during the dev call, I am starting this PR more as a place for a discussion on top of code rather than an actual code change.
I have created a from_dataframe_new method to match the ideal outcome of the transition from ids, idVariable and id_order to a cleaner API. What needs to be figured out is a transitioning phase.
Looking at the code, I also think that @ljwolf's suggestion to deprecate id_order should be done at the same time as the rest of the changes as these are quite closely linked together.
There is one question we haven't discussed, though. What shall we do with the actual dataframe index? We currently completely ignore it, and when no ids are passed we use positional indexing to encode weights. But the most sensible default would imho be to use the dataframe index instead. However, this is quite a big breaking change, so I am not sure what would be best on that front. edit: the docstring now says "If nothing is provided, the dataframe index is used", but the reality does not follow that.
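To make the distinction concrete, here is a minimal sketch of the two id conventions using a toy dataframe (illustration only, not libpysal code; `positional_ids` and `index_ids` are names made up here):

```python
import pandas as pd

# A dataframe whose index is not the default 0..n-1 RangeIndex
df = pd.DataFrame({"value": [10, 20, 30]}, index=["a", "b", "c"])

# Current behaviour when no ids are passed: weights are keyed positionally,
# ignoring df.index entirely
positional_ids = list(range(len(df)))

# Proposed default: key the weights by the dataframe index instead
index_ids = df.index.tolist()

print(positional_ids)  # [0, 1, 2]
print(index_ids)       # ['a', 'b', 'c']
```

The two conventions only coincide when the dataframe happens to carry a default RangeIndex, which is why switching the default is a breaking change.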
The API change proposed here on the example of Rook should then be mirrored to all relevant from_dataframe methods. In some cases (KNN, DistanceBand), it involves nearly no change, as ids already behaves as we want and there is no idVariable. So the main goal here is to ensure the API is consistent.
Relevant discussions: #473, #183, #184 (which does essentially the same thing as this PR), #102
Codecov Report
Merging #477 (c796673) into master (d291e01) will increase coverage by 0.0%. The diff coverage is 69.9%.
@@            Coverage Diff            @@
##           master     #477     +/-   ##
=========================================
  Coverage    78.8%    78.8%
=========================================
  Files         122      122
  Lines       13100    13161     +61
=========================================
+ Hits        10320    10371     +51
- Misses       2780     2790     +10
| Impacted Files | Coverage Δ | |
|---|---|---|
| libpysal/weights/contiguity.py | 77.2% <64.4%> (-1.9%) | :arrow_down: |
| libpysal/weights/gabriel.py | 47.1% <65.2%> (+3.3%) | :arrow_up: |
| libpysal/weights/distance.py | 85.4% <77.8%> (ø) | |
| libpysal/weights/tests/test_contiguity.py | 95.1% <100.0%> (+0.3%) | :arrow_up: |
| libpysal/_version.py | 40.7% <0.0%> (+2.7%) | :arrow_up: |
I have created a from_dataframe_new method to match the ideal outcome of the transition from ids, idVariable and id_order to a cleaner API. What needs to be figured out is a transitioning phase.
So the plan would be to deprecate idVariable and id_order, while refactoring ids for a new api, correct?
If yes, I can start exploring deprecation.
On a related note, turning off the sorting of the ids on init can be done. I explored this in https://github.com/pysal/libpysal/commit/13c89b68038b5cf2b4bcf97abe1f14162b947ac0 and two tests broke in all of weights; both had to do with block_weights. I updated the doctests and unit tests to correct for the case of turning off sort on init in https://github.com/pysal/libpysal/commit/fc0f8b6ee2545d73a3727c14f06ba4d571838ce5. So removing the sort-on-init behavior could be done in the no-sort branch.
So the plan would be to deprecate idVariable and id_order, while refactoring ids for a new api, correct?
Yes. And figure out what to do with positional index vs the dataframe index of when nothing else is passed.
Since 3.6, dicts preserve key insertion order, so the positional index for our w would be w.neighbors.keys(). Prior to 3.6 we had to do all the id_sort business to support alignment between w and the arrays they would be applied to (in spatial lags).
So one thought is that the new w.ids could be a property with a getter that returns w.neighbors.keys().
Users could then sort the array to align with the w.ids order and pass the sorted array into lag_spatial, or we provide a way to reorder w.ids and pass that into the lag_spatial function with the original (unsorted) array.
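A minimal sketch of that property idea, assuming a stripped-down W-like class (`MiniW` is illustrative only, not libpysal's actual implementation):

```python
class MiniW:
    """Stripped-down stand-in for libpysal's W, for illustration only."""

    def __init__(self, neighbors):
        # dicts preserve insertion order since Python 3.6, so the key order
        # of `neighbors` doubles as the positional order of the weights
        self.neighbors = neighbors

    @property
    def ids(self):
        # read-only getter returning ids in insertion (not sorted) order
        return list(self.neighbors.keys())


w = MiniW({"b": ["a"], "a": ["b", "c"], "c": ["a"]})
print(w.ids)  # ['b', 'a', 'c'] -- insertion order, not sorted
```

With such a property, users would reorder their data arrays to match w.ids before applying a spatial lag, rather than relying on the weights being sorted at construction time.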
The timing for the changes might be something like this:
For meta release 23.01 (January 2023), deprecation warnings are added for the old id-related API.
We then target 23.07 (the July release) for the new API.
Maintaining two APIs seems like a painful route, so the above is suggested as a way to avoid this.
What do devs think?
Not sure I fully understand, especially this:
Maintaining two APIs seems like a painful route, so the above is suggested as a way to avoid this.
I agree that we should not have two APIs, but we need a transition period. We cannot just raise a warning in January and make the change in June. Everything breaks in June if you don't give users a way of adjusting to the new API in January.
What I am trying to understand is how we go about this. Given the nature of the changes we are proposing, I can't (yet) see a way where, say: [1] we have deprecation warnings around ids and id_order which would permit legacy code to work but alert the user to upcoming changes; and [2] we also allow for the new functionality of ids in order to point the user to the functionality to adopt.
It's easier if we have a mapping of old_property -> new_property with different names. For ids we don't have this to leverage, as it's ids -> ids.
Maybe this suggests a different name for the property/method in the new API?
The old name is ids; the new ones could maybe be something like iloc and loc?
The other problem is that if we deprecate idVariable and the sorting of ids on init, then that functionality has to coexist peacefully with the new API (which we agree should not sort on init).
Fleshing all this out here is really helpful :->
We don't have to deprecate anything around ids; we just make that variable smarter, to consume either a string representing the column name or any array-like of matching length. That is fully backwards compatible and can be done immediately.
At the same time, we can raise a deprecation warning on idVariable but temporarily pass its value to the enhanced ids.
id_order can now raise a warning and, since we want to remove it without a replacement, that is it.
One thing I still don't know how to approach is the issue of the index vs the positional index. We currently claim in the docstring that the index is used automatically, but it is not. If we just start doing that instead of using the position (iloc), it may break some things (e.g. it would break small bits in momepy). So I would say we should add a new keyword to control that behaviour.
Something like this
@classmethod
def from_dataframe(
    cls, df, geom_col=None, idVariable=None, ids=None, id_order=None, use_index=None, **kwargs
):
    """
    Construct a weights object from a pandas dataframe with a geometry
    column. This will cast the polygons to PySAL polygons, then build the W
    using ids from the dataframe.

    Parameters
    ----------
    df : DataFrame
        a :class:`pandas.DataFrame` containing geometries to use
        for spatial weights
    geom_col : string
        the name of the column in `df` that contains the
        geometries. Defaults to the active geometry column.
    idVariable : string
        the name of the column to use as IDs. If nothing is
        provided, the dataframe index is used. DEPRECATED - use ids
    ids : list
        a list of ids to use to index the spatial weights object.
        Order is not respected from this list.
    id_order : list
        an ordered list of ids to use to index the spatial weights
        object. If used, the resulting weights object will iterate
        over results in the order of the names provided in this
        argument. DEPRECATED
    use_index : bool
        use the index of `df` as ids to index the spatial weights object

    See Also
    --------
    :class:`libpysal.weights.weights.W`
    :class:`libpysal.weights.contiguity.Rook`
    """
    # assumes `import warnings` at module level
    if geom_col is None:
        geom_col = df.geometry.name
    if id_order is not None:
        warnings.warn("id_order is deprecated, don't use it")
        if id_order is True and ((idVariable is not None) or (ids is not None)):
            # if idVariable is None, we want ids. Otherwise, we want the
            # idVariable column
            id_order = list(df.get(idVariable, ids))
        else:
            id_order = df.get(id_order, ids)
    if idVariable is not None:
        if ids is None:
            warnings.warn("idVariable is deprecated, use ids")
            ids = idVariable
        else:
            warnings.warn("both idVariable and ids passed, using ids")
    if ids is None:
        if use_index is None:
            warnings.warn(
                "use_index defaults to False but will default to True in "
                "future. Set True/False directly to control this behaviour "
                "and silence this warning"
            )
            use_index = False
        if use_index:
            ids = df.index.tolist()
    if isinstance(ids, str):
        ids = df[ids]
    if ids is not None and not isinstance(ids, list):
        ids = ids.tolist()
    return cls.from_iterable(
        df[geom_col].tolist(), ids=ids, id_order=id_order, **kwargs
    )
This already supports all we want in the future but is backwards compatible and provides a deprecation period for transition.
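The idVariable -> ids handoff at the heart of the snippet above can be exercised in isolation; `resolve_ids` below is a hypothetical standalone extraction of that logic, not the actual libpysal implementation:

```python
import warnings


def resolve_ids(idVariable=None, ids=None):
    """Map the deprecated idVariable keyword onto ids, warning as needed."""
    if idVariable is not None:
        if ids is None:
            warnings.warn("idVariable is deprecated, use ids", FutureWarning)
            ids = idVariable
        else:
            warnings.warn("both idVariable and ids passed, using ids", FutureWarning)
    return ids


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    assert resolve_ids(idVariable="FIPS") == "FIPS"                # legacy keyword still works
    assert resolve_ids(idVariable="FIPS", ids="GEOID") == "GEOID"  # ids wins
    assert resolve_ids(ids="GEOID") == "GEOID"                     # no warning here
    assert len(caught) == 2
```

Legacy calls keep working with a warning, new-style calls are silent, and conflicting calls resolve deterministically in favor of ids, which is exactly the transition-period behavior discussed above.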
I have updated from_dataframe in Rook and Queen, adding the deprecation and ensuring everything is controllable (use of index or not) and matches the docstring. I have also updated from_dataframe in KNN, Kernel, and DistanceBand by adding the use_index keyword to ensure consistency between all methods.
All changes are backwards compatible. In contiguity, we will later want to change the default of use_index to True to match the default in distance weights (which has already been in use).
@ljwolf how are Delaunay weights indexed? It would be good to have the same options there if possible.
They only take input, and do no id processing iirc, so they create a range index. They're in weights.gabriel
Okay, I added the same options to handle index to Delaunay.from_dataframe as well. The default on use_index will need to change there as well, the situation is the same as we have in contiguity right now.
OK, still some odd behavior around weights ordering here, akin to the issues I outlined in the test bench from #184. I think that part of this fix has to be stopping sorting ids by default.
An example using FIPS codes from the south example data:
import geopandas
from libpysal import weights, examples
examples.load_example("south.shp") # I can't locate south if I don't load it first, even if it's downloaded?
south = geopandas.read_file(examples.get_path("south.shp"))
south_shuffled = south.sample(frac=1, replace=False)
print(weights.Rook.from_dataframe(south, ids="FIPS").id_order[:5])
['01001', '01003', '01005', '01007', '01009']
print(weights.DistanceBand.from_dataframe(south, threshold=2, ids="FIPS").id_order[:5])
['54029', '54009', '54069', '54051', '10003']
Sorting of the ids is happening for contiguity weights, but not distance weights. And passing ids isn't working at all for the Delaunay-family weights:
print(weights.Delaunay.from_dataframe(south.set_geometry(south.centroid), ids="FIPS").id_order[:5])
[0, 1, 2, 3, 4, 5]
print(weights.Gabriel.from_dataframe(south.set_geometry(south.centroid), ids="FIPS").id_order[:5])
[0, 1, 2, 3, 4, 5]
And this fails entirely, because the columns aren't correctly propagated through to the Rook.from_dataframe() call:
weights.Voronoi(south.set_geometry(south.centroid), ids='FIPS')
@ljwolf whoops. I'll have a look at Delaunay/Voronoi stuff later unless you want to commit some changes yourself (which may be faster).
some odd behavior around weights ordering here
Can we solve the ordering issue in another PR and have this focused on passing the ids through? Since this hasn't changed from what we have on master.
Sure! I'm drafting a fix for these inconsistencies now, will send
I've sent a PR upstream (martinfleis/libpysal#35), since I'm not sure if this counts as "fixing" sorting by default.... What I'm referring to is this:
.from_dataframe() methods should always provide weights that are aligned to the input dataframe.
The current state of things is:
- contiguity weights' .from_dataframe() will break the ordering
- distance weights' .from_dataframe() does not break the ordering
- with martinfleis/libpysal#35, Delaunay-based weights will not break the ordering
I'd encourage us to not break order in from_dataframe(), which means that we consider the input to from_dataframe as having a "set" id_order, rather than sorting it.
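The invariant can be stated as a tiny sketch; `build_ids` is a hypothetical stand-in for the id-extraction step, not libpysal code:

```python
def build_ids(rows, sort=False):
    # Extract ids in input order; `sort=True` mimics the legacy
    # sort-on-init behaviour we want to drop
    ids = [r["id"] for r in rows]
    return sorted(ids) if sort else ids


shuffled = [{"id": "54029"}, {"id": "01001"}, {"id": "10003"}]
print(build_ids(shuffled))             # ['54029', '01001', '10003'] -- input order kept
print(build_ids(shuffled, sort=True))  # ['01001', '10003', '54029'] -- legacy sorting
```

Treating the input order as the "set" id_order means the first behaviour, so weights built from a shuffled dataframe stay aligned with that dataframe's rows.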
The order should now be preserved for contiguity as well.
I think I'll add some tests for this before merging.
Should be ready now.
@martinfleis should we merge this?
Unless there are any other comments it is ready to merge, yes.