API: Change the handling of IDs in from_dataframe constructors
As a follow-up to the discussion we had yesterday during the dev call, I am starting this PR more as a place for a discussion on top of code rather than an actual code change.
I have created a from_dataframe_new method to match the ideal outcome of the transition from ids, idVariable and id_order to a cleaner API. What needs to be figured out is a transitioning phase.
Looking at the code, I also think that @ljwolf's suggestion to deprecate id_order should be done at the same time as the rest of the changes as these are quite closely linked together.
There is one question we haven't discussed, though. What shall we do with the actual dataframe index? We currently completely ignore it, and when no ids are passed we use positional indexing to encode weights. But the most sensible default would imho be to use the dataframe index instead. However, this is quite a big breaking change, so I am not sure what would be best on that front. edit: the docstring now says "If nothing is provided, the dataframe index is used", but the reality does not follow that.
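To make the distinction concrete, here is a minimal sketch of the two id conventions using a toy dataframe (illustration only, not libpysal code; `positional_ids` and `index_ids` are names made up here):

```python
import pandas as pd

# A dataframe whose index is not the default 0..n-1 RangeIndex
df = pd.DataFrame({"value": [10, 20, 30]}, index=["a", "b", "c"])

# Current behaviour when no ids are passed: weights are keyed positionally,
# ignoring df.index entirely
positional_ids = list(range(len(df)))

# Proposed default: key the weights by the dataframe index instead
index_ids = df.index.tolist()

print(positional_ids)  # [0, 1, 2]
print(index_ids)       # ['a', 'b', 'c']
```

The two conventions only coincide when the dataframe happens to carry a default RangeIndex, which is why switching the default is a breaking change.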
The API change proposed here on the example of Rook should then be mirrored to all relevant from_dataframe methods. In some cases (KNN, DistanceBand), it involves nearly no change, as ids already behaves as we want and there is no idVariable. So the main goal here is to ensure the API is consistent.
Relevant discussions: #473, #183, #184 (which does essentially the same thing as this PR), #102
Codecov Report
Merging #477 (c796673) into master (d291e01) will increase coverage by 0.0%. The diff coverage is 69.9%.
@@            Coverage Diff            @@
##           master     #477     +/-   ##
=========================================
  Coverage    78.8%    78.8%
=========================================
  Files         122      122
  Lines       13100    13161     +61
=========================================
+ Hits        10320    10371     +51
- Misses       2780     2790     +10
| Impacted Files | Coverage Δ | |
|---|---|---|
| libpysal/weights/contiguity.py | 77.2% <64.4%> (-1.9%) | :arrow_down: |
| libpysal/weights/gabriel.py | 47.1% <65.2%> (+3.3%) | :arrow_up: |
| libpysal/weights/distance.py | 85.4% <77.8%> (ø) | |
| libpysal/weights/tests/test_contiguity.py | 95.1% <100.0%> (+0.3%) | :arrow_up: |
| libpysal/_version.py | 40.7% <0.0%> (+2.7%) | :arrow_up: |
I have created a from_dataframe_new method to match the ideal outcome of the transition from ids, idVariable and id_order to a cleaner API. What needs to be figured out is a transitioning phase.
So the plan would be to deprecate idVariable and id_order, while refactoring ids for a new api, correct?
If yes, I can start exploring deprecation.
On a related note, turning off the sorting of the ids on init can be done. I explored this in https://github.com/pysal/libpysal/commit/13c89b68038b5cf2b4bcf97abe1f14162b947ac0 and two tests broke in all of weights; both had to do with block_weights. I updated the doctests and unit tests to correct for the case of turning off sort on init in https://github.com/pysal/libpysal/commit/fc0f8b6ee2545d73a3727c14f06ba4d571838ce5. So removing the sort-on-init behavior could be done in the no-sort branch.
So the plan would be to deprecate idVariable and id_order, while refactoring ids for a new api, correct?
Yes. And figure out what to do with positional index vs the dataframe index of when nothing else is passed.
Since 3.6, dicts preserve key insertion order, so the positional index for our w would be w.neighbors.keys(). Prior to 3.6 we had to do all the id_sort business to support alignment between w and the arrays they would be applied to (in spatial lags).
So one thought is that the new w.ids could be a property with a getter that returns w.neighbors.keys().
Users could then sort the array to align with the w.ids order and pass the sorted array into lag_spatial, or we provide a way to reorder w.ids and pass that into the lag_spatial function with the original (unsorted) array.
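A minimal sketch of that property idea, assuming a stripped-down W-like class (`MiniW` is illustrative only, not libpysal's actual implementation):

```python
class MiniW:
    """Stripped-down stand-in for libpysal's W, for illustration only."""

    def __init__(self, neighbors):
        # dicts preserve insertion order since Python 3.6, so the key order
        # of `neighbors` doubles as the positional order of the weights
        self.neighbors = neighbors

    @property
    def ids(self):
        # read-only getter returning ids in insertion (not sorted) order
        return list(self.neighbors.keys())


w = MiniW({"b": ["a"], "a": ["b", "c"], "c": ["a"]})
print(w.ids)  # ['b', 'a', 'c'] -- insertion order, not sorted
```

With such a property, users would reorder their data arrays to match w.ids before applying a spatial lag, rather than relying on the weights being sorted at construction time.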
The timing for the changes might be something like this:
For meta release 23.01 (January 2023), deprecation warnings are added for the old id-related API.
We then target 23.07 (the July release) for the new API.
Maintaining two APIs seems like a painful route, so the above is suggested as a way to avoid this.
What do devs think?
Not sure I fully understand, especially this:
Maintaining two APIs seems like a painful route, so the above is suggested as a way to avoid this.
I agree that we should not have two APIs, but we need a transition period. We cannot just raise a warning in January and make the change in June. Everything breaks in June if you don't give users a way of adjusting to the new API in January.
What I am trying to understand is how we go about this. Given the nature of the changes we are proposing, I can't (yet) see a way where, say: [1] we have deprecation warnings around ids and id_order which would permit legacy code to work but alert the user to upcoming changes; and [2] we also allow for the new functionality of ids in order to point the user to the functionality to adopt.
It's easier if we have a mapping of old_property -> new_property with different names. For ids we don't have this to leverage, as it's ids -> ids.
Maybe this suggests a different name for the property/method in the new API?
The old name is ids; the new ones could maybe be something like iloc and loc?
The other problem is that if we deprecate idVariable and the sorting of ids on init, then that functionality has to coexist peacefully with the new API (which we agree should not sort on init).
Fleshing all this out here is really helpful :->
We don't have to deprecate anything around ids; we just make that variable smarter, to consume either a string representing the column name or any array-like of matching length. That is fully backwards compatible and can be done immediately.
At the same time, we can raise a deprecation warning on idVariable but temporarily pass its value to the enhanced ids.
id_order can now raise a warning and, since we want to remove it without a replacement, that is it.
One thing I still don't know how to approach is the issue of the index vs the positional index. We currently claim in the docstring that the index is used automatically, but it is not. If we just start doing that instead of using the position (iloc), it may break some things (e.g. it would break small bits in momepy). So I would say we should add a new keyword to control that behaviour.
Something like this
@classmethod
def from_dataframe(
    cls, df, geom_col=None, idVariable=None, ids=None, id_order=None, use_index=None, **kwargs
):
    """
    Construct a weights object from a pandas dataframe with a geometry
    column. This will cast the polygons to PySAL polygons, then build the W
    using ids from the dataframe.

    Parameters
    ----------
    df : DataFrame
        a :class:`pandas.DataFrame` containing geometries to use
        for spatial weights
    geom_col : string
        the name of the column in `df` that contains the
        geometries. Defaults to the active geometry column.
    idVariable : string
        the name of the column to use as IDs. If nothing is
        provided, the dataframe index is used. DEPRECATED - use ids
    ids : list
        a list of ids to use to index the spatial weights object.
        Order is not respected from this list.
    id_order : list
        an ordered list of ids to use to index the spatial weights
        object. If used, the resulting weights object will iterate
        over results in the order of the names provided in this
        argument. DEPRECATED
    use_index : bool
        use the index of `df` as ids to index the spatial weights object

    See Also
    --------
    :class:`libpysal.weights.weights.W`
    :class:`libpysal.weights.contiguity.Rook`
    """
    # assumes `import warnings` at module level
    if geom_col is None:
        geom_col = df.geometry.name
    if id_order is not None:
        warnings.warn("id_order is deprecated, don't use it")
        if id_order is True and ((idVariable is not None) or (ids is not None)):
            # if idVariable is None, we want ids. Otherwise, we want the
            # idVariable column
            id_order = list(df.get(idVariable, ids))
        else:
            id_order = df.get(id_order, ids)
    if idVariable is not None:
        if ids is None:
            warnings.warn("idVariable is deprecated, use ids")
            ids = idVariable
        else:
            warnings.warn("both idVariable and ids passed, using ids")
    if ids is None:
        if use_index is None:
            warnings.warn(
                "use_index defaults to False but will default to True in "
                "future. Set True/False directly to control this behaviour "
                "and silence this warning"
            )
            use_index = False
        if use_index:
            ids = df.index.tolist()
    if isinstance(ids, str):
        ids = df[ids]
    if ids is not None and not isinstance(ids, list):
        ids = ids.tolist()
    return cls.from_iterable(
        df[geom_col].tolist(), ids=ids, id_order=id_order, **kwargs
    )
This already supports all we want in the future but is backwards compatible and provides a deprecation period for transition.
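The idVariable -> ids handoff at the heart of the snippet above can be exercised in isolation; `resolve_ids` below is a hypothetical standalone extraction of that logic, not the actual libpysal implementation:

```python
import warnings


def resolve_ids(idVariable=None, ids=None):
    """Map the deprecated idVariable keyword onto ids, warning as needed."""
    if idVariable is not None:
        if ids is None:
            warnings.warn("idVariable is deprecated, use ids", FutureWarning)
            ids = idVariable
        else:
            warnings.warn("both idVariable and ids passed, using ids", FutureWarning)
    return ids


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    assert resolve_ids(idVariable="FIPS") == "FIPS"                # legacy keyword still works
    assert resolve_ids(idVariable="FIPS", ids="GEOID") == "GEOID"  # ids wins
    assert resolve_ids(ids="GEOID") == "GEOID"                     # no warning here
    assert len(caught) == 2
```

Legacy calls keep working with a warning, new-style calls are silent, and conflicting calls resolve deterministically in favor of ids, which is exactly the transition-period behavior discussed above.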
I have updated from_dataframe in Rook and Queen, adding the deprecation and ensuring everything is controllable (use of index or not) and matches the docstring. I have also updated from_dataframe in KNN, Kernel, and DistanceBand by adding the use_index keyword to ensure consistency between all methods.
All changes are backwards compatible. In contiguity, we will later want to change the default of use_index to True to match the default in distance weights (which has already been in use).
@ljwolf how are Delaunay weights indexed? It would be good to have the same options there if possible.
They only take input, and do no id processing iirc, so they create a range index. They're in weights.gabriel
Okay, I added the same options to handle index to Delaunay.from_dataframe as well. The default on use_index will need to change there as well, the situation is the same as we have in contiguity right now.
OK, still some odd behavior around weights ordering here, akin to the issues I outlined in the test bench from #184. I think that part of this fix has to be stopping sorting ids by default.
An example using FIPS codes from the south example data:
import geopandas
from libpysal import weights, examples
examples.load_example("south.shp") # I can't locate south if I don't load it first, even if it's downloaded?
south = geopandas.read_file(examples.get_path("south.shp"))
south_shuffled = south.sample(frac=1, replace=False)
print(weights.Rook.from_dataframe(south, ids="FIPS").id_order[:5])
['01001', '01003', '01005', '01007', '01009']
print(weights.DistanceBand.from_dataframe(south, threshold=2, ids="FIPS").id_order[:5])
['54029', '54009', '54069', '54051', '10003']
Sorting of the ids is happening for contiguity weights, but not distance weights. And passing ids isn't working at all for the Delaunay-family weights:
print(weights.Delaunay.from_dataframe(south.set_geometry(south.centroid), ids="FIPS").id_order[:5])
[0, 1, 2, 3, 4, 5]
print(weights.Gabriel.from_dataframe(south.set_geometry(south.centroid), ids="FIPS").id_order[:5])
[0, 1, 2, 3, 4, 5]
And this fails entirely, because the columns aren't correctly propagated through to the Rook.from_dataframe() call:
weights.Voronoi(south.set_geometry(south.centroid), ids='FIPS')
@ljwolf whoops. I'll have a look at Delaunay/Voronoi stuff later unless you want to commit some changes yourself (which may be faster).
some odd behavior around weights ordering here
Can we solve the ordering issue in another PR and have this focused on passing the ids through? Since this hasn't changed from what we have on master.
Sure! I'm drafting a fix for these inconsistencies now, will send
I've sent a PR upstream (martinfleis/libpysal#35), since I'm not sure if this counts as "fixing" sorting by default.... What I'm referring to is this:
.from_dataframe() methods should always provide weights that are aligned to the input dataframe.
The current state of things is:
- contiguity weights' .from_dataframe() will break the ordering
- distance weights' .from_dataframe() does not break the ordering
- with martinfleis/libpysal#35, Delaunay-based weights will not break the ordering
I'd encourage us to not break order in from_dataframe(), which means that we consider the input to from_dataframe as having a "set" id_order, rather than sorting it.
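The invariant can be stated as a tiny sketch; `build_ids` is a hypothetical stand-in for the id-extraction step, not libpysal code:

```python
def build_ids(rows, sort=False):
    # Extract ids in input order; `sort=True` mimics the legacy
    # sort-on-init behaviour we want to drop
    ids = [r["id"] for r in rows]
    return sorted(ids) if sort else ids


shuffled = [{"id": "54029"}, {"id": "01001"}, {"id": "10003"}]
print(build_ids(shuffled))             # ['54029', '01001', '10003'] -- input order kept
print(build_ids(shuffled, sort=True))  # ['01001', '10003', '54029'] -- legacy sorting
```

Treating the input order as the "set" id_order means the first behaviour, so weights built from a shuffled dataframe stay aligned with that dataframe's rows.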
The order should now be preserved for contiguity as well.
I think I'll add some tests for this before merging.
Should be ready now.
@martinfleis should we merge this?
Unless there are any other comments it is ready to merge, yes.