geopandas icon indicating copy to clipboard operation
geopandas copied to clipboard

ENH: Add id_as_index argument to GeoDataFrame.from_features

Open gcaria opened this issue 3 years ago • 14 comments

Closes #1988, closes #1208.

I've set the default to False so this does not surprise anybody in the future. Probably an error or at least a warning should be raised if id_as_index=True but there is no id in the features ?

gcaria avatar Jul 12 '21 16:07 gcaria

Thanks for looking into this!

I can think of a few cases we need to take care of.

  1. id_as_index=True while no id in the JSON (as you mention). In this case, we should probably just warn.
  2. what if some features contain id while others do not? We should not set a column with None as an index.

Note: updating from master will fix CI.

martinfleis avatar Jul 24 '21 10:07 martinfleis

Thanks for your feedback @martinfleis !

I went for the warning also for case 2, so this is coherent with the treatment of case 1.

gcaria avatar Jul 25 '21 19:07 gcaria

Ahh, not yet. Can you adapt the example in docstring to match the actual output?

martinfleis avatar Aug 08 '21 09:08 martinfleis

@gcaria @martinfleis There was actually some discussion on this way back in #1208; this PR will close that issue as well.

jdmcbr avatar Aug 09 '21 03:08 jdmcbr

Thanks @jdmcbr, I have changed the initial message to reflect this (both issues are now linked to this PR).

gcaria avatar Aug 10 '21 06:08 gcaria

A question here: what happens if your GeoDataFrame has a "id" column?

It seems that then this is preserved in the "properties" of the GeoJSON, but with this PR those values will be overridden by the "id"?

On master you get:

import geopandas
from shapely.geometry import Point

df = geopandas.GeoDataFrame({'col': [0.1, 0.2], 'id': [1, 2], "geometry": [Point(1, 1), Point(2, 2)]})

In [3]: df.__geo_interface__
Out[3]: 
{'type': 'FeatureCollection',
 'features': [{'id': '0',
   'type': 'Feature',
   'properties': {'col': 0.1, 'id': 1},
   'geometry': {'type': 'Point', 'coordinates': (1.0, 1.0)},
   'bbox': (1.0, 1.0, 1.0, 1.0)},
  {'id': '1',
   'type': 'Feature',
   'properties': {'col': 0.2, 'id': 2},
   'geometry': {'type': 'Point', 'coordinates': (2.0, 2.0)},
   'bbox': (2.0, 2.0, 2.0, 2.0)}],
 'bbox': (1.0, 1.0, 2.0, 2.0)}

In [4]: geopandas.GeoDataFrame.from_features(df.__geo_interface__)
Out[4]: 
                  geometry  col  id
0  POINT (1.00000 1.00000)  0.1   1
1  POINT (2.00000 2.00000)  0.2   2

So the "id" column reflects the values in the "properties" field, and not in the "id" field. While with this PR, I think that column will have values [0, 1] from the "id" field.

jorisvandenbossche avatar Sep 25 '21 10:09 jorisvandenbossche

Good catch @jorisvandenbossche !

So ideally when the user calls from_features and sees an id column in the output, the data contained there could be:

  1. the feature IDs when there is no id in the properties dict
  2. the id from the properties dict if it exists (in which case the feature IDs are simply ignored, unless id_as_index is True)

I guess case 2) deserves at least a warning.

gcaria avatar Oct 04 '21 12:10 gcaria

Hey, any news on the state of this feature?

NickLucche avatar Oct 18 '21 12:10 NickLucche

Hey, any news on the state of this feature?

It is scheduled for the 0.11 release that will come out early next year.

martinfleis avatar Oct 18 '21 12:10 martinfleis

Thanks for developing this improvement. I have some comments.

As mentioned above, the identifier named id may collide with a property named id. As a general approach, I suggest to always parse the ids, maybe to an _id column or just id if we want to risk the clash. The best is probably to a column name that can be specified by the user with default value id. Anyway, save it to a column, always and by default, but not turn in automatically into an index.

The boolean parameter, that we could simply call index=True and not id_as_index, would then convert the column to index as last step for user convenience. In this case we would use pandas' reindex capability (that would happen anyway under the hood) that raises an exception or warns in case the index has no values or duplicate values, that we don't check by the way.

It would be more robust and would reuse existing features of pandas.

stefano-xy avatar Oct 22 '21 08:10 stefano-xy

@gcaria , I think the logic in your path is too convoluted, it will be difficult to explain what to expect. I would try to approach it as I wrote in my last comment, that doesn't involve column renames or guessing.

As I'll need to use this feature (that's why I asked help here), I know already in advance I'll have to deal with heterogeneous geofeatures, some with id set, some not, ... I, as user, would need some reliable capability to just extract the id to its own column. Using it as index would be functionality on top, that I might not need now, or implement myself using functions of pandas. That would be fine, as long as just parsing the id in its own column is fast and reliable.

stefano-xy avatar Oct 28 '21 13:10 stefano-xy

I think the code of a package can not adapt to the specific needs of a given user, but instead it should be usable by the largest number of users.

I think the logic in your path is too convoluted, it will be difficult to explain what to expect.

If you think there is a case where this code does not behave as expected, could you point out exactly what that case is and provide some minimal reproducible example?

I would try to approach it as I wrote in my last comment, that doesn't involve column renames or guessing.

Since you seem to have a clear idea of how the code should look like, could you share the actual lines of code with us by suggesting changes in this PR?

gcaria avatar Oct 28 '21 17:10 gcaria

I'm sorry that my comment was negatively perceived.

About me as single user, I'm the one that suggested this improvement based on real needs, I may not represent all the users of the world, but I represent one that is better than zero. From my point of view, it could be difficult to use it the way it is proposed, because behavior is not fully predictable (see _maybe_use_feature_id, and feature_id overriding possible property with the same name). There's also no possibility to read the id without using it as index.

About the second point, I didn't say there's a bug or that the code misbehaves. I just said that what the function tries to do will be difficult to explain to users, e.g. when the id is used as index and when not. I also said code is convoluted, a subjective comment you can of course happily ignore as long as the code does what is supposed to do properly.

Anyway, I share your frustration if people criticize and do nothing, therefore I'll prepare an independent pull request or contribute to this. I may need some time, though, as I don't have a development environment ready for that.

stefano-xy avatar Oct 28 '21 20:10 stefano-xy

Thanks @ragazzojp for your feedback. A discussion is often appropriate, but it is always useful to write some code or a MRE to convey efficiently the message, hence my call to action.

There's also no possibility to read the id without using it as index.

Actually there is such a possibility (also before my very last commit).

see feature_id overriding possible property with the same name

This is a good catch, I have changed the code to avoid this potential problem.

I don't see how the code is complicated: apart for the part that deals with issuing warnings, this commits adds very few lines to geopandas/geodataframe.py. Anyway, the most important part is to cover these new functionalities with tests, which I have done only partially so far.

I am still following the logic that I explained here, but at this stage I'll wait for some feedback from the geopandas members to see what they think.

gcaria avatar Oct 28 '21 22:10 gcaria