geopandas
geopandas copied to clipboard
ENH: Add id_as_index argument to GeoDataFrame.from_features
Closes #1988, closes #1208.
I've set the default to False
so this does not surprise anybody in the future.
Probably an error or at least a warning should be raised if id_as_index=True
but there is no id in the features ?
Thanks for looking into this!
I can think of a few cases we need to take care of.
-
id_as_index=True
while noid
in the JSON (as you mention). In this case, we should probably just warn. - what if some features contain
id
while others do not? We should not set a column withNone
as an index.
Note: updating from master will fix CI.
Thanks for your feedback @martinfleis !
I went for the warning also for case 2, so this is coherent with the treatment of case 1.
Ahh, not yet. Can you adapt the example in docstring to match the actual output?
@gcaria @martinfleis There was actually some discussion on this way back in #1208; this PR will close that issue as well.
Thanks @jdmcbr, I have changed the initial message to reflect this (both issues are now linked to this PR).
A question here: what happens if your GeoDataFrame has a "id" column?
It seems that then this is preserved in the "properties" of the GeoJSON, but with this PR those values will be overridden by the "id"?
On master you get:
import geopandas
from shapely.geometry import Point
df = geopandas.GeoDataFrame({'col': [0.1, 0.2], 'id': [1, 2], "geometry": [Point(1, 1), Point(2, 2)]})
In [3]: df.__geo_interface__
Out[3]:
{'type': 'FeatureCollection',
'features': [{'id': '0',
'type': 'Feature',
'properties': {'col': 0.1, 'id': 1},
'geometry': {'type': 'Point', 'coordinates': (1.0, 1.0)},
'bbox': (1.0, 1.0, 1.0, 1.0)},
{'id': '1',
'type': 'Feature',
'properties': {'col': 0.2, 'id': 2},
'geometry': {'type': 'Point', 'coordinates': (2.0, 2.0)},
'bbox': (2.0, 2.0, 2.0, 2.0)}],
'bbox': (1.0, 1.0, 2.0, 2.0)}
In [4]: geopandas.GeoDataFrame.from_features(df.__geo_interface__)
Out[4]:
geometry col id
0 POINT (1.00000 1.00000) 0.1 1
1 POINT (2.00000 2.00000) 0.2 2
So the "id" column reflects the values in the "properties" field, and not in the "id" field. While with this PR, I think that column will have values [0, 1] from the "id" field.
Good catch @jorisvandenbossche !
So ideally when the user calls from_features
and sees an id
column in the output, the data contained there could be:
- the feature IDs when there is no
id
in the properties dict - the
id
from the properties dict if it exists (in which case the feature IDs are simply ignored, unlessid_as_index
is True)
I guess case 2) deserves at least a warning.
Hey, any news on the state of this feature?
Hey, any news on the state of this feature?
It is scheduled for the 0.11 release that will come out early next year.
Thanks for developing this improvement. I have some comments.
As mentioned above, the identifier named id
may collide with a property named id
. As a general approach, I suggest to always parse the ids, maybe to an _id
column or just id
if we want to risk the clash. The best is probably to a column name that can be specified by the user with default value id
. Anyway, save it to a column, always and by default, but not turn in automatically into an index.
The boolean parameter, that we could simply call index=True
and not id_as_index
, would then convert the column to index as last step for user convenience. In this case we would use pandas' reindex capability (that would happen anyway under the hood) that raises an exception or warns in case the index has no values or duplicate values, that we don't check by the way.
It would be more robust and would reuse existing features of pandas.
@gcaria , I think the logic in your path is too convoluted, it will be difficult to explain what to expect. I would try to approach it as I wrote in my last comment, that doesn't involve column renames or guessing.
As I'll need to use this feature (that's why I asked help here), I know already in advance I'll have to deal with heterogeneous geofeatures, some with id set, some not, ... I, as user, would need some reliable capability to just extract the id to its own column. Using it as index would be functionality on top, that I might not need now, or implement myself using functions of pandas. That would be fine, as long as just parsing the id in its own column is fast and reliable.
I think the code of a package can not adapt to the specific needs of a given user, but instead it should be usable by the largest number of users.
I think the logic in your path is too convoluted, it will be difficult to explain what to expect.
If you think there is a case where this code does not behave as expected, could you point out exactly what that case is and provide some minimal reproducible example?
I would try to approach it as I wrote in my last comment, that doesn't involve column renames or guessing.
Since you seem to have a clear idea of how the code should look like, could you share the actual lines of code with us by suggesting changes in this PR?
I'm sorry that my comment was negatively perceived.
About me as single user, I'm the one that suggested this improvement based on real needs, I may not represent all the users of the world, but I represent one that is better than zero. From my point of view, it could be difficult to use it the way it is proposed, because behavior is not fully predictable (see _maybe_use_feature_id
, and feature_id
overriding possible property with the same name). There's also no possibility to read the id
without using it as index.
About the second point, I didn't say there's a bug or that the code misbehaves. I just said that what the function tries to do will be difficult to explain to users, e.g. when the id is used as index and when not. I also said code is convoluted, a subjective comment you can of course happily ignore as long as the code does what is supposed to do properly.
Anyway, I share your frustration if people criticize and do nothing, therefore I'll prepare an independent pull request or contribute to this. I may need some time, though, as I don't have a development environment ready for that.
Thanks @ragazzojp for your feedback. A discussion is often appropriate, but it is always useful to write some code or a MRE to convey efficiently the message, hence my call to action.
There's also no possibility to read the id without using it as index.
Actually there is such a possibility (also before my very last commit).
see feature_id overriding possible property with the same name
This is a good catch, I have changed the code to avoid this potential problem.
I don't see how the code is complicated: apart for the part that deals with issuing warnings, this commits adds very few lines to geopandas/geodataframe.py
. Anyway, the most important part is to cover these new functionalities with tests, which I have done only partially so far.
I am still following the logic that I explained here, but at this stage I'll wait for some feedback from the geopandas members to see what they think.