arcgis-python-api
GeoAccessor.compare incorrectly shows every row as modified
Describe the bug
When using compare() to compare Spatially Enabled DataFrames that were loaded from different sources (one from an ArcGIS Online layer and one from a geodatabase feature class), I'm getting a result showing that every single row has been modified, when in reality they should be identical.
To Reproduce
My scenario is a bit involved to reproduce, since it seems to be caused by loading the DataFrames from different kinds of sources. However, one thing I noticed was that the GeoSeriesAccessor.JSON attributes of both DataFrames were equal, whereas calling the Pandas to_json() method showed small floating-point differences in the numbers in the geometry field. The code below reproduces that behavior by starting out with geometries that have the small floating-point differences I was seeing:
import pandas as pd
from arcgis.features import GeoAccessor, GeoSeriesAccessor
line1 = [[[-7813752.429099999, 5438372.717100002], [-7813761.6741, 5438384.995700002]]]
line2 = [[[-7813752.4291, 5438372.7171], [-7813761.6741, 5438384.9957]]]
spatial_reference = {"wkid": 102100, "latestWkid": 3857}
df1 = pd.DataFrame(
    [
        {
            "unique_column": 0,
            "SHAPE": {"paths": line1, "spatialReference": spatial_reference},
        }
    ]
)
df2 = pd.DataFrame(
    [
        {
            "unique_column": 0,
            "SHAPE": {"paths": line2, "spatialReference": spatial_reference},
        }
    ]
)
print(df1[df1.spatial.name][0].JSON)
# {"paths":[[[-7813752.4290999994,5438372.7171000019],[-7813761.6741000004,5438384.9957000017]]],"spatialReference":{"wkid":102100,"latestWkid":3857}}
print(df2[df2.spatial.name][0].JSON)
# {"paths":[[[-7813752.4290999994,5438372.7171000019],[-7813761.6741000004,5438384.9957000017]]],"spatialReference":{"wkid":102100,"latestWkid":3857}}
print(df1.SHAPE.to_json())
# {"0":{"paths":[[[-7813752.4290999994,5438372.7171000019],[-7813761.6741000004,5438384.9957000017]]],"spatialReference":{"wkid":102100,"latestWkid":3857}}}
print(df2.SHAPE.to_json())
# {"0":{"paths":[[[-7813752.4291000003,5438372.7171],[-7813761.6741000004,5438384.9956999999]]],"spatialReference":{"wkid":102100,"latestWkid":3857}}}
print(df1[df1.spatial.name][0].JSON == df2[df2.spatial.name][0].JSON)
# True
print(df1[df1.spatial.name][0].equals(df2[df2.spatial.name][0]))
# True
print(df1.SHAPE.to_json() == df2.SHAPE.to_json())
# False
print(df1.spatial.compare(df2, match_field="unique_column")["modified_rows"])
# unique_column SHAPE
# 0 0 {"paths": [[[-7813752.4291, 5438372.7171], [-7...
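To isolate the floating-point effect from the API, here is a minimal standard-library sketch using the first x-coordinate of line1 and line2 from the example above. It only illustrates why exact comparison and text serialization disagree for these values; it makes no claim about how compare() is implemented. The tolerance of 1e-6 is an assumed value chosen for illustration.

```python
import json
import math

# First x-coordinate of line1 and line2 from the example above.
x1 = -7813752.429099999
x2 = -7813752.4291

# The two literals are stored as different doubles, so exact comparison fails.
exactly_equal = x1 == x2

# Serializing each value therefore yields different text as well.
same_text = json.dumps(x1) == json.dumps(x2)

# With an absolute tolerance (1e-6 is an assumed value, in map units),
# the two values are treated as the same coordinate.
close_enough = math.isclose(x1, x2, rel_tol=0.0, abs_tol=1e-6)

print(exactly_equal, same_text, close_enough)
# prints: False False True
```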
Expected behavior
Comparing DataFrames that are loaded from different copies of the same dataset should not show every row as modified. In particular, if GeoSeriesAccessor.equals() returns True when comparing two rows' geometries, as in the example above, those geometries should not cause the row to be listed as modified in the comparison result.
If there's no way to avoid issues with floating-point numbers in geometry comparisons, perhaps the ability to specify a spatial tolerance for the comparison could be added.
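As a rough idea of what a spatial tolerance could mean here, the sketch below compares nested coordinate lists vertex by vertex within an absolute tolerance. The function name coords_close and the default tolerance of 1e-6 are hypothetical and are not part of the ArcGIS API for Python. It also assumes both geometries list their vertices in the same order and with the same density.

```python
import math
from numbers import Real


def coords_close(a, b, tol=1e-6):
    """Compare two nested coordinate lists within an absolute tolerance."""
    if isinstance(a, Real) and isinstance(b, Real):
        return math.isclose(a, b, rel_tol=0.0, abs_tol=tol)
    if isinstance(a, (list, tuple)) and isinstance(b, (list, tuple)):
        # Same number of parts/points/ordinates, and every pair must match.
        return len(a) == len(b) and all(
            coords_close(x, y, tol) for x, y in zip(a, b)
        )
    return False


# Coordinates taken from the reproduction example above.
line1 = [[[-7813752.429099999, 5438372.717100002], [-7813761.6741, 5438384.995700002]]]
line2 = [[[-7813752.4291, 5438372.7171], [-7813761.6741, 5438384.9957]]]

print(line1 == line2)              # exact comparison: False
print(coords_close(line1, line2))  # tolerance comparison: True
```

A vertex-by-vertex check like this would still report a difference for geometries that trace the same shape with reversed direction or extra vertices, so it is a weaker notion of equality than a true geometric comparison.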
Platform:
- OS: Windows 10
- Python API version: 2.2.0.1
@skykasko
How are you getting the JSON for the GeoAccessor? df1.spatial.json?
The compare method uses the pandas merge method, so if you are seeing differences in the pandas to_json output, that would explain why differences are found when merging.
I would do some data manipulation on the data and make sure the coordinates are rounded to the same decimal place. The issue may stem from how the data is stored in ArcGIS Online versus the other source.
@achapkowski do you have any input?
> How are you getting the JSON for the GeoAccessor? df1.spatial.json?
With df1[df1.spatial.name][0].JSON. See the example above, where the expression df1[df1.spatial.name][0].JSON == df2[df2.spatial.name][0].JSON evaluated to True.
> I would do some data manipulation on the data and make sure they are rounded to the same decimal place.
How would you recommend going about this? Calling df1.round(5) has no effect on the shape column, and calling df1[df1.spatial.name].round(5) raises the error AttributeError: 'GeoArray' object has no attribute 'round'.
I suppose I could apply a custom function to the shape column that iterates over all the coordinate values and rounds them one by one. For my example above, the following works:
def round_shape(shape):
    rounded_paths = []
    for path in shape["paths"]:
        rounded_paths.append([[round(coord, 5) for coord in point] for point in path])
    return {"paths": rounded_paths, "spatialReference": shape["spatialReference"]}

df1[df1.spatial.name] = df1[df1.spatial.name].apply(round_shape)
# Manipulating the geometry column appears to make the DataFrame no longer
# recognized as Spatially Enabled, so the geometry column needs to be reassigned.
df1.spatial.set_geometry("SHAPE")
But this could be tricky to get right for every possible geometry type, and at that point using GeoAccessor.compare() may be more complicated than writing a custom comparison function that uses GeoSeriesAccessor.equals() to check for geometry differences and Pandas' built-in DataFrame.compare() to check for other differences.
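The shape of such a custom comparison might look like the plain-Python sketch below. It is a stand-in that works on lists of row dicts (for example, the output of DataFrame.to_dict("records")) and takes the geometry check as a callable, so the equals() method used in the example above could be wrapped and passed in. The names find_modified_rows, geometry_equal, and points_close are hypothetical.

```python
def find_modified_rows(rows_a, rows_b, match_field, geometry_field, geometry_equal):
    """Return match_field values whose attributes or geometry differ.

    rows_a and rows_b are lists of dicts. geometry_equal is any callable
    that takes two geometries and returns True when they count as equal.
    """
    index_b = {row[match_field]: row for row in rows_b}
    modified = []
    for row_a in rows_a:
        row_b = index_b.get(row_a[match_field])
        if row_b is None:
            continue  # Rows missing from rows_b are not "modified".
        # Compare non-geometry attributes exactly.
        attrs_a = {k: v for k, v in row_a.items() if k != geometry_field}
        attrs_b = {k: v for k, v in row_b.items() if k != geometry_field}
        # Compare geometries with the caller-supplied predicate.
        if attrs_a != attrs_b or not geometry_equal(
            row_a[geometry_field], row_b[geometry_field]
        ):
            modified.append(row_a[match_field])
    return modified


def points_close(p, q, tol=1e-6):
    """Toy geometry predicate for (x, y) tuples; tol is an assumed value."""
    return all(abs(a - b) <= tol for a, b in zip(p, q))


rows_a = [
    {"id": 0, "name": "a", "SHAPE": (1.0000000001, 2.0)},
    {"id": 1, "name": "b", "SHAPE": (3.0, 4.0)},
]
rows_b = [
    {"id": 0, "name": "a", "SHAPE": (1.0, 2.0)},
    {"id": 1, "name": "b", "SHAPE": (3.5, 4.0)},
]

print(find_modified_rows(rows_a, rows_b, "id", "SHAPE", points_close))
# prints: [1]
```

Row 0 differs only by floating-point noise and is not reported, while row 1 has a real coordinate change and is.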
@skykasko
Creating a custom round method is the way to go. Otherwise, you can look into other libraries that offer tolerance-based methods, such as almost_equals in the Shapely library, or, if you have arcpy, you can use the generalize operation or boundary.
https://stackoverflow.com/questions/63402333/how-does-almost-equals-function-of-shapely-treat-the-starting-point-and-errors
https://pro.arcgis.com/en/pro-app/latest/tool-reference/editing/generalize.htm
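For the custom rounding route, one way to avoid writing a separate function per geometry type is to walk the geometry's dictionary form recursively and round every float found. This is a sketch under the assumption that each geometry can be treated as nested dicts and lists (as the "paths" example earlier in the thread does). The name round_coordinates is hypothetical, and the result is a plain dict, so the geometry column would still need to be reassigned afterwards.

```python
def round_coordinates(value, ndigits=5):
    """Return a copy of value with every float rounded to ndigits places."""
    if isinstance(value, float):
        return round(value, ndigits)
    if isinstance(value, dict):
        return {k: round_coordinates(v, ndigits) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [round_coordinates(v, ndigits) for v in value]
    return value  # ints (such as wkid), strings, and None pass through


# Shapes taken from the reproduction example earlier in the thread.
spatial_reference = {"wkid": 102100, "latestWkid": 3857}
shape1 = {
    "paths": [[[-7813752.429099999, 5438372.717100002], [-7813761.6741, 5438384.995700002]]],
    "spatialReference": spatial_reference,
}
shape2 = {
    "paths": [[[-7813752.4291, 5438372.7171], [-7813761.6741, 5438384.9957]]],
    "spatialReference": spatial_reference,
}

print(shape1 == shape2)                                        # False
print(round_coordinates(shape1) == round_coordinates(shape2))  # True
```

Because the walker keys on value types rather than on "paths", the same function would also cover dictionaries that store coordinates under other keys, such as "rings" or "x" and "y". Rounding can still leave two nearly equal values on opposite sides of a rounding boundary, so it reduces spurious differences rather than guaranteeing none.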
You can also post on the Esri Community forum to see if other users have come across this or have suggestions.