
Support Null/NA/Empty Geom codification on SHP to handle ArcGIS behavior

Open latot opened this issue 6 months ago • 20 comments

Feature description

Hi all! Long time no see! I have been using SHP and ran into an issue. It is not directly related to the driver, but to the ecosystem.

When we use SHP + ArcGIS, ArcGIS replaces all NaN (float) values with NULL, and then all NULL values are transformed to 0. This has been a pain point in every workflow. Even though we have GPKG, a lot of ppl still use SHP with ArcGIS, so when we want to share a db or a table we need to manually transform these values to a code, and when we load a file we need to replace the codes with the right value (NULL or NA).

I think it would be nice to have an option in SHP where we can specify codes for NaN and NULL, so the data and tables can go back and forth.

There is another similar issue with empty geometries: ArcGIS also transforms these, by removing the rows! So when we want to convert something to SHP for use in ArcGIS, it would be nice to have, for example, a Point(-9999, -9999).

No idea if we could put these codes somewhere in the metadata to automate this even more.

Thx!

Additional context

To clarify what I would expect (just as a rough example):

Generate a file to work with ArcGIS

ogr2ogr arcgis_data.shp data.shp -lco NAN_CODE=-999 -lco NA_CODE=-9999 -lco GEOM_CODE=-9999 -lco MODE=TO_ARCGIS

Recover a valid spatial file

ogr2ogr data.shp arcgis_data.shp -lco NAN_CODE=-999 -lco NA_CODE=-9999 -lco GEOM_CODE=-9999 -lco MODE=FROM_ARCGIS

latot avatar Jun 10 '25 18:06 latot

GDAL seems to support NaN and NULL features and empty geometries in SHP:

import os.path
from osgeo import gdal, ogr

ogr.UseExceptions()

driver = ogr.GetDriverByName("ESRI Shapefile")
path = "test.shp"

if os.path.exists(path):
    driver.Delete(path)

datasource = driver.Create(path, 0, 0, 0, gdal.GDT_Unknown)
layer = datasource.CreateLayer("layer", None, ogr.wkbPoint)
field_defn = ogr.FieldDefn("value", ogr.OFTReal)
layer.CreateField(field_defn)

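# Feature 0: NaN attribute value
feat = ogr.Feature(layer.GetLayerDefn())
feat.SetField(0, float("nan"))
feat.SetGeometry(ogr.CreateGeometryFromWkt("POINT (0 0)"))
layer.CreateFeature(feat)

# Feature 1: attribute left unset (NULL)
feat = ogr.Feature(layer.GetLayerDefn())
feat.SetGeometry(ogr.CreateGeometryFromWkt("POINT (1 1)"))
layer.CreateFeature(feat)

# Close the datasource so everything is flushed to disk
datasource = None

value: Real (24.15)
OGRFeature(test):0
  value (Real) = nan
  POINT (0 0)

OGRFeature(test):1
  value (Real) = (null)
  POINT (1 1)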

lnicola avatar Jun 10 '25 18:06 lnicola

Yes, you are right about that, this is neither a driver issue nor a specification feature.

All these problems have one origin: this is what ArcGIS does to any SHP. It is not SHP, it is ArcGIS doing its own stuff; no idea why they decided to do it like this.

This ArcGIS behavior breaks most workflows with databases, the ecosystem and humans. I know ppl who do spatial work in ArcGIS, and ppl who know more of those ppl, and basically they are not used to "change", so it is very hard to make them use GPKG, for example. And yes, there is no issue if you use GPKG.

Saying something like "let's use GPKG or QGIS" is just too much for a lot of ppl who are used to ArcGIS and SHP. I'm not the only one who has tried this in this context.

So, instead of fighting or trying to change ppl, I think being able to support this in GDAL itself could really simplify workflows and interactions with other systems.

latot avatar Jun 10 '25 18:06 latot

So if ArcGIS converts NaN to NULL, how can GDAL reverse that? And who replaces NULL with 0?

lnicola avatar Jun 10 '25 18:06 lnicola

Basically, in case we want to export something for ArcGIS ppl or systems, we can do this:

  • Transform any NaN to -999
  • Transform any NULL to -9999
  • Transform any empty geometry to Point(-9999, -9999)

(obviously after checking that the codes are not already in use)

Then the resulting file is used by the ArcGIS ppl. When we want to upload or recover a valid file, we do the opposite: find the codes and replace them with the right values.

ArcGIS is also the one that replaces NULL with 0. That really breaks a lot of our dbs; for example, we lose track of which values still need to be collected (they were NULL) and which are really 0.
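A minimal sketch of the round trip described above, in plain Python and without GDAL. Everything here is an assumption for illustration: rows are `(value, geometry)` tuples, `None` stands for NULL or an empty geometry, and the codes are the example ones from this thread.

```python
import math

# Hypothetical sentinel codes taken from the example in this thread.
NAN_CODE = -999.0
NULL_CODE = -9999.0
EMPTY_GEOM = (-9999.0, -9999.0)  # stands in for Point(-9999, -9999)


def encode(rows):
    """Replace NaN, NULL (None) and empty geometries (None) with the codes."""
    # Refuse to encode if a code already occurs as real data.
    for value, geom in rows:
        if value in (NAN_CODE, NULL_CODE) or geom == EMPTY_GEOM:
            raise ValueError("sentinel code already in use")
    out = []
    for value, geom in rows:
        if value is None:
            value = NULL_CODE
        elif math.isnan(value):
            value = NAN_CODE
        out.append((value, EMPTY_GEOM if geom is None else geom))
    return out


def decode(rows):
    """Reverse of encode(): turn the codes back into NaN, NULL and empty geometry."""
    out = []
    for value, geom in rows:
        if value == NULL_CODE:
            value = None
        elif value == NAN_CODE:
            value = float("nan")
        out.append((value, None if geom == EMPTY_GEOM else geom))
    return out
```

The round trip is only lossless while the codes never occur as real data, which is why `encode()` checks first.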

latot avatar Jun 10 '25 18:06 latot

I feel stupid, but I have no idea what "ppl" means.

jratike80 avatar Jun 10 '25 23:06 jratike80

Null shapes are supported in shapefiles https://www.esri.com/content/dam/esrisites/sitecore-archive/Files/Pdfs/library/whitepapers/pdfs/shapefile.pdf

A shape type of 0 indicates a null shape, with no geometric data for the shape. Each feature type (point, line, polygon, etc.) supports nulls; it is valid to have points and null points in the same shapefile

If ArcGIS does not support null shapes in a shapefile, contact your ESRI dealer and try to get a bug report accepted.
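To make the null shape concrete, here is a small sketch that reads the shape type of each record, following the record layout in the whitepaper linked above (big-endian record header, little-endian shape type). The helper name `shape_types` is made up for this example, and the two records are built by hand rather than read from a real file.

```python
import struct


def shape_types(records: bytes):
    """Yield the shape type of each record in the body of a .shp file.

    `records` is the file content after the 100-byte main header.
    """
    offset = 0
    while offset < len(records):
        # Record header: record number and content length (in 16-bit words),
        # both big-endian 32-bit integers.
        _number, length_words = struct.unpack_from(">ii", records, offset)
        # The record content starts with a little-endian 32-bit shape type.
        (shape_type,) = struct.unpack_from("<i", records, offset + 8)
        yield shape_type
        offset += 8 + 2 * length_words


# Build two records by hand: a point (type 1) and a null shape (type 0).
point = struct.pack(">ii", 1, 10) + struct.pack("<idd", 1, 0.0, 0.0)
null = struct.pack(">ii", 2, 2) + struct.pack("<i", 0)
print(list(shape_types(point + null)))
```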

For attributes, shapefiles use the dBase III (or dBase IV) format. As far as I understand, it supports neither NaNs nor NULLs. I am not sure if https://github.com/OSGeo/gdal/issues/2486 is related.

Would you feel bad if I closed this ticket soon as something like "not planned" or "won't fix"?

jratike80 avatar Jun 11 '25 00:06 jratike80

Hi, "ppl" is people. Sadly I already contacted ESRI, and this is not going to get a fix.

I thought it would be nice to handle this behavior in GDAL, which tries to handle different drivers and the circumstances around them.

I know this is really outside the driver itself, but it also seems the best place to handle it from the ecosystem perspective.

As a note, I have used SHP with R/Python with NULL and empty geometries without any issue. IIRC some Esri servers also use such codes, which makes their data really hard to read: each server or layer can have its own codes.

latot avatar Jun 11 '25 13:06 latot

@jratike80

Would you feel bad if I closed this ticket soon as something like "not planned" or "won't fix"?

Well, yes, a little.

  • We can't change ArcGIS behavior through tickets
  • We are used to change, especially coding ppl, but for spatial ppl who do not code, the culture is not to change. They learn SHP + ArcGIS at university, so changing ArcGIS or SHP is a high-level organizational change.
  • Even if we change, a lot of institutions and enterprises still request "only SHP" and also use ArcGIS; this is really a big cultural change

My personal feeling: I like open source things, and I try to use and promote them. This issue has been one of the hardest to handle in the workflow. We can't escape from ppl using SHP + ArcGIS rn, but I also think it is good to be able to move data directly from the public sector to the open source one and back; most of the time, when we work with others, we need both.

But I understand that this is very specific to ArcGIS + ArcGIS Server/SHP and does not concern GDAL directly. Still, GDAL is one of the most popular and useful tools to load and handle data from many sources and destinations, which makes it a perfect place to simplify this problem, which does happen in the spatial industry.

I don't know if an lco option is the best one, nor if there is an easy way to put something like this in GDAL. @lnicola said in Matrix that this could be a very big change, so evaluating the effort is also a good point.

Thx for reading, and please do not close this issue too fast, so we can talk about it :D

latot avatar Jun 11 '25 15:06 latot

I may have misunderstood, but it seems that you are trying to use shapefiles (and the dBase III format for the .dbf part) for something that the format does not support natively. And as a solution you suggest adding something rather complex to the GDAL shapefile driver. I suppose that besides you there are not many users who would benefit from that change.

I would really consider using other formats that natively support your requirements. That is a solution that would really simplify the workflows and interactions with other systems. Alternatively, if your data are in a database, you could create a view that maps NaNs and NULLs into your favorite numbers. Or you can do that on the fly with GDAL tools and -sql. Something like the following (untested):

-sql "select geometry, CASE WHEN attribute=NaN THEN -999 WHEN attribute IS NULL THEN -9999 ELSE attribute END AS attribute from my_table"
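A small sketch of the NULL branch of such a CASE expression, using Python's built-in `sqlite3` module as a stand-in. This is an assumption for illustration only: it is not GDAL's `-sql` and it does not touch a shapefile, and the NaN branch is left out on purpose. It also shows that a comparison written as `= NULL` never matches in SQL, so the NULL test needs `IS NULL`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (attribute REAL)")
conn.executemany("INSERT INTO my_table VALUES (?)", [(1.5,), (None,), (0.0,)])

# Map NULL to the code -9999 and keep every other value, including a real 0.
query = """
SELECT CASE WHEN attribute IS NULL THEN -9999 ELSE attribute END
FROM my_table ORDER BY rowid
"""
mapped = [row[0] for row in conn.execute(query)]

# `attribute = NULL` evaluates to NULL, so no row is ever selected by it.
equals_null = conn.execute(
    "SELECT COUNT(*) FROM my_table WHERE attribute = NULL"
).fetchone()[0]
```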

jratike80 avatar Jun 11 '25 16:06 jratike80

-sql was my suggestion, though it might be hard to use for NaNs. Shapefiles do support NULLs natively, but it sounds like ArcGIS doesn't, so latot wants a built-in workaround for that.

lnicola avatar Jun 11 '25 16:06 lnicola

ESRI believes that shapefiles do not support NULL attributes https://desktop.arcgis.com/en/arcmap/latest/manage-data/shapefiles/geoprocessing-considerations-for-shapefile-output.htm

jratike80 avatar Jun 11 '25 16:06 jratike80

Yeah, they don't support NULL in shapefiles, they suggest GDB for that use case.

lnicola avatar Jun 11 '25 16:06 lnicola

Right, it is just that culture hits hard here. I'm not the only one who has tried to change this at the organization level, but the ppl who use spatial data and do not code are not flexible about this, as I wrote in https://github.com/OSGeo/gdal/issues/12552#issuecomment-2963184305. We have tested and proposed alternatives several times, but for the reasons written there it is not at all easy to replace SHP with another format. Other institutions still force you to use it, so it is not only the team workflow that is affected.

The query part is not easy, and not just because of the query itself. The use of GDAL and spatial files is not limited to a single file; this hits entire databases and how you port them. The usual workflow is:

get a spatial file -> work -> notice some values are wrong when you use it -> notice there is a -9999 -> remember or discover that someone used ArcGIS or that this comes from ArcGIS/ArcGIS Server -> fix the column in the file -> continue working until you discover this again

It is usual for most ppl to handle this manually. I'm surprised this was not known here; I have met other ppl who use tools built on GDAL and also hit this. All the teams just do the work of replacing the values when they notice them. The major issue is not being able to know whether a file uses such an encoding, is broken, or had only some columns replaced manually. Maybe this issue never reached GDAL as a proposal because it happens a lot more in places where ESRI is present, where there are usually not many coding ppl, or because no one had this idea.

I think the most comfortable place to solve this workflow is GDAL, because it handles the formats and most of the behaviors in these workflows. At the same time, there can be other solutions, such as a separate tool run from an external script as a pre- or post-processing step when we want to export/import data. Which is, in particular, almost the same as what GDAL does. Come to think of it, it seems that ArcGIS, with its limited support of SHP, basically created a sub-driver for SHP, where the culture just filled the holes with -9999. (Still, remember that ppl will change neither SHP nor ArcGIS: they can't because of their culture, or they can't due to requests from other places.)

latot avatar Jun 11 '25 19:06 latot

I'm echoing others' skepticism about the need to add hacks in the Shapefile driver. GDAL has full read/write support for GeoPackage and FileGeodatabase, the latter being the premium format for ArcGIS, so that should be the way to go.

rouault avatar Jun 11 '25 19:06 rouault

Not sure if it makes it better or worse, but latot probably wants this in the ogr2ogr core (like the field type args), so it works with every format.

lnicola avatar Jun 11 '25 19:06 lnicola

Not sure if it makes it better or worse, but latot probably wants this in the ogr2ogr core (like the field type args), so it works with every format.

We probably need a more general "gdal vector field-transform" algorithm where parts of the operations described in this ticket could be done. ogr2ogr feature set is now frozen :-)

rouault avatar Jun 11 '25 19:06 rouault

Hi @rouault, right! That is also a nice option! If there is a simple way to do the round trip with it, I would also consider this solved.

latot avatar Jun 11 '25 19:06 latot

A field-transform algorithm could make sense and be more useful than a shapefile-specific NaN mapper. Perhaps the algorithm could use a mapping table with the columns table, field, input_value and output_value on each row. Perhaps table and field could accept a wildcard *. And input and output might support simple expressions like lower or input_value/1000. Something like in this article: https://medium.com/@nripapathak/field-mapping-file-in-etl-in-better-design-c7d48619a6e9
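A rough sketch of what such a mapping table could look like, in plain Python. The row layout, the `transform` helper and the sample rows are all hypothetical; no such algorithm exists in GDAL as far as this thread says. Expressions like `input_value/1000` are not covered, only literal values and the `*` wildcard.

```python
from fnmatch import fnmatchcase

# Hypothetical mapping rows: (table, field, input_value, output_value).
# "*" matches any table or field name; None stands for NULL.
MAPPING = [
    ("*", "*", None, -9999),
    ("roads", "width", 0, None),
]


def transform(table, field, value, mapping=MAPPING):
    """Return the output of the first matching row, else the value unchanged."""
    for m_table, m_field, m_input, m_output in mapping:
        if (
            fnmatchcase(table, m_table)
            and fnmatchcase(field, m_field)
            and value == m_input
        ):
            return m_output
    return value
```

Rows are tried in order, so a specific rule that should win over a wildcard rule has to come first. NaN would need its own test, since it never compares equal to anything.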

But the result would be so much ETL that I wonder if there happens to be already some simple open source geospatial ETL tool that can do the task.

jratike80 avatar Jun 11 '25 20:06 jratike80

Hmm, I think that to address this issue it would be something like the ETL, but using field types instead of file names for the conversions.

This would also need something like an assert, to check that the codes to be used are not already in use.
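Such a check could be sketched like this, assuming the table is available as a plain dict of column name to list of values. The function name and data layout are made up for the example.

```python
def find_code_conflicts(columns, codes):
    """Map each column name to the sentinel codes that already occur in its data."""
    conflicts = {}
    for name, values in columns.items():
        # A code is "in use" if any existing value compares equal to it.
        used = sorted(c for c in set(codes) if c in values)
        if used:
            conflicts[name] = used
    return conflicts


table = {
    "depth": [1.0, -999.0, None],
    "width": [2.0, float("nan")],
}
print(find_code_conflicts(table, [-999, -9999]))
```

An empty result means the codes are safe to use for that table; anything else names the columns where a different code would have to be picked.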

latot avatar Jun 11 '25 20:06 latot

rn I think using an external file is... complex. On one side it is nice to be able to have it: field descriptions (SHP has a 10-character limit on field names), saved codes, etc. The complex part is that no app would use it, so it is very easy for the two to get out of sync. In that case a validation step would be necessary, to check that the actual file is coherent with the table.
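A sketch of such a validation step, assuming a made-up JSON sidecar layout and a plain dict of field name to type as the description of the actual file. Nothing here is an existing GDAL or ArcGIS convention.

```python
import json

# Hypothetical sidecar content: one entry per field with its type and codes.
SIDECAR = json.loads("""
{
  "fields": {
    "depth_m": {"type": "Real", "null_code": -9999, "nan_code": -999},
    "name": {"type": "String", "null_code": "N/A"}
  }
}
""")


def validate(sidecar, layer_fields):
    """Return a list of problems; `layer_fields` maps field name to type."""
    problems = []
    for name, spec in sidecar["fields"].items():
        if len(name) > 10:
            problems.append(f"{name}: longer than the 10-character DBF limit")
        if name not in layer_fields:
            problems.append(f"{name}: described in sidecar but missing from file")
        elif layer_fields[name] != spec["type"]:
            problems.append(f"{name}: type {layer_fields[name]} != {spec['type']}")
    for name in layer_fields:
        if name not in sidecar["fields"]:
            problems.append(f"{name}: present in file but not in sidecar")
    return problems
```

An empty list means the sidecar and the file agree on field names and types; it does not prove that the codes were actually applied to the data.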

latot avatar Jun 11 '25 21:06 latot