HUGE_VAL vs NAN
Hi,
I only know PROJ through pyproj, but as far as I can see PROJ uses HUGE_VAL (that means +Infinity) from math.h as a fill value whenever a projection operation fails. This is also what pyproj returns in these cases.
I would like to ask whether the usage of HUGE_VAL is intentional, or if it could make sense to use NAN (also from math.h) instead. In my opinion this would have two advantages:
- Selecting data points based on their coordinate value would be more intuitive. For example selecting data points with
longitude > 0currently also includes those withlongitude=HUGE_VAL(e.g. space pixels in the geostationary projection). - In the xarray ecosystem
NaNhas become the standard fill value. Arithmetic withinfvalues is different, so there is a risk for undesired behaviour if users are not aware of their presence.
Of course it would be possible to discuss this somewhere downstream in pyproj or pyresample (which is what I'm working with). But pyproj probably wants to be coherent with PROJ, and pyresample strives to stay coherent with pyproj, so I thought I'd ask here first.
Cheers, Stephan
Using HUGE_VAL has been what PROJ has done for 30 years. Changing that would certainly break downstream users, and we have done quite many breaking changes in the recent years. My personal position on this is that there's no critical need for such a change. If downstream users want to expose a different value for invalid, they can certainly do that
Thanks @rouault for your swift reply! I can understand your position, we'll discuss it downstream then.
Maybe this can be a PROJ 9+ change :). It would definitely be more performant to not have to change all of the values downstream. One idea I had was adding a flag so users could opt-in to using NaN instead if Inf for invalid values.
If downstream users want to expose a different value for invalid, they can certainly do that
@rouault, are there any cases when HUGE_VAL is returned due to the math versus explicitly setting the value to HUGE_VAL because it is bad?
Another option would be to add a c-api method like proj_set_nodata_value() or proj_set_nodata_nan() that allows the user to choose the nodata value used by PROJ and default to Inf if it is not set.
Would probably need to add a is_nodata function and can modify the return value of proj_coord_error in the code (and probably other places as well).
$ find ../src ../test -name "*.cpp" -exec grep HUGE_VAL {} \; | wc -l
209
@rouault would you be opposed to a PR adding the ability to change the nodata?
would you be opposed to a PR adding the ability to change the nodata?
Maybe @kbevers @busstoptaktik want to chim in ? Looks like a (lot of) pain for little gain to me, but I guess it would have little adverse implications. From a pure theoretical point of view, it might have a tiny performance impact (comparing to a constant vs to a global variable) but it would probably be not measurable in practice. Probably you wouldn't need to change all HUGE_VAL occurences but only user facing ones. You could for example just remap from HUGE_VAL to nodata in the proj_trans_XXX() family, which once we've dropped proj_api.h, will be the only entry points to get a transformed coordinates if I'm not mistaken. But perhaps doing it in the whole code base is safer/clearer. I dunno...
HUGE_VAL vs NaN is a bit anecdotical IMHO. More fundamentally, error reporting in PROJ is a bit messy currently with errno-style mechanism and HUGE_VAL being both used. In particular when transforming several points it is not clear what we should do if only a subset of points fail to transform (and points might fail to transform for different reasons !). That's a more "interesting" (ie complicated) topic.
Brainstorming a bit with @hobu , one subtelty that appears is that if within the same process a PROJ "user" (ie library / component ) changes the nodata value, it is fine. But if another unrelated one still assumes the hard-coded HUGE_VAL, then it will break its assumptions. Or even if 2 components try to set a different nodata value... So changing HUGE_VAL as a global setting could cause some issues. Unless its scope is only to affect a given PJ_CONTEXT. proj_trans() having no explicit context, that would be the context that was used to create the PJ* object.
one subtelty that appears is that if within the same process a PROJ "user" (ie library / component ) changes the nodata value, it is fine. But if another unrelated one still assumes the hard-coded HUGE_VAL, then it will break its assumptions. Or even if 2 components try to set a different nodata value...
Mind giving a more specific example of how this could happen? Would this be an issue internally to PROJ? Or for the external "user"?
Mind giving a more specific example of how this could happen?
Assuming the nodata value will be global and not per-context, this is quite straightforward:
- Library A : proj_set_no_data(NaN)
- Library B living in same process and not aware of A and the fact that HUGE_VAL is no longer hardcoded : c = proj_trans(....) ; if (c.xy.x == HUGE_VAL) deal_with_error()
or
- Library B: proj_set_nodata(HUGE_VAL)
- Library A : proj_set_no_data(NaN)
- Library B: c = proj_trans(....) ; if (c.xy.x == HUGE_VAL) deal_with_error()
Interesting, I was unaware that different libraries/processes using PROJ at the same time could impact global values like that. I was thought that they were more containerized.
Is this also true for the global context?
Side note, here is the API design I was thinking about:
proj.h:
void proj_set_nodata(double value);
int proj_is_nodata(double value);
proj_internal.h
double PROJ_NODATA;
I was thought that they were more containerized.
If you change a global value, then well it is globally visible :-) Let's say that a Python user uses a pyproj version that would change the nodata to be NAN, but at the same time uses let's say PDAL or GDAL python API and PDAL or GDAL when doing PROJ operations still assume error to be HUGE_VAL, then you break PDAL or GDAL.
Is this also true for the global context?
Well yes, how could that no be the case... ? There's no magic. People using PROJ in a library (to be opposed as in the "main" of their application) should NOT mess with the global PROJ context because it isn't thread safe, and even if you don't do multi-threading, different libraries that modify the global context will influence each other without realizing it. Chaos ensured.
So the only reasonable scope would be to change the nodata for a given context. So I'm -1 on "void proj_set_nodata(double value);"
People using PROJ in a library (to be opposed as in the "main" of their application) should NOT mess with the global PROJ context because it isn't thread safe, and even if you don't do multi-threading, different libraries that modify the global context will influence each other without realizing it. Chaos ensured.
Hmmm, not sure how to avoid that. I remember having issues if I didn't set the data directory/log function on the global context in pyproj. Not sure what chaos it causes :upside_down_face:. Would be interesting to test.
The chaos in action with conda:
>>> from osgeo import osr
>>> osr.GetPROJSearchPaths()
['/home/snowal/.local/share/proj', '/home/snowal/miniconda/envs/gdal/share/proj']
>>> import pyproj
>>> osr.GetPROJSearchPaths()
['/home/snowal/miniconda/envs/gdal/share/proj']
Not an issue with wheels ...
API:
proj.h:
void proj_set_nodata(PJ_CONTEXT*context, double value);
int proj_is_nodata(PJ_CONTEXT*context, double value);
Maybe @kbevers @busstoptaktik want to chim in ?
I agree with your conclusions @rouault. This could lead to a big mess.
HUGE_VAL vs NaN is a bit anecdotical IMHO. More fundamentally, error reporting in PROJ is a bit messy currently with errno-style mechanism and HUGE_VAL being both used. In particular when transforming several points it is not clear what we should do if only a subset of points fail to transform (and points might fail to transform for different reasons !). That's a more "interesting" (ie complicated) topic.
I'd be willing to accept a change from HUGE_VAL to NaN if it came as part of a larger effort that completely rewires error handling in PROJ. Perhaps something like returning infity coordinates when an input coordinate lies outside the projectable area and NaN when parameters or something else is the cause of a bad output coordinate. Just an idea of the top of my head, of course more careful thought is needed for a proper error handling revamp!
I'd be willing to accept a change from
HUGE_VALtoNaNif it came as part of a larger effort that completely rewires error handling in PROJ. Perhaps something like returning infity coordinates when an input coordinate lies outside the projectable area andNaNwhen parameters or something else is the cause of a bad output coordinate. Just an idea of the top of my head, of course more careful thought is needed for a proper error handling revamp!
I agree. NaN is a better indicator of "you asked me to do something ridiculous" than HUGE_VAL. However it would be good to think through what it really means. It would be also good to allow NaN as input; and this shouldn't involve an explicit test but may requires some tweaking of conditionals.
I agree.
NaNis a better indicator of "you asked me to do something ridiculous" thanHUGE_VAL. However it would be good to think through what it really means. It would be also good to allowNaNas input; and this shouldn't involve an explicit test but may requires some tweaking of conditionals.
Not the least because HUGE_VAL in IEEE754 compliant arithmetic is identical to INFINITY, which could be a perfectly valid return value for e.g. the Mercator northing for 90°N.
I also agree with @cffk that accepting NaN as input is reasonable. My opinion is that passing NaNs to any PROJ operation should turn that operation into a no-op. This should probably be built into the dispatch logic, rather than into the (huge number of) individual operation implementations (where in many cases, due to IEEE754 arithmetic, where quiet NaNs are just passed on, the end result will be the same).
There is, however, probably numerous devils in the details: For example, for a 4D pipeline, handling 2D coordinates, should the third and fourth coordinate of the tuple be set to NaN? And how should the dispatch logic handle that?
An interesting element of this discussion regarding HUGE_VAL vs. NaN is that PROJ predates the 1985 IEEE785 formalization of the NaN concept by at least a few years. So it's probably not surprising that Gerald I. Evenden in 1983 chose HUGE_VAL rather than NaN to signal that something went badly wrong.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
With #2487, is this something that could be added to 8.0?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Ping to keep open.
For reference, cartopy converts errors to nan: https://github.com/SciTools/cartopy/blob/7c75f5ab6a397bca3197774b063820b6aabe87f9/lib/cartopy/_crs.pyx#L448