cf-python icon indicating copy to clipboard operation
cf-python copied to clipboard

Set numerical tolerance when doing comparisons with cf.Query objects

Open davidhassell opened this issue 5 years ago • 5 comments

Currently, the numerical queries cf.le, cf.ge and cf.wi are susceptible to rounding issues, as the inequality operators can not be modified by the settings of rtol and atol.

This could be changed by replacing the operations with a strict inequality operation and a numerically tolerant equality operation:

E.g.

cf.le(5) = 5.00000000000001

would effectively be implemented as

5 < 5.00000000000001 | numpy.allclose(5, 5.00000000000001, rtol=rtol, atol=atol)

This would help greatly when using cf.Query in, for example, subspace definitions.

API: the functions should have rtol and atol keyword parameters. (cf.wi(3, 4, rtol=45, atol=0.0002))

Possible issue: Under this scheme, cf.wo would not be the exact inverse of cf.wi

davidhassell avatar Dec 17 '20 17:12 davidhassell

Possible issue: Under this scheme, cf.wo would not be the exact inverse of cf.wi

Ooh that sounds potentially a bit nasty... fun for a later release I suppose!

sadielbartholomew avatar Dec 17 '20 17:12 sadielbartholomew

Currently, == and != are on cf.Data objects are always numerically tolerant, using the current environmental settings of rtol and atol (which default to the system epsilon) https://github.com/NCAS-CMS/cf-python/blob/master/cf/data/data.py#L4790-L4795

                if method == '__eq__':
                    array0 = numpy.isclose(array0, array1, rtol=rtol, atol=atol)
                elif method == '__ne__':
                    array0 = ~numpy.isclose(array0, array1, rtol=rtol, atol=atol)

Perhaps this is not appropriate. all six comparison operation or should be intolerant, as they are in numpy, This would open the way up for cf.Query objects to provide a way of setting tolerance on a case-by-case basis. Something like:

  • cf.eq(6) - not tolerant
  • cf.eq(6, tol=True) - tolerant, uses environment settings of atol, rtol
  • cf.eq(6, rtol=0.000001, atol=1e-9) - tolerant, use the given values for rtol, atol

and similarly for the other five comparison operators.

davidhassell avatar Feb 08 '21 09:02 davidhassell

Interesting points, @davidhassell. In my eyes the ideal case is to allow full configuration of the tolerance (or lack of) in some simple and intuitive way, and your idea:

This would open the way up for cf.Query objects to provide a way of setting tolerance on a case-by-case basis. Something like: cf.eq(6) - not tolerant cf.eq(6, tol=True) - tolerant, uses environment settings of atol, rtol cf.eq(6, rtol=0.000001, atol=1e-9) - tolerant, use the given values for rtol, atol

sounds like a neat way to do that.

The one point I think that might be debatable about it is whether the default should be to have no tolerance (as suggested above i.e. tol=False), or the global tolerance (i.e. tol=True), as I think different users might prefer different things. Could that in itself be a configuration option (a Boolean global item default_tolerant or something) or am I being overzealous on configurability (it's certainly possible)...

Either way, I support the suggestions made.

sadielbartholomew avatar Feb 08 '21 14:02 sadielbartholomew

... re-reading my comment I am thinking I might have missed the point by questioning the default being zero tolerance.

I thought I'd check the Python Array API Standard (though you may already have done so) in case there was anything there to guide, but there's nothing on tolerance per-se in the location I'd expect it, namely: https://data-apis.github.io/array-api/latest/design_topics/accuracy.html.

sadielbartholomew avatar Feb 08 '21 14:02 sadielbartholomew

I hadn't checked the Array API Standard, just what I knew of numpy - I guess it's no surprise that they are consistent. I like your idea of a "default tolerant" config variable. I think things are a bit inconsistent right now. For example, Data.equals is always tolerant, but comparisons in cf.aggregate are, by default, not (although they can be forced to be, at the expense of performance).

I don't think we should leap into anything soon - rather think about it after the migration to dask has occured. Best not change our API before then, unless we really have to.

davidhassell avatar Feb 08 '21 16:02 davidhassell

Fixed by #674

davidhassell avatar Nov 22 '23 11:11 davidhassell