EasierRDF
Address Pitfalls of Numerical Datatypes in RDF
There are a couple of issues with numerical datatypes that make the accurate use of RDF for numerical data error-prone.
The use of xsd:float and xsd:double entails a risk
- of value distortion in the mapping between lexical space and value space (e.g. "0.1"^^xsd:float is typically mapped to the value 0.1000000014901161), and
- of numerical issues in the processing (e.g. calculations in SPARQL queries) of the represented values, i.e. underflow errors, overflow errors, rounding errors, cancellation, and error accumulation.
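As a rough illustration of both risks, the following Python sketch reproduces the value that "0.1"^^xsd:float ends up with and shows a small case of error accumulation. It assumes that the lexical mapping rounds to the nearest IEEE 754 binary32 value; the helper name `float32_value` is made up for this example and is not part of any RDF library.

```python
import struct
from decimal import Decimal


def float32_value(lexical: str) -> Decimal:
    """Return the exact binary32 value a lexical form is rounded to.

    Assumption: the string is first parsed as binary64 and then rounded to
    binary32. In rare cases this double rounding can differ from rounding
    the lexical form directly; it does not for the example below.
    """
    (as_float32,) = struct.unpack("<f", struct.pack("<f", float(lexical)))
    return Decimal(as_float32)  # exact decimal expansion of the binary value


# Value distortion in the lexical mapping
print(float32_value("0.1"))  # 0.100000001490116119384765625

# Error accumulation in processing: ten times 0.1 is not 1 in binary64 ...
print(sum([0.1] * 10))  # 0.9999999999999999

# ... but it is with decimal arithmetic
print(sum([Decimal("0.1")] * 10))  # 1.0
```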
In most cases, xsd:decimal would be a better choice:
- In particular, I disagree with XML Schema Datatypes in RDF and OWL, W3C Working Group Note 14 March 2006 on the point that xsd:float and xsd:double are the appropriate datatypes for measurements. In my view, this only holds for measurements that originate from binary floating point sources (e.g. numeric calculations or outputs of analog-to-digital converters). Other measurements typically have a value and the measurement uncertainty of the measurement device used, resulting in a representation by two precise values, which should both be represented with xsd:decimal.
- Another exception are cases where a representation of Infinite is required, which is only provided by xsd:float and xsd:double.
The use of xsd:decimal for value representation does not considerably impede the use of floating point arithmetic for calculations (e.g. for performance reasons), as the conversion is trivial. In contrast, if rounding of the lexical representation must be avoided, the other direction would require custom lexical mappings that do not conform to the standard and are (depending on the framework) probably cumbersome to implement, and it is not always possible (e.g. inside SPARQL queries).
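The asymmetry between the two directions can be sketched in Python, here using the standard `decimal` module as a stand-in for xsd:decimal and Python's `float` (IEEE 754 binary64) as a stand-in for xsd:double. This is only an illustration under those assumptions, not a statement about any particular RDF framework.

```python
from decimal import Decimal

# Direction 1: decimal -> binary floating point is a single conversion.
stated = Decimal("0.1")
approx = float(stated)

# Direction 2: the binary value no longer equals the stated decimal ...
print(Decimal(approx) == stated)  # False

# ... and different lexical forms collapse into the same binary64 value,
# so the originally stated lexical form cannot be recovered from it.
print(float("0.1") == float("0.10000000000000001"))  # True

# repr() returns the shortest string that round-trips, which is a
# heuristic guess at the stated value, not the stated value itself.
print(repr(approx))  # 0.1
```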
However, I don't see awareness of these issues in general, and especially not in teaching material.
Further, RDF unnecessarily inherits limitations from XSD: exponential notation is only supported for xsd:float and xsd:double, but not for xsd:decimal (and derived datatypes). It was not included in xsd:decimal because the requirement was already met by the precisionDecimal datatype, which, however, did not become a built-in datatype in RDF. This tempts users to use xsd:double even where it is not appropriate. The shorthand syntax in Turtle, TriG and SPARQL amplifies this further, as xsd:double might be used even if not intended.
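To make the shorthand effect concrete, here is a small Python sketch that classifies a numeric token the way the Turtle grammar does (productions INTEGER, DECIMAL and DOUBLE). The function name `shorthand_datatype` is hypothetical, and the sketch only covers the token level, not a full parser.

```python
import re

# Token patterns following the INTEGER, DECIMAL and DOUBLE productions
# of the Turtle grammar.
_EXPONENT = r"[eE][+-]?[0-9]+"
_PATTERNS = [
    ("xsd:integer", re.compile(r"[+-]?[0-9]+")),
    ("xsd:decimal", re.compile(r"[+-]?[0-9]*\.[0-9]+")),
    ("xsd:double", re.compile(
        rf"[+-]?(?:[0-9]+\.[0-9]*{_EXPONENT}"
        rf"|\.[0-9]+{_EXPONENT}"
        rf"|[0-9]+{_EXPONENT})"
    )),
]


def shorthand_datatype(token: str):
    """Return the datatype implied by a numeric shorthand token, or None."""
    for datatype, pattern in _PATTERNS:
        if pattern.fullmatch(token):
            return datatype
    return None


# The same quantity silently changes datatype with the notation used.
print(shorthand_datatype("0.000123"))  # xsd:decimal
print(shorthand_datatype("1.23e-4"))   # xsd:double
```

A curator who merely prefers the compact notation for a very small or very large number thereby ends up with xsd:double without ever naming that datatype.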
(A more detailed discussion of the issues can be found in arXiv:2011.08077 and some reviewer comments on it.)
Possible Actions
I think the following actions would help to ease the accurate representation of numbers in RDF:
1. Enable exponential notation for xsd:decimal (and derived datatypes) in RDF.
2. Emphasize in teaching material the implied risk of numerical issues and the only partial coverage between lexical space and value space of xsd:float / xsd:double, which results in rounded values after the lexical mapping.
3. Enable tools to hint at the use of xsd:decimal in favor of xsd:float and xsd:double, and to warn users if a lexical xsd:float or xsd:double value was entered that would require rounding during the lexical mapping.
4. Maybe change the Turtle, TriG and SPARQL syntax to use exponential notation as shorthand syntax for xsd:decimal instead of xsd:double.
Actions one to three would not cause any backward compatibility problems. Action four, however, would obviously cause backward compatibility problems in software, but might at the same time increase the accuracy of value representations in existing RDF documents without changing them.
Further, one could think about adding mandatory support for precisionDecimal (to have an arbitrary precision datatype with a representation of Infinite), but that is a new feature and goes beyond making RDF easier.
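The tool warning suggested among the actions above could be as simple as the following Python sketch. It checks whether a lexical xsd:double form is exactly representable as IEEE 754 binary64; the function name `needs_rounding` is hypothetical, and a check for xsd:float would additionally have to round to binary32.

```python
from decimal import Decimal


def needs_rounding(lexical: str) -> bool:
    """True if the lexical form cannot be represented exactly as binary64."""
    exact = Decimal(lexical)  # exact value of the lexical form
    if not exact.is_finite():
        return False  # NaN and infinities are mapped without rounding
    # float() rounds to the nearest binary64 value (or to infinity on
    # overflow); Decimal(float) expands that binary value exactly.
    return Decimal(float(lexical)) != exact


for lexical in ["0.5", "0.1", "1e400"]:
    if needs_rounding(lexical):
        print(f'warning: "{lexical}"^^xsd:double is not stored exactly; '
              f"consider xsd:decimal")
```

With these inputs the sketch warns about "0.1" and "1e400" but not about "0.5", which is a power of two and therefore exact.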
To make this issue more actionable, here are a few more details, some thoughts about requirements, and a solution sketch.
Problem
- For the datatypes xsd:float and xsd:double, multiple lexical representations get mapped to the same value using rounding. For example, "0.1"^^xsd:float gets mapped to 0.100 000 001 4... rather than to 0.1. This fools data curators into believing they state precise numbers when they actually state slightly different values.
- xsd:float and xsd:double force compliant implementations to use floating point arithmetic, or to use rounded input values for a calculation with arbitrary-precision decimal arithmetic. xsd:decimal forces fully compliant implementations to use arbitrary-precision decimal arithmetic, or forces minimally conforming implementations to preserve a precision of at least 16 digits (one more than double precision floating point arithmetic guarantees). Even popular implementations (e.g. Virtuoso) fail to comply with this. The precision actually needed for calculations is a matter of the application problem, not of the data used. However, RDF requires data curators to make a decision about it. Currently, RDF restricts the selection of the arithmetic that is reasonable for a problem, which might make compliant implementations less efficient, harder or impossible to write (e.g. due to hardware capabilities, response time constraints and language/library support), or less precise than required.
- Syntactic sugar in JSON-LD, Turtle, TriG and SPARQL, as well as the missing support in xsd:decimal for infinite values, NaN (see e.g. OM issue 57) and exponential notation, tempt data curators to use xsd:float and xsd:double and thereby to distort the stated values.
For a more detailed description of the problem refer to The Problem with XSD Binary Floating Point Datatypes in RDF (talk recording).
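Why the 16 digit minimum cannot be met by simply storing xsd:decimal values as doubles can be seen from a single 16-digit integer. The Python sketch below uses the `decimal` module as a stand-in for xsd:decimal and `float` for binary64; the chosen number is 2^53 + 1, the smallest positive integer that binary64 cannot represent.

```python
from decimal import Decimal

sixteen_digits = "9007199254740993"  # 2**53 + 1, 16 significant digits

as_decimal = Decimal(sixteen_digits)  # keeps all 16 digits
as_double = float(sixteen_digits)     # rounds to the nearest binary64 value

print(as_decimal)          # 9007199254740993
print(Decimal(as_double))  # 9007199254740992
```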
Requirements
A couple of requirements follow from these problems:
- Avoid partial coverage of lexical spaces by value spaces to avoid ambiguity and to not fool data curators.
- Do not restrict the choice of an arithmetic with the data.
- Permit exponential notation for arbitrary precise numbers.
- Existing data can be used by new software.
- Existing distorted data get fixed.
- Enable an explicit binary representation of IEEE 754 binary32 (float) or IEEE 754 binary64 (double) values that cannot be misinterpreted as decimal numbers.
Solution Draft
As a basis for discussion I would like to propose the following (challenging/maybe unrealistic) list of changes to address the problem:
1. Add exponential notation to the lexical space of xsd:decimal.
   - Already implemented in e.g. Blazegraph.
2. Add NaN, -Inf, Inf, and +Inf to the lexical space and value space of xsd:decimal.
3. Relax the minimal 16 digits constraint for xsd:decimal on minimally conforming implementations.
4. Add datatype …:HexFloat with lexical space 0x00000000 to 0xffffffff / 0xFFFFFFFF and value space of IEEE 754 binary32.
5. Add datatype …:HexDouble with lexical space 0x0000000000000000 to 0xffffffffffffffff / 0xFFFFFFFFFFFFFFFF and value space of IEEE 754 binary64.
6. Interpret non-integer numbers in JSON-LD as xsd:decimal instead of xsd:double.
   - Permitted according to ECMA-404.
   - Permitted according to RFC 8259:
     "This specification allows implementations to set limits on the range and precision of numbers accepted. Since software that implements IEEE 754 binary64 (double precision) numbers [IEEE754] is generally available and widely used, good interoperability can be achieved by implementations that expect no more precision or range than these provide, in the sense that implementations will approximate JSON numbers within the expected precision. A JSON number such as 1E400 or 3.141592653589793238462643383279 may indicate potential interoperability problems, since it suggests that the software that created it expects receiving software to have greater capabilities for numeric magnitude and precision than is widely available."
     Summarized: Expect non IEEE 754 binary64 values to get approximated.
   - Possible due to points 1, 2 and 3.
7. Interpret numbers in exponential notation in Turtle as xsd:decimal instead of xsd:double. Possible due to point 1.
8. Interpret numbers in exponential notation in TriG as xsd:decimal instead of xsd:double. Possible due to point 1.
9. Interpret numbers in exponential notation in SPARQL as xsd:decimal instead of xsd:double. Possible due to point 1.
10. Interpret explicitly typed xsd:float and xsd:double literals as xsd:decimal. Possible due to points 1 and 2.
11. Deprecate xsd:float and xsd:double. Possible due to points 6 to 10.
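Two of the proposed changes can be sketched in Python. Neither part describes an existing RDF implementation: the datatypes …:HexFloat and …:HexDouble are only proposed here, and the sketch assumes their lexical form is "0x" followed by the big-endian IEEE 754 bit pattern (8 hex digits for binary32, 16 for binary64). The helper names are made up for this example. For the JSON part, Python's `json` module stands in for a JSON-LD processor; its `parse_float` hook allows reading non-integer numbers as decimals.

```python
import json
import struct
from decimal import Decimal


def to_hex_float(value: float) -> str:
    """Hypothetical HexFloat lexical form: the binary32 bit pattern in hex."""
    return "0x" + struct.pack(">f", value).hex()


def to_hex_double(value: float) -> str:
    """Hypothetical HexDouble lexical form: the binary64 bit pattern in hex."""
    return "0x" + struct.pack(">d", value).hex()


def from_hex_float(lexical: str) -> float:
    """Decode a hypothetical HexFloat lexical form (either letter case)."""
    (value,) = struct.unpack(">f", bytes.fromhex(lexical[2:]))
    return value


# The bit pattern cannot be mistaken for the decimal number 0.1.
print(to_hex_float(0.1))   # 0x3dcccccd
print(to_hex_double(0.1))  # 0x3fb999999999999a

# Reading non-integer JSON numbers as decimals instead of doubles.
document = '{"mass": 0.1, "large": 1E400}'
as_doubles = json.loads(document)
as_decimals = json.loads(document, parse_float=Decimal)

print(as_doubles["large"])   # inf
print(as_decimals["large"])  # 1E+400
print(as_decimals["mass"] == Decimal("0.1"))  # True
```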
Compatibility Considerations
Old implementations with new data:
- might fail to parse xsd:decimal literals with exponential notation
- might fail to parse xsd:decimal literals with NaN, -Inf, Inf, or +Inf
- cannot parse …:HexFloat literals
- cannot parse …:HexDouble literals
New implementations with old data:
- value of xsd:float and xsd:double literals might slightly change
  - in most cases, this removes value distortion / improves data quality
Old implementations interacting with new/upgraded implementations:
- might fail due to xsd:float or xsd:double literals in a SPARQL query result that turn into xsd:decimal literals
This would of course not be the easiest change to the RDF standards, especially as it also touches the XML standards. But I think it is important to address this to make RDF a reliable framework for the representation of numeric data. What do you think about it? (e.g. @afs, @VladimirAlexiev, @gkellogg, @namedgraph)
@danbri might have an opinion :)