Handle uncertainty in data values
Handle data uncertainty in the OPTIMaDe API.
CIF syntax allows uncertainty to be specified in parentheses following a value; for example, 1.23(3) stands for 1.23 ± 0.03. Representing this in JSON would require numbers to be stored as strings, which is not very elegant.
Again, we'd probably have to choose between the following:

```json
{ "key": { "value": 1.23, "uncertainty": 0.03 } }
```

and

```json
{ "values": { "key": 1.23 }, "uncertainties": { "key": 0.03 } }
```

and the choice depends on which way is more convenient to handle in our use-cases.
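To make the trade-off concrete, here is a small sketch of what client code would look like under each layout (the key `band_gap` is a hypothetical example, not a proposed member name):

```python
import json

# Layout A: value and uncertainty bundled together per key.
doc_a = json.loads('{"band_gap": {"value": 1.23, "uncertainty": 0.03}}')

# Layout B: parallel objects keyed identically.
doc_b = json.loads(
    '{"values": {"band_gap": 1.23}, "uncertainties": {"band_gap": 0.03}}'
)

# Layout A keeps each quantity self-contained...
v_a = doc_a["band_gap"]["value"]
u_a = doc_a["band_gap"]["uncertainty"]

# ...while layout B lets a client that ignores uncertainty read plain
# numbers, at the cost of keeping two structures in sync.
v_b = doc_b["values"]["band_gap"]
u_b = doc_b["uncertainties"].get("band_gap")  # may be absent

print(v_a, u_a, v_b, u_b)
```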
I view data value uncertainty as a 'flexibility' for data records that only a subset of databases are going to provide, and only a subset of clients are interested in.
Speaking very generally about such flexibilities, I see two primary types:
- Flexibilities of type 1: extra information that supplements a data record. I.e., the data record is still relevant if this supplementary information is left out or ignored.
- Flexibilities of type 2: extra information that is absolutely essential to correctly represent certain data records. In this case data records that require this flexibility need to be left out/disregarded by implementations that do not support it.
For both these cases, I've arrived at the position of preferring that they be handled as much as possible via extra optional members outside the other data. This keeps a straightforward 'core' format for implementations that do not care about each type of flexibility.
For flexibilities of type 1, we then just need to agree on a suitable standard format for the optional members, and databases and clients can then choose to include or ignore them as desired. Flexibilities of type 2 can be handled the same way, but also need an additional facility for a client to extract only those data records whose type-2 flexibilities it knows how to handle. A separate discussion about the same issue for the structure standard in #23 had this as a suggestion:
```
structure_features = [ <set of strings for optional features, e.g., "disordered", "assembly"> ]
```
(This format seems well-suited for formulating filters that pick out data records with/without specific features using the new HAS set-type filters in #16.)
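For concreteness, filters along these lines (a sketch using the HAS-style set operators proposed in #16; the exact syntax was still being settled) might look like:

```
structure_features HAS "disordered"
structure_features HAS ALL "disordered", "assembly"
```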
Now, specifically for the present issue of data uncertainty, I would think this is firmly a 'type-1 flexibility' (?)
In line with what I've written above, I propose that for every member named X where uncertainty makes sense (i.e., mostly float-valued members), there is an optional member named X_uncertainty that takes the same format as X, i.e., single value, list, matrix, etc. X_uncertainty gives a CIF-like uncertainty for each value as a decimal number.
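A minimal sketch of this convention, with a check that the optional member mirrors the shape of the member it annotates (the member name `lattice_parameters` here is a hypothetical example):

```python
# For a member "X", an optional "X_uncertainty" member mirrors X's
# shape: scalar for scalar, list for list, nested list for matrix, etc.
record = {
    "lattice_parameters": [3.84, 3.84, 12.25],
    "lattice_parameters_uncertainty": [0.01, 0.01, 0.05],
}

def same_shape(a, b):
    """Recursively check that b mirrors a's nesting structure."""
    if isinstance(a, list):
        return (
            isinstance(b, list)
            and len(a) == len(b)
            and all(same_shape(x, y) for x, y in zip(a, b))
        )
    return not isinstance(b, list)

# A client (or validator) can pair values with uncertainties generically.
for key, value in list(record.items()):
    unc = record.get(key + "_uncertainty")
    if unc is not None:
        assert same_shape(value, unc), f"{key}_uncertainty does not mirror {key}"
```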
One can, however, think of more sophisticated formats for the uncertainty, for example allowing different probability distributions with their own sets of parameters, ~ { "type": "gaussian", "parameters": { "standard_deviation": 0.05 } }, or being more expressive, ~ { "precision": 0.05, "accuracy": 0.1 }, etc. People who intend to use this feature should come forward with specific suggestions here.
The proposal by @rartino seems very reasonable to me.
Note: the proposal I posted in #74 specifically for a properties endpoint does include a quite serious way to indicate uncertainty specifically for property values.
However, it does not cover uncertainty in the structure specification data under structures, and I see no easy way to tie the two together.
I don't think there is a need for uncertainty elsewhere; e.g., I don't expect anything to ever be 'uncertain' in values under calculations, references, etc.
I'd like to refresh this discussion with another aspect of uncertainty. Recent work in Jmol relating to finding space groups, identifying Wyckoff positions, and carrying out proper packing depends critically upon knowing at least the general precision of fractional atom coordinates and lattice parameters. For computational work, it's relatively easy in other string-based input formats to determine that a description is 64-bit precision, because all the (important) digits are there and we are doing our own string parsing. But for experimental work, it's not as easy. For example, an atomic y-coordinate of 0.66667 for Mo in https://www.aflowlib.org/prototype-encyclopedia/CIF/AB2_hP6_194_c_f.cif is probably 2/3, as the implicit precision is only +/-0.00001. The problem for me is that a JSON parser will convert this to 0.666670000000, which of course is not 2/3.
Maybe I am just asking for suggestions on how to handle this situation with a JSON parser short of writing a specialized parser just for this purpose that would return a "JSONDouble" that contains the method getPrecision().
@BobHanson It is not uncommon (at least in Python) for JSON parsers to allow 'plugins' that get to inspect data during parsing to deserialize complex objects. You could check whether your parser allows this (or whether there is a good alternative that does) so you can parse JSON "numbers" into your own objects that keep track of the number of digits in the source file.
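In Python, for instance, `json.loads` accepts a `parse_float` hook that receives the raw token string before conversion, which is enough to keep the source precision around (a sketch; `PreciseFloat` and `get_precision` are hypothetical names, and the precision logic ignores exponent notation):

```python
import json

class PreciseFloat(float):
    """Float subclass that remembers the source string it was parsed from."""

    def __new__(cls, s):
        obj = super().__new__(cls, s)
        obj.source = s
        return obj

    def get_precision(self):
        # Number of digits after the decimal point in the source text,
        # e.g. "0.66667" -> 5, i.e. an implicit uncertainty of +/-0.00001.
        _, _, frac = self.source.partition(".")
        return len(frac)

# parse_float is called with the untouched token string for every float.
doc = json.loads('{"y": 0.66667}', parse_float=PreciseFloat)
y = doc["y"]
print(float(y), y.get_precision())  # 0.66667 5
```

The resulting objects still behave as ordinary floats in arithmetic, so downstream code that does not care about precision is unaffected.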
I know Jmol already implements a lot of symmetry handling, so maybe you already do this, but the other thing that comes to mind is that if the file comes with any other information about symmetries (e.g., space group, Wyckoff assignments, etc.), you can go through each atom that is reasonably close to a Wyckoff position and figure out whether you should nudge its coordinates onto exactly that Wyckoff position for the structure to conform to the symmetry info (and the chemical formula).
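A minimal sketch of the nudging idea, independent of any Wyckoff bookkeeping: snap a coordinate to a nearby small-denominator rational only when it lies within the coordinate's implicit precision (the `max_denominator` cutoff of 24 is an arbitrary choice for illustration):

```python
from fractions import Fraction

def snap_to_rational(x, precision, max_denominator=24):
    """Return the closest rational with a small denominator if it lies
    within the coordinate's implicit precision; otherwise return x."""
    candidate = Fraction(x).limit_denominator(max_denominator)
    if abs(x - candidate) <= precision:
        return candidate
    return x

# 0.66667 with implicit precision +/-0.00001 snaps to exactly 2/3...
y = snap_to_rational(0.66667, 0.00001)
print(y)  # Fraction(2, 3)

# ...while a generic coordinate is left untouched.
print(snap_to_rational(0.123456, 0.00001))
```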