ASDF handling of numpy scalar values (and scalar precision)
Based on discussions in recent PRs: https://github.com/asdf-format/asdf/pull/1474#discussion_r1148473915 and https://github.com/asdf-format/asdf/pull/1475#discussion_r1139089411
ASDF (and the ASDF standard) should likely clarify how to handle scalar values.
As a short example of what needs clarifying: suppose a user adds a scalar (non-array) `numpy.uint8` to the tree, saves, then loads the tree. Should the loaded value be a `numpy.uint8` or a plain Python `int`?
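A minimal sketch of the round trip in question (the filename is arbitrary, and the type that comes back may vary by asdf version):

```python
import asdf
import numpy as np

# Store a scalar numpy.uint8 in the tree and write it out.
af = asdf.AsdfFile({"value": np.uint8(42)})
af.write_to("scalar.asdf")

# Read it back: is the loaded value a numpy.uint8 or a Python int?
with asdf.open("scalar.asdf") as loaded:
    print(type(loaded["value"]))
```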
To summarize the discussion so far, based on contributions by @perrygreenfield:
What should the standard require?
Perhaps support for "the greatest common types for integers, floats and complex values across standard platforms... int64, float64 and complex128", with larger types possibly supported via associated metadata.
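As an illustration (not part of any proposal), here is a quick check of which numpy scalar dtypes fit safely into those three types; `numpy.can_cast` with its default "safe" rule is used as a stand-in for whatever rule the standard would actually adopt:

```python
import numpy as np

# Candidate greatest-common types from the discussion above.
common = {"int64": np.int64, "float64": np.float64, "complex128": np.complex128}

# Which scalar dtypes fit without loss? (np.longdouble is platform-dependent.)
for dt in (np.uint8, np.int32, np.uint64, np.float16, np.longdouble, np.complex64):
    fits = [name for name, target in common.items() if np.can_cast(dt, target)]
    print(np.dtype(dt).name, "->", fits or "would need metadata / a larger type")
```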
What is required for round-tripping?
"Up to the library", so for python:
- require permission to write out larger than standard values
- raise a warning when the library doesn't support the size and return a special object
- require round-tripping the special objects
- raise an exception if the standard requirement is not met
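A hypothetical sketch of the second and third options combined; `OverflowingInteger` and `load_integer` are illustrative names, not part of asdf's API:

```python
import warnings

class OverflowingInteger:
    """Hypothetical wrapper for integers outside the standard int64 range.

    A converter could round-trip instances of this class verbatim.
    """

    def __init__(self, value: int):
        self.value = value

def load_integer(raw: int):
    """Return raw unchanged if it fits int64, else warn and wrap it."""
    if -(2**63) <= raw < 2**63:
        return raw
    warnings.warn(f"{raw} exceeds the standard int64 range; "
                  "returning an OverflowingInteger")
    return OverflowingInteger(raw)
```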
How should the library handle numpy scalars that fit within the standard scalar types?
The library will return the largest (standard) type, which can violate round-tripping. Any user who expects a `numpy.uint8` to stay a `numpy.uint8` (and not be promoted to an `int`) should convert the value separately after loading.
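For example, the explicit conversion a user would apply on load (assuming the value comes back as a Python `int`):

```python
import numpy as np

# Under this proposal the tree holds a plain Python int after loading...
loaded = 42  # e.g. af["value"] after asdf.open(...)

# ...so restoring the original narrow type is the user's responsibility.
restored = np.uint8(loaded)
assert isinstance(restored, np.uint8)
```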
Some guidance is already provided in the standard for integers: https://asdf-standard.readthedocs.io/en/1.0.3/known_limits.html#literal-integer-values-in-the-tree
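To illustrate that limit (the exact exception type and message depend on the asdf version, so this is an assumption rather than documented behavior):

```python
import asdf

# Literal integers in the tree are bounded per the known-limits page;
# writing a value outside the supported range should be rejected.
af = asdf.AsdfFile({"big": 2**64})
try:
    af.write_to("big.asdf")
except Exception as exc:  # exception type varies by asdf version
    print("rejected:", exc)
```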
One thought, which I think is in partial agreement with what @perrygreenfield described, would be to say that scalars in the tree can be:
- int64
- float64
- complex128
By default, ASDF (the Python implementation) will convert any numpy scalar that fits into one of these types, violating round-tripping: a `numpy.uint8` will come back as an `int`. For any value outside this range, I'm thinking we should throw an error mentioning the limitation and directing the user to a page with more information, including a description of how to save the scalar as an ndarray (which will round-trip without the same scalar issues). We could also add an option to disable the automatic conversion from numpy scalar to builtin type (and instead throw an error).
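A sketch of that ndarray workaround (the key and filename are arbitrary):

```python
import asdf
import numpy as np

# Storing the value as a 1-element ndarray preserves the dtype,
# since arrays carry their dtype in the file.
af = asdf.AsdfFile({"value": np.array([42], dtype=np.uint8)})
af.write_to("scalar_as_array.asdf")

with asdf.open("scalar_as_array.asdf") as loaded:
    arr = loaded["value"]
    print(arr.dtype, arr[0])  # uint8 42
```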
Copying from: https://github.com/asdf-format/asdf/pull/1605#discussion_r1276449384
""" Now that light has been shined on the issue of numpy scalars, I think ASDF should work towards serializing numpy scalars differently than the built in python scalars. NEP 41 is working towards having more complicated dtypes (which would apply to scalars too). One of the main motivations for this effort is to encode units into the dtype itself (for performance). Looking towards this indicates that ASDF might want to start considering how to encode the dtype for scalars. """
Linking this to asdf-format/asdf-standard#396, a proposal to add a scalar type to the standard.
I wonder if this is another instance of trying to do too much with ASDF, similar to the argument I made about tuples previously. What sense is there in storing float16 scalars in an ASDF file? I am leaning towards thinking that if people want to deal with these in a consistent way, it is for them to address it specifically rather than for us to add things like dtypes to scalar floats. Can we just assume they are doubles and leave it at that?