read to standard numpy data types, python types etc.
I ran into something a bit funny relative to my expectations.
The type of a numpy array when read back in is not a subtype of, say, np.ndarray; the top-level dict structure is not a plain Python dict; etc.
>>> print(type(md), type(md["z"]), isinstance(md["z"], np.ndarray))
<class 'asdf.tags.core.AsdfObject'> <class 'asdf.tags.core.ndarray.NDArrayType'> False
I get why this happens, but it breaks the ability of data to "round trip" to and from files and come back unchanged, and it makes it harder to pass ASDF data to other tools.
Is there an option, or could one be added, to read the whole tree back into non-ASDF types?
This operation would not allow lazy loading, etc., which is the expected behavior.
Thanks for opening the issue.
You're spot on that currently some values get converted to different classes during a write/read cycle.
The top level dictionary in an AsdfFile becomes an AsdfObject instance (this is a dict subclass). This is in part because it gets a custom tag and becomes associated with the asdf-1.1.0 schema. ASDF also adds a few keys (asdf_library and history). These are used to track file origin, etc.
Nested dictionaries (with default options) roundtrip as dictionary instances.
asdf.AsdfFile({"d": {"a": 1}}).write_to("foo.asdf")
af = asdf.open("foo.asdf")
assert type(af["d"]) is dict
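Since the top-level object is a dict subclass, one possible workaround is to copy the tree into built-in containers after reading. The following is only a minimal sketch: TaggedDict is a hypothetical stand-in for a dict subclass such as AsdfObject, and to_plain is an illustrative helper, not part of asdf. It does not remove the extra keys ASDF adds (asdf_library and history), and it leaves array values untouched.

```python
from collections.abc import Mapping


class TaggedDict(dict):
    """Hypothetical stand-in for a dict subclass such as AsdfObject."""


def to_plain(node):
    """Recursively copy mappings, lists, and tuples into built-in containers."""
    if isinstance(node, Mapping):
        # Any mapping (including dict subclasses) becomes a plain dict.
        return {key: to_plain(value) for key, value in node.items()}
    if isinstance(node, list):
        return [to_plain(item) for item in node]
    if isinstance(node, tuple):
        return tuple(to_plain(item) for item in node)
    # Scalars, strings, arrays, etc. are returned unchanged.
    return node


tree = TaggedDict(d=TaggedDict(a=1), seq=[TaggedDict(b=2)])
plain = to_plain(tree)
print(type(plain) is dict, type(plain["d"]) is dict, plain == tree)
```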
As you noted, arrays are read as NDArrayType instances (which is not an ndarray subclass) in part to allow "lazy loading".
One thing we are currently considering is adding load and dump functions (https://github.com/asdf-format/asdf/discussions/1873). As load would not have access to an open file, it couldn't be lazy, and having it return ndarray instead of NDArrayType seems most usable. For the top level object I'd have to look at what returning a dict instead of an AsdfObject might mean. I think it should be fine.
Do you think load (if it returned ndarray and dict for the top level object) would work for your use case?
Yes, load and dump would do the trick for sure, but I am confused as to why those are needed in the first place.
Maybe it goes back to this. The API mixes ideas around the AsdfFile being a file pointer versus an in-memory data format that can serialize itself.
Reads in the tutorials look like:
with asdf.open("blah.asdf", "r") as fp:
    print(fp["blah"])

fp = asdf.open("blah.asdf", "r")
# do something
fp.close()
Writes look like:
af = asdf.AsdfFile(tree=data)
af.write_to(...)
Given the code that reads above, I would expect code like
with asdf.open("blah.asdf", "w") as fp:
    # no data in the file
    assert fp.tree is None
    # makes a top-level tree for you
    fp["blah"] = 10
    # allow trees to be made as needed
    fp["foo"]["bar"] = 10
    # set the tree directly
    fp.tree = {"blah": 10}
to just work as a way to write data to a file "blah.asdf".
However, it instead appears that
- the write semantics treat things as a data structure w/ serialization
- the read semantics treat AsdfFile as a file handle
- for reads, unlike a real file handle, the read API does not supply, as far as I know, a way to fully read the data, close the underlying file handle, and then use the data
I may have missed something, but the addition of load and dump to the APIs above would to me be even more confusing.
Given how much lazy loading and memory mapping are emphasized in the API, I think actually AsdfFile is meant to be an actual file handle that should be held in a context manager. If so, then the solution that makes most sense to me is to
- supply a method on AsdfFile to non-lazily read all of the data into non-asdf types (e.g., .load())
- remove the write_to API in favor of the context manager write API
Following https://github.com/asdf-format/asdf/pull/1929, lazy_load=False returns numpy.ndarray instances.
AsdfObject vs dicts for the top level mapping is more fundamental to the file format. There are several things that happen to the top level object that make a provided dictionary not round-trip (adding extension information etc). I don't see this as an issue with the python asdf implementation (which is providing this often useful information that is recorded in the file).
I'm going to close this issue since the ndarray round-tripping is addressed (with the next asdf version) and top level dictionary vs AsdfObject changes aren't planned with the current file format version. Feel free to reopen this if more discussion is helpful.