asdf
asdf copied to clipboard
Introduce `lazy_tree` (super dictionaries)
Description
This PR adds:
-
lazy_tree
option toAsdfConfig
-
lazy_tree
argument toasdf.open
(defaults toAsdfConfig.lazy_tree
) -
Converter.lazy
attribute used to indicate if a converter supports "lazy" objects -
asdf.lazy_nodes
"lazy" container classes for list, dict, ordered dict
By default the "lazy" option is False
.
When lazy_tree
is True
and an ASDF file is opened the tagged nodes in the tree are not immediately converted to custom objects. Instead, the containers in the tree (dicts, lists, OrderedDicts) are replaced with AsdfNode
subclasses that act like these containers and convert tagged values to custom objects when they are accessed (See https://github.com/asdf-format/asdf/discussions/1705 for discussion of this feature). During conversion, if asdf encounters a Converter
that either defines lazy=False
or does not define lazy
the remainder of the branch will be converted to non-"lazy" objects and passed to the Converter
. If instead the Converter
defines lazy=True
the "lazy" object (ie a AsdfDictNode
for a dict) will be passed to the Converter
.
Checklist:
- [ ] pre-commit checks ran successfully
- [ ] tests ran successfully
- [ ] for a public change, a changelog entry was added
- [ ] for a public change, documentation was updated
- [ ] for any new features, unit tests were added
converted to draft until https://github.com/asdf-format/asdf/pull/1733#discussion_r1555834679 is addressed
The following branch of roman_datamodels
adds lazy
to the node converters (and makes some minor node changes to account for AsdfDictNode
not passing isinstance(..., dict)
etc) to allow lazy loading of roman trees:
https://github.com/spacetelescope/roman_datamodels/compare/main...braingram:roman_datamodels:lazy?expand=1
JWST regtests: https://plwishmaster.stsci.edu:8081/job/RT/job/JWST-Developers-Pull-Requests/1571/ passed with 2 unrelated (and common random) failures
romancal regtests: https://github.com/spacetelescope/RegressionTests/actions/runs/9746576095
(I ran the romancal tests with photutils==1.12.0
since 1.13.0 is currently breaking main: https://github.com/spacetelescope/romancal/pull/1291)
ran with no failures
@nden @perrygreenfield the regtests all pass with this PR (except for the 2 jwst tests that randomly and frequently fail).
If I understand how this works from the description, once I request a quantity array, all quantity arrays are loaded into memory. Is this correct?
No, there's something else, not sure what. The above comment is true for quantity arrays. For numpy arrays, it works as expected. Loading one array does not load any other arrays.
Thanks for giving it a try. What file did you use for testing? If it's a roman file things will behave differently if you're using roman_datamodels main vs the "lazy" branch linked above. I think this points to this feature (and PR) needing more documentation.
Here's a non-roman example (please let me know if you give it a try and find anything different from the example) it doesn't require any special versions of anything (except for using asdf from the source branch for this PR).
import asdf
import numpy as np
import astropy.units as u
# make 5 quantiy arrays
qs = [u.Quantity(np.zeros(3+i) + i, u.m) for i in range(5)]
# save them to an ASDF file
af = asdf.AsdfFile()
af["qs"] = qs
af.write_to("test.asdf")
# open the file with a "lazy_tree"
with asdf.open("test.asdf", lazy_tree=True) as af:
# When opened asdf always reads the first and last block
# (this is true for lazy and non-lazy trees). Since we
# are using a 'lazy_tree' only these blocks will be loaded
# and since these are lazy blocks just the headers will be read.
print("before accessing quantities")
print(f"Loaded blocks: {[b.loaded for b in af._blocks._blocks]}")
# Since we're using a 'lazy_tree' the 'qs' 'list' will be
# a special AsdfListNode object
print(f"'qs' type = {type(af['qs'])}")
# Accessing the first quantity will convert the tagged
# representation to a quantity
print(f"qs[0] = {af['qs'][0]=}")
# but no other blocks will be loaded
print(f"Loaded blocks: {[b.loaded for b in af._blocks._blocks]}")
# Accessing the second quantity will cause a block to load
print(f"qs[1] = {af['qs'][1]=}")
print(f"Loaded blocks: {[b.loaded for b in af._blocks._blocks]}")
When I run the example I get the following output:
before accessing quantities
Loaded blocks: [True, False, False, False, True]
'qs' type = <class 'asdf.lazy_nodes.AsdfListNode'>
qs[0] = af['qs'][0]=<Quantity [0., 0., 0.] m>
Loaded blocks: [True, False, False, False, True]
qs[1] = af['qs'][1]=<Quantity [1., 1., 1., 1.] m>
Loaded blocks: [True, True, False, False, True]
For the above example the "containers" (list-like AsdfListNode
and dict-like AsdfDictNode
) objects in the tree are made "lazy" (since lazy_tree=True
) and the contained objects only deserialized when they are accessed. For the above example the index 1 item in the "qs" "list" isn't converted to a quantity until it's accessed with qs[1]
. At that time asdf turns the tagged representation for index 1 into a quantity (which triggers loading the index 1 block). For the non-accessed items in the "list" (like qs[2]
) they're never converted to a quantity in the above example (so for qs[2]
the index 2 block is never loaded).
Roman files are a bit different because they use STNode
subclasses for containers. If we create a fake "roman" file:
im = roman_datamodels.maker_utils.mk_level2_image()
af = asdf.AsdfFile()
af['roman'] = im
af.write_to("roman.asdf")
If we load it with asdf.open
we'll see the following loaded blocks
>> af = asdf.open("roman.asdf", lazy_tree=True)
>> "".join(["1" if b.loaded else "0" for b in af._blocks._blocks])
'100000000000001'
This is only because we haven't accessed the "roman" key from the lazy AsdfDictNode
. Accessing "roman" (with the main branch of roman_datamodels) results in many blocks being loaded (every block that maps to a quantity):
>> af["roman"]
>> "".join(["1" if b.loaded else "0" for b in af._blocks._blocks])
'111100001101111'
This is because the converter that deserializes roman_datamodels.stnode.WfiImage
doesn't set lazy=True
(by default asdf assumes nothing in extensions is lazy). Since the converter for this object isn't lazy asdf will convert everything within the sub-tree that it hands to the converter before calling the converter (so the converter never sees a lazy node, matching the current asdf behavior). Accessing af["roman"]
triggers asdf to convert everything within the WfiImage
sub-tree.
If instead, we use the modified version of roman_datamodels (which sets lazy=True
for the converter handling WfiImage
) things are much "lazier".
>> af["roman"]
>> "".join(["1" if b.loaded else "0" for b in af._blocks._blocks])
'100000000000001'
Here accessing just the top level "roman" doesn't trigger asdf to load everything within that sub-tree (since the converter has lazy=True
) but if we access the data
1 block will be loaded (the exact one may differ depending on the order of the blocks).
>> af["roman"]["data"]
>> "".join(["1" if b.loaded else "0" for b in af._blocks._blocks])
'100000000100001'
Thanks!
I updated the lazy_tree
documentation in:
5843adaac595cc77c2ff5633d2709fcd08db6cbc
Does the updated description sound good? (emphasis added to the new part below, see the commit for the full text and context).
lazy_tree : bool, optional When
True
the ASDF tree will not be converted to custom objects when the file is loaded. Instead, objects will be "lazily" converted only when they are accessed. Note that the tree will not contain dict and list instances for containers and instead return instances of classes defined inasdf.lazy_nodes
. Since objects are converted when they are accessed, traversing the tree (like is done duringAsdfFile.info
andAsdfFile.search
) will result in nodes being converted.