Introduce `lazy_tree` (super dictionaries)

Open braingram opened this issue 1 year ago • 4 comments

Description

This PR adds:

  • lazy_tree option to AsdfConfig
  • lazy_tree argument to asdf.open (defaults to AsdfConfig.lazy_tree)
  • Converter.lazy attribute used to indicate if a converter supports "lazy" objects
  • asdf.lazy_nodes "lazy" container classes for list, dict, ordered dict

By default the lazy_tree option is False.

When lazy_tree is True and an ASDF file is opened, the tagged nodes in the tree are not immediately converted to custom objects. Instead, the containers in the tree (dicts, lists, OrderedDicts) are replaced with AsdfNode subclasses that act like these containers and convert tagged values to custom objects when they are accessed (see https://github.com/asdf-format/asdf/discussions/1705 for discussion of this feature). During conversion, if asdf encounters a Converter that either defines lazy=False or does not define lazy, the remainder of the branch is converted to non-"lazy" objects and passed to the Converter. If instead the Converter defines lazy=True, the "lazy" object (e.g. an AsdfDictNode for a dict) is passed to the Converter.
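
To make the convert-on-access idea concrete, here is a minimal, library-independent sketch. It is not the asdf implementation: `Tagged`, `LazyList` and `convert` are hypothetical names invented for illustration, and the real classes in asdf.lazy_nodes may behave differently (for example around slicing).

```python
from collections.abc import Sequence


class Tagged:
    """Hypothetical stand-in for a tagged node that has not been converted yet."""

    def __init__(self, tag, data):
        self.tag = tag
        self.data = data


class LazyList(Sequence):
    """Hypothetical list-like container that converts Tagged items on first access."""

    def __init__(self, items, convert):
        self._items = list(items)
        self._convert = convert

    def __len__(self):
        return len(self._items)

    def __getitem__(self, index):
        # Only integer indices are handled in this sketch (no slices).
        value = self._items[index]
        if isinstance(value, Tagged):
            value = self._convert(value)
            self._items[index] = value  # cache so conversion happens once
        return value


converted = []  # records which tags were converted, in order


def convert(node):
    converted.append(node.tag)
    return node.data * 2


nodes = LazyList([Tagged("a", 1), Tagged("b", 2), Tagged("c", 3)], convert)
first = nodes[0]                     # converts only the first item
calls_after_first = list(converted)  # snapshot of conversions so far
```

Accessing `nodes[0]` a second time returns the cached value without calling `convert` again, while iterating over the whole container converts every remaining item.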

Checklist:

  • [ ] pre-commit checks ran successfully
  • [ ] tests ran successfully
  • [ ] for a public change, a changelog entry was added
  • [ ] for a public change, documentation was updated
  • [ ] for any new features, unit tests were added

braingram avatar Jan 12 '24 17:01 braingram

converted to draft until https://github.com/asdf-format/asdf/pull/1733#discussion_r1555834679 is addressed

braingram avatar May 07 '24 15:05 braingram

The following branch of roman_datamodels adds lazy to the node converters (and makes some minor node changes to account for AsdfDictNode not passing isinstance(..., dict) etc) to allow lazy loading of roman trees: https://github.com/spacetelescope/roman_datamodels/compare/main...braingram:roman_datamodels:lazy?expand=1

braingram avatar May 13 '24 17:05 braingram

JWST regtests: https://plwishmaster.stsci.edu:8081/job/RT/job/JWST-Developers-Pull-Requests/1571/ passed with 2 unrelated (and common random) failures

romancal regtests: https://github.com/spacetelescope/RegressionTests/actions/runs/9746576095 (I ran the romancal tests with photutils==1.12.0 since 1.13.0 is currently breaking main: https://github.com/spacetelescope/romancal/pull/1291) ran with no failures

braingram avatar Jul 01 '24 15:07 braingram

@nden @perrygreenfield the regtests all pass with this PR (except for the 2 jwst tests that randomly and frequently fail).

braingram avatar Jul 01 '24 18:07 braingram

If I understand how this works from the description, once I request a quantity array, all quantity arrays are loaded into memory. Is this correct?

nden avatar Jul 03 '24 22:07 nden

No, there's something else, not sure what. The above comment is true for quantity arrays. For numpy arrays, it works as expected. Loading one array does not load any other arrays.

nden avatar Jul 03 '24 22:07 nden

Thanks for giving it a try. What file did you use for testing? If it's a roman file things will behave differently if you're using roman_datamodels main vs the "lazy" branch linked above. I think this points to this feature (and PR) needing more documentation.

Here's a non-roman example (please let me know if you give it a try and find anything different from the example). It doesn't require any special versions of anything, except for using asdf from the source branch for this PR.

import asdf
import numpy as np
import astropy.units as u

# make 5 quantity arrays
qs = [u.Quantity(np.zeros(3+i) + i, u.m) for i in range(5)]

# save them to an ASDF file
af = asdf.AsdfFile()
af["qs"] = qs
af.write_to("test.asdf")

# open the file with a "lazy_tree"
with asdf.open("test.asdf", lazy_tree=True) as af:
    # When opened asdf always reads the first and last block
    # (this is true for lazy and non-lazy trees). Since we
    # are using a 'lazy_tree' only these blocks will be loaded
    # and since these are lazy blocks just the headers will be read.

    print("before accessing quantities")
    print(f"Loaded blocks: {[b.loaded for b in af._blocks._blocks]}")

    # Since we're using a 'lazy_tree' the 'qs' 'list' will be
    # a special AsdfListNode object
    print(f"'qs' type = {type(af['qs'])}")

    # Accessing the first quantity will convert the tagged
    # representation to a quantity
    print(f"qs[0] = {af['qs'][0]=}")
    # but no other blocks will be loaded
    print(f"Loaded blocks: {[b.loaded for b in af._blocks._blocks]}")

    # Accessing the second quantity will cause a block to load
    print(f"qs[1] = {af['qs'][1]=}")
    print(f"Loaded blocks: {[b.loaded for b in af._blocks._blocks]}")

When I run the example I get the following output:

before accessing quantities
Loaded blocks: [True, False, False, False, True]
'qs' type = <class 'asdf.lazy_nodes.AsdfListNode'>
qs[0] = af['qs'][0]=<Quantity [0., 0., 0.] m>
Loaded blocks: [True, False, False, False, True]
qs[1] = af['qs'][1]=<Quantity [1., 1., 1., 1.] m>
Loaded blocks: [True, True, False, False, True]

In the above example the "containers" in the tree (the list-like AsdfListNode and dict-like AsdfDictNode objects) are "lazy" (since lazy_tree=True), and the contained objects are only deserialized when they are accessed. The index 1 item in the "qs" "list" isn't converted to a quantity until it's accessed with qs[1]. At that time asdf turns the tagged representation for index 1 into a quantity (which triggers loading the index 1 block). Items that are never accessed (like qs[2]) are never converted to a quantity, so for qs[2] the index 2 block is never loaded.

Roman files are a bit different because they use STNode subclasses for containers. First, create a fake "roman" file:

import asdf
import roman_datamodels.maker_utils

im = roman_datamodels.maker_utils.mk_level2_image()
af = asdf.AsdfFile()
af['roman'] = im
af.write_to("roman.asdf")

If we load it with asdf.open and lazy_tree=True, we'll see the following loaded blocks:

>> af = asdf.open("roman.asdf", lazy_tree=True)
>> "".join(["1" if b.loaded else "0" for b in af._blocks._blocks])
'100000000000001'

This is only because we haven't accessed the "roman" key from the lazy AsdfDictNode. Accessing "roman" (with the main branch of roman_datamodels) results in many blocks being loaded (every block that maps to a quantity):

>> af["roman"]
>> "".join(["1" if b.loaded else "0" for b in af._blocks._blocks])
'111100001101111'

This is because the converter that deserializes roman_datamodels.stnode.WfiImage doesn't set lazy=True (by default asdf assumes nothing in extensions is lazy). Since the converter for this object isn't lazy, asdf converts everything within the sub-tree before handing it to the converter (so the converter never sees a lazy node, matching the current asdf behavior). Accessing af["roman"] therefore triggers asdf to convert everything within the WfiImage sub-tree.
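
The difference between the two kinds of converter can be sketched without asdf. Everything below is hypothetical and simplified for illustration (`LazyDict`, `to_eager`, `call_converter`, a one-argument `from_yaml_tree`, and strings standing in for blocks); it only mirrors the behavior described above and is not how asdf is implemented.

```python
class LazyDict:
    """Hypothetical dict-like node whose values are resolved on access."""

    def __init__(self, raw, resolve):
        self._raw = raw
        self._resolve = resolve
        self._cache = {}

    def __getitem__(self, key):
        if key not in self._cache:
            self._cache[key] = self._resolve(self._raw[key])
        return self._cache[key]

    def keys(self):
        return self._raw.keys()


def to_eager(node):
    """Recursively replace lazy nodes with plain dicts (resolves every value)."""
    if isinstance(node, LazyDict):
        return {key: to_eager(node[key]) for key in node.keys()}
    return node


def call_converter(converter, node):
    # A converter that does not define `lazy` is treated like lazy=False.
    if not getattr(converter, "lazy", False):
        node = to_eager(node)
    return converter.from_yaml_tree(node)


loaded = []  # records which "blocks" were loaded, in order


def load_block(name):
    loaded.append(name)
    return f"array:{name}"


class EagerConverter:
    def from_yaml_tree(self, node):
        return node


class LazyConverter:
    lazy = True

    def from_yaml_tree(self, node):
        return node


raw = {"data": "block0", "err": "block1"}

eager_result = call_converter(EagerConverter(), LazyDict(raw, load_block))
loaded_after_eager = list(loaded)   # both blocks loaded up front

loaded.clear()
lazy_result = call_converter(LazyConverter(), LazyDict(raw, load_block))
loaded_after_lazy = list(loaded)    # nothing loaded yet
lazy_result["data"]
loaded_after_access = list(loaded)  # only the accessed block
```

In this sketch the non-lazy converter receives a plain dict and every block is loaded before it runs, while the lazy converter receives the lazy node and a block is loaded only when its key is accessed.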

If we instead use the modified version of roman_datamodels (which sets lazy=True for the converter handling WfiImage), things are much "lazier":

>> af["roman"]
>> "".join(["1" if b.loaded else "0" for b in af._blocks._blocks])
'100000000000001'

Here, accessing just the top-level "roman" doesn't trigger asdf to load everything within that sub-tree (since the converter has lazy=True). If we then access the data, 1 block will be loaded (the exact one may differ depending on the order of the blocks):

>> af["roman"]["data"]
>> "".join(["1" if b.loaded else "0" for b in af._blocks._blocks])
'100000000100001'

braingram avatar Jul 04 '24 01:07 braingram

Thanks!

I updated the lazy_tree documentation in: 5843adaac595cc77c2ff5633d2709fcd08db6cbc

Does the updated description sound good? (emphasis added to the new part below, see the commit for the full text and context).

lazy_tree : bool, optional
    When True the ASDF tree will not be converted to custom objects when the file is loaded. Instead, objects will be "lazily" converted only when they are accessed. Note that the tree will not contain dict and list instances for containers and instead return instances of classes defined in asdf.lazy_nodes. Since objects are converted when they are accessed, traversing the tree (like is done during AsdfFile.info and AsdfFile.search) will result in nodes being converted.

braingram avatar Jul 11 '24 19:07 braingram