dask [Rough Idea] Remove HighLevelGraph/Layer.cull by simplifying Layer

After beginning to push on #9117 (and #9159) for a bit, I decided it might be worthwhile to investigate ways to improve the existing HighLevelGraph/Layer design. In addition to the serialization pains that will hopefully be resolved by dask/distributed#6028, my impression is that a key problem with HLGs is that it is unnecessarily difficult to implement a new Layer. Others my disagree with my perspective on this, but I suspect that this has a lot to do with (1) Layer inheriting from Mapping, and (2) the requirement to implement a cull method that avoids materialization.

I propose that we consider moving away from Layer(Mapping), and make the Layer design a bit simpler. For example, it may make sense to boil the base Layer class down to three primary methods: get_output_keys, subgraph, and subgraph_dependencies (where subclasses would only be required to implement get_output_keys and subgraph):

class Layer:
    """Base Layer Class"""

    def get_output_keys(self) -> Set:
        """Return all possible output keys"""
        raise NotImplementedError

    # NEW
    def subgraph(self, keys: Set) -> Mapping:
        """Return the local subgraph required to produce ``keys``"""
        raise NotImplementedError

    # NEW
    def subgraph_dependencies(
        self,
        output_keys: Set,
        subgraph: Mapping,
        layer_dependencies: Mapping[str, Layer],
    ) -> Mapping[str, Set]:
        """Return external key dependencies for this layer"""
        from dask.core import keys_in_tasks
        
        # Default implementation ignores `output_keys`, but
        # custom Layers may know the dependency keys from
        # the output keys alone
        return {
            dep: keys_in_tasks(
                set(dep_layer.get_output_keys()),
                [subgraph],
            ) for dep, dep_layer in layer_dependencies.items()
        }

Using this design, HLG-to-dict “materialization” can easily capture culling automatically. I started experimenting with this in rjzamora:avoid-cull (although that branch still uses Layer(Mapping) for now), and I found that ensure_dict can be revised so that culling is applied without any HighLevelGraph.cull optimization. Also, if I understand correctly, overriding subgraph_dependencies for Blockwise would effectively resolve #8570 (the linear scaling of the current culling approach).

To summarize, I propose that we consider moving away from Layer(Mapping) and HighLevelGraph/Layer.cull, and adopt a simpler Layer API. The boiled down API I have in mind would require (1) a method to materialize the subgraph for a specific set of output keys, and (2) a method to return all possible output keys for that layer. I expect the primary benefits to be simplified Layer development, and improved culling performance. However, simplifying the Layer API may also make it more tractable to introduce the necessary collection information and/or metadata we ultimately need for general HLG optimizations.

NOTE: Please feel free to poke holes in this idea as aggressively as you want :)

Jun 24 '22 20:06 rjzamora

cc @ian-r-rose - Who seems fond of Mapping :)

Jun 24 '22 20:06 rjzamora

I like contracts :)

Jun 24 '22 20:06 ian-r-rose

dask dask copied to clipboard

[Rough Idea] Remove HighLevelGraph/Layer.cull by simplifying Layer

dask
dask copied to clipboard