Feature Request: support dataset layers

Open Sitin opened this issue 3 years ago • 0 comments

There is a handy feature in Kedro: dataset layers, which are special tags assigned to datasets according to the data engineering convention. They are quite useful in combination with Kedro-Viz or (as in my case) for creating a UI on top of Kedro and sorting/filtering datasets according to their position in a pipeline.

I think we need to support this feature and provide some way to specify the layer for pipeline checkpoints.

Currently I am subclassing KedroWings and using simple rules based on the first two characters of dataset locations:

from typing import Dict

from kedro.framework.hooks import hook_impl
from kedro.io import DataCatalog
from kedro.pipeline import Pipeline
from kedro_wings import KedroWings


class ExplicitLark(KedroWings):
    # Maps the two-digit prefix of a dataset location to its layer name.
    LAYERS: Dict[str, str] = {
        '01': 'raw',
        '02': 'intermediate',
        '03': 'primary',
        '04': 'feature',
        '05': 'model_input',
        '06': 'model',
        '07': 'model_output',
        '08': 'reporting',
    }

    @hook_impl
    def before_pipeline_run(
        self, run_params: Dict, pipeline: Pipeline, catalog: DataCatalog, name: str = None,
    ):
        # Let KedroWings register its datasets first, then assign layers.
        super().before_pipeline_run(run_params, pipeline, catalog)
        self._update_layers(catalog)

    def _update_layers(self, catalog: DataCatalog):
        # Only datasets named like "NN_..." follow the layer convention.
        for dataset_name in catalog.list(regex_search=r'^\d+_.*'):
            layer_code = dataset_name[:2]
            if layer_code in self.LAYERS:
                layer_name = self.LAYERS[layer_code]
                catalog.layers.setdefault(layer_name, set()).add(dataset_name)

I think we can add a parameter to KedroWings that accepts a dictionary mapping regular expressions to layer names, and use the default convention for XX_* datasets when it is not given.
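
To make the proposal more concrete, here is a rough sketch of how such a regular expression -> layer name mapping could be resolved. It is only an illustration and not part of KedroWings: the names `DEFAULT_LAYER_PATTERNS`, `resolve_layers` and `layer_patterns` are hypothetical, and the function works on plain dataset names so it does not depend on Kedro at all.

```python
import re
from typing import Dict, Iterable, Optional, Set

# Hypothetical default: the XX_* convention expressed as regex -> layer name.
DEFAULT_LAYER_PATTERNS: Dict[str, str] = {
    r"^01_": "raw",
    r"^02_": "intermediate",
    r"^03_": "primary",
    r"^04_": "feature",
    r"^05_": "model_input",
    r"^06_": "model",
    r"^07_": "model_output",
    r"^08_": "reporting",
}


def resolve_layers(
    dataset_names: Iterable[str],
    layer_patterns: Optional[Dict[str, str]] = None,
) -> Dict[str, Set[str]]:
    """Group dataset names by layer; the first matching pattern wins."""
    # Fall back to the XX_* convention when no mapping is supplied.
    patterns = DEFAULT_LAYER_PATTERNS if layer_patterns is None else layer_patterns
    compiled = [(re.compile(pattern), layer) for pattern, layer in patterns.items()]

    layers: Dict[str, Set[str]] = {}
    for name in dataset_names:
        for regex, layer in compiled:
            if regex.search(name):
                layers.setdefault(layer, set()).add(name)
                break  # datasets matching no pattern get no layer
    return layers
```

A `before_pipeline_run` hook could then merge the result of `resolve_layers(catalog.list(), ...)` into `catalog.layers`, much like `_update_layers` does in my subclass.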

Sitin, Dec 18 '20 10:12