kedro icon indicating copy to clipboard operation
kedro copied to clipboard

Extend path-like keys in the catalog

Open PetitLepton opened this issue 2 years ago • 4 comments

Description

In the current implementation of the parsing of the context, the function _convert_paths_to_absolute_posix — which converts relative to absolute paths — restrict the conversion to three types of keys, namely

    # only check a few conf keys that are known to specify a path string as value
    conf_keys_with_filepath = ("filename", "filepath", "path")

This is very restrictive and can lead to unexpected issues while implementing custom data set as explained in the example below.

Context

Let's imagine that I want to extend the SQLQueryDataSet by using a Jinja template and feed it with parameters. I would have done something like the following

class TemplatedSQLQueryDataSet(SQLQueryDataSet):
    def __init__(
        self,
        templated_query_filepath: str,
        parameters_path: str,
        credentials: Optional[Dict[str, Any]] = None,
        load_args: Optional[Dict[str, Any]] = None,
        fs_args: Optional[Dict[str, Any]] = None,
    ) -> None:

        if parameters_path is not None:
            self._query_parameters = JSONDataSet(filepath=parameters_path).load()

        super().__init__(
            "",
            credentials or {},
            load_args or {},
            fs_args or {},
            templated_query_filepath,
        )

    def _load(self) -> pd.DataFrame:
        load_args = copy.deepcopy(self._load_args)
        engine = self.engines[self._connection_str]  # type: ignore

        if self._filepath:
            load_path = get_filepath_str(PurePosixPath(self._filepath), self._protocol)
            with self._fs.open(load_path, mode="r") as fs_file:
                template = fs_file.read()

        load_args["sql"] = Template(source=template).render(**self._query_parameters)
        return pd.read_sql_query(con=engine, **load_args)

While this works when running a pipeline or using kedro ipython, it potentially breaks when used in a notebook under notebooks with a catalog entry like

whatever:
  type: my_project.extras.datasets.templated_sql_dataset.TemplatedSQLQueryDataSet
  templated_query_filepath: data/01_raw/template.sql
  parameters_path: data/03_primary/parameters.json

because the relative paths are not properly parsed when loading the context.

One can solve this particular example by replacing the two paths by filepath and path for example which belong to the list of parsed keys mentioned at the top of this issue. In terms of user experience, it is nevertheless pretty unintuitive. Another possibility is to have nested elements like "template": {"filepath": ...}.

Possible Implementation

Could we use something like a regex containing path instead of a rigid list?

Thanks in advance!

PetitLepton avatar Oct 15 '22 09:10 PetitLepton

Hi @PetitLepton, we haven't heard this request before, but we'd be happy to accept a contribution for it 🙂

merelcht avatar Nov 14 '22 16:11 merelcht

Coming back to this. The offending code is

https://github.com/kedro-org/kedro/blob/e8f1bfd72992336ec12591b49a5fa2654217472f/kedro/framework/context/context.py#L91-L92

We spoke about this at length when debating https://github.com/kedro-org/kedro/issues/2965 and related "Kedro as a library" issues.

The current hardcoded list is less than ideal, but this "magic" behavior is deeply ingrained in Kedro at the moment, and the only realistic way to move past it is to introduce variables in the filepaths so that the user is in control:

whatever:
  type: my_project.extras.datasets.templated_sql_dataset.TemplatedSQLQueryDataSet
  filepath: ${kedro:project_root}/data/01_raw/template.sql

I think the proposed solution, namely

Could we use something like a regex containing path instead of a rigid list?

Goes in the opposite direction of expanding this behavior.

If you want to avoid mangling there's a workaround which is not using filepath. If you do want some special mangling, you can potentially achieve it using a custom resolver (haven't actually verified it, so if it turns out it's impossible, would be useful to know).

This probably deserves some more discussion.

astrojuanlu avatar Nov 24 '23 14:11 astrojuanlu

Hi @PetitLepton, @astrojuanlu - I believe #3561 resolves this issue and gets rid of the hardcoding. We now use relative paths across the board, does the example above still have issues?

AhdraMeraliQB avatar Feb 06 '24 11:02 AhdraMeraliQB

I think this issue is still there:

https://github.com/kedro-org/kedro/blob/f54c6fbcdb4284c6d8ef15de20f7ff719fa281ec/kedro/framework/context/context.py#L93-L94

Recently I saw a user doing lots of

dataset:
  type: ...
  filepath: ${base_path}/dir/file.parquet

so it's a pattern that our users are putting in the wild now.

Instead of "let's extend or make configurable the keys that receive a special treatment", I think we should find ways to support something like this:

whatever:
  type: my_project.extras.datasets.templated_sql_dataset.TemplatedSQLQueryDataSet
  filepath: ${kedro:project_root}/data/01_raw/template.sql

which would also solve @PetitLepton original request in https://github.com/kedro-org/kedro/issues/1934#issue-1410111292.

This essentially goes back to option 3 in our original debate https://github.com/kedro-org/kedro/issues/2924#issuecomment-1690280993, imitating what Intake does https://intake.readthedocs.io/en/latest/catalog.html#yaml-format, and @yetudada's question here https://github.com/kedro-org/kedro/issues/2819#issuecomment-1744787202 which we addressed by adding more docs https://github.com/kedro-org/kedro/issues/3247.

astrojuanlu avatar Feb 09 '24 17:02 astrojuanlu