kedro
kedro copied to clipboard
Extend path-like keys in the catalog
Description
In the current implementation of the parsing of the context, the function _convert_paths_to_absolute_posix
— which converts relative to absolute paths — restrict the conversion to three types of keys, namely
# only check a few conf keys that are known to specify a path string as value
conf_keys_with_filepath = ("filename", "filepath", "path")
This is very restrictive and can lead to unexpected issues while implementing custom data set as explained in the example below.
Context
Let's imagine that I want to extend the SQLQueryDataSet
by using a Jinja template and feed it with parameters. I would have done something like the following
class TemplatedSQLQueryDataSet(SQLQueryDataSet):
def __init__(
self,
templated_query_filepath: str,
parameters_path: str,
credentials: Optional[Dict[str, Any]] = None,
load_args: Optional[Dict[str, Any]] = None,
fs_args: Optional[Dict[str, Any]] = None,
) -> None:
if parameters_path is not None:
self._query_parameters = JSONDataSet(filepath=parameters_path).load()
super().__init__(
"",
credentials or {},
load_args or {},
fs_args or {},
templated_query_filepath,
)
def _load(self) -> pd.DataFrame:
load_args = copy.deepcopy(self._load_args)
engine = self.engines[self._connection_str] # type: ignore
if self._filepath:
load_path = get_filepath_str(PurePosixPath(self._filepath), self._protocol)
with self._fs.open(load_path, mode="r") as fs_file:
template = fs_file.read()
load_args["sql"] = Template(source=template).render(**self._query_parameters)
return pd.read_sql_query(con=engine, **load_args)
While this works when running a pipeline or using kedro ipython
, it potentially breaks when used in a notebook under notebooks
with a catalog entry like
whatever:
type: my_project.extras.datasets.templated_sql_dataset.TemplatedSQLQueryDataSet
templated_query_filepath: data/01_raw/template.sql
parameters_path: data/03_primary/parameters.json
because the relative paths are not properly parsed when loading the context.
One can solve this particular example by replacing the two paths by filepath
and path
for example which belong to the list of parsed keys mentioned at the top of this issue. In terms of user experience, it is nevertheless pretty unintuitive. Another possibility is to have nested elements like "template": {"filepath": ...}
.
Possible Implementation
Could we use something like a regex containing path
instead of a rigid list?
Thanks in advance!
Hi @PetitLepton, we haven't heard this request before, but we'd be happy to accept a contribution for it 🙂
Coming back to this. The offending code is
https://github.com/kedro-org/kedro/blob/e8f1bfd72992336ec12591b49a5fa2654217472f/kedro/framework/context/context.py#L91-L92
We spoke about this at length when debating https://github.com/kedro-org/kedro/issues/2965 and related "Kedro as a library" issues.
The current hardcoded list is less than ideal, but this "magic" behavior is deeply ingrained in Kedro at the moment, and the only realistic way to move past it is to introduce variables in the filepaths so that the user is in control:
whatever:
type: my_project.extras.datasets.templated_sql_dataset.TemplatedSQLQueryDataSet
filepath: ${kedro:project_root}/data/01_raw/template.sql
I think the proposed solution, namely
Could we use something like a regex containing path instead of a rigid list?
Goes in the opposite direction of expanding this behavior.
If you want to avoid mangling there's a workaround which is not using filepath
. If you do want some special mangling, you can potentially achieve it using a custom resolver (haven't actually verified it, so if it turns out it's impossible, would be useful to know).
This probably deserves some more discussion.
Hi @PetitLepton, @astrojuanlu - I believe #3561 resolves this issue and gets rid of the hardcoding. We now use relative paths across the board, does the example above still have issues?
I think this issue is still there:
https://github.com/kedro-org/kedro/blob/f54c6fbcdb4284c6d8ef15de20f7ff719fa281ec/kedro/framework/context/context.py#L93-L94
Recently I saw a user doing lots of
dataset:
type: ...
filepath: ${base_path}/dir/file.parquet
so it's a pattern that our users are putting in the wild now.
Instead of "let's extend or make configurable the keys that receive a special treatment", I think we should find ways to support something like this:
whatever:
type: my_project.extras.datasets.templated_sql_dataset.TemplatedSQLQueryDataSet
filepath: ${kedro:project_root}/data/01_raw/template.sql
which would also solve @PetitLepton original request in https://github.com/kedro-org/kedro/issues/1934#issue-1410111292.
This essentially goes back to option 3 in our original debate https://github.com/kedro-org/kedro/issues/2924#issuecomment-1690280993, imitating what Intake does https://intake.readthedocs.io/en/latest/catalog.html#yaml-format, and @yetudada's question here https://github.com/kedro-org/kedro/issues/2819#issuecomment-1744787202 which we addressed by adding more docs https://github.com/kedro-org/kedro/issues/3247.