kedro
Lazy Loading of Catalog Items
Hi everyone,
Currently, running modular pipelines requires loading all catalog datasets before pipeline execution, which, to me, does not really seem ideal...
Let's assume a simple project, with 2 modular pipelines, A and B.
If B requires datasets created via API calls or interactions with remote databases, then, when working offline, one cannot do a simple kedro run --pipeline A, since Kedro will fail at loading B's datasets... which are not needed to run A!
A bit of a pity...
A first, impulsive reaction might be to comment out those unneeded datasets in the local conf, but since comments are not evaluated, this would simply 'do nothing'...
Granted, one could simply comment out those datasets in the base config... But "messing around" with base to handle a temporary / local need seems, to me at least, to deviate from the "software engineering best practices" that Kedro aims to foster.
To avoid this, one would therefore have to create some sort of dummy dataset in the local conf... Totally doable, but not super convenient and definitely not frictionless...
I guess that such "troubles" could be avoided if datasets were loaded lazily, only when needed, or by building the DAG first and then filtering datasets that are not actually needed...
I hope that this suggestion will both sound relevant and won't represent much of a technical challenge.
Best Regards Marc
Aleksander Jaworski 1 hour ago [Kedro version: 0.18.6 currently]
Hi, I am working on a sort of 'pipeline monorepo' where I have dozens of pipelines. I have a question: would some sort of lazy configuration validation be a useful feature for Kedro? I have 2 reasons for asking:
- It feels a bit cumbersome that even a simple hello_world.py will take several seconds to run when the configuration is large enough, as first you will see all the logs and all the setup will be done for the data catalog etc., none of which would actually end up being used in a hello_world.py.
- When setting up the project for someone, it is impossible to provide a credentials file with just the required credentials. In Kedro, all of them need to be filled in right now, as it is all validated at once. In a sort of lazy version, only the dependencies that follow from the pipeline would need to be evaluated.
Are there any solutions or modifications I could use to improve my approaches here? Thanks in advance! :)
I asked the user to comment out any SQLDataSet to see how it impacts performance. It shows that they contribute a significant amount of time:
- ~21 secs for kedro catalog list
- ~6 secs after commenting out the SQL and GBQ datasets
To scope the discussion better:
- Maybe focus on SQLDataSet (and any dataset that opens connections at initialisation).
- This old PR changes all connections to be singletons: https://github.com/kedro-org/kedro/pull/1163
- The comment above also suggests that this creates significant overhead when running Kedro projects (particularly bigger ones): from 21 seconds down to 6 seconds after removing those datasets.
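For illustration, a minimal sketch of the lazy / shared-connection idea (class and method names here are hypothetical, not the actual kedro-datasets implementation): the engine is only created, and cached per connection string, the first time data is actually loaded or saved, so parsing the catalog stays cheap.

```python
# Hypothetical, simplified dataset: the SQLAlchemy engine is created lazily
# on first load/save and shared per connection string, so constructing the
# dataset never opens a database connection. Not the actual kedro-datasets code.
from typing import Any, Dict

import pandas as pd
from sqlalchemy import create_engine


class LazySQLTableDataset:
    _engines: Dict[str, Any] = {}  # shared engines, keyed by connection string

    def __init__(self, table_name: str, credentials: Dict[str, Any]):
        self._table_name = table_name
        self._con = credentials["con"]  # only stored; no connection made yet

    @classmethod
    def _get_engine(cls, con: str):
        # Created on first use and reused afterwards (one engine per con).
        if con not in cls._engines:
            cls._engines[con] = create_engine(con)
        return cls._engines[con]

    def load(self) -> pd.DataFrame:
        return pd.read_sql_table(self._table_name, con=self._get_engine(self._con))

    def save(self, data: pd.DataFrame) -> None:
        data.to_sql(self._table_name, con=self._get_engine(self._con), if_exists="replace")
```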
Hey @m-gris! Which dataset were you facing an issue with on https://github.com/kedro-org/kedro/issues/2829? I think I've "fixed" it in https://github.com/kedro-org/kedro-plugins/pull/281 (just for pandas.SQLTableDataSet; I haven't applied it to other instances yet). If it's that one, maybe you can try it out and see if it fixes your issue. 🙂 In the meantime, I'll work on adding it across other datasets.
To test, I modified Spaceflights to make companies a pandas.SQLTableDataSet:

# conf/base/catalog.yml
companies:
  type: pandas.SQLTableDataSet
  credentials: db_credentials
  table_name: shuttles
  load_args:
    schema: dwschema
  save_args:
    schema: dwschema
    if_exists: replace

# conf/local/credentials.yml
db_credentials:
  con: postgresql://scott:tiger@localhost/test

This dataset doesn't exist, so kedro run fails, but kedro run --pipeline data_science still works (if you've previously generated model_input_table). (You still need to have the dependencies for the dataset installed, e.g. psycopg2.)
Confirmed offline with @m-gris that https://github.com/kedro-org/kedro-plugins/pull/281 works for him, so I'll get it ready for review when I get a chance! :)
Hi everyone,
Apologies for my silence / lack of responsiveness, and thanks to Deepyaman for reaching out and coming up with a fix for this issue.
I'm happy to confirm that it does work, in the sense that I can run pipeline A, even if some sql datasets needed for pipeline B can’t be loaded because I’m offline / without access to the DB.
However... With further tests I noticed something that, to me at least, seems not ideal:
The con is still needed even if not used.
I know… I'm being a bit picky.
But here is the case I have in mind:
A team member working on a new modular pipeline C created a new SQL dataset in the catalog.
However, since this is a pipeline I do not work on, I do not have a valid con to provide.
Granted, I could hack my way around it by passing a dummy uri... But this feels a bit odd / not ideal...
Instead of putting garbage in the conf, wouldn't it be better to "simply" have the ability to leave con empty until I really do need it?
Thanks in advance, Regards M
Hi again @m-gris! Based on just my evaluation, I think what you're requesting is feasible, and looks reasonably straightforward. In DataCatalog.from_config(), the parsed dataset configuration is materialized (see https://github.com/kedro-org/kedro/blob/0.18.12/kedro/io/data_catalog.py#L299-L301). Instead of this, it should be possible to just store the dataset configuration and construct the dataset instance the first time it's referenced (using a lazy-loading approach, like you suggest).
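A very rough sketch of that idea (not the real DataCatalog internals; the factory argument stands in for however Kedro builds a dataset object from its config entry):

```python
# Rough sketch only, not the actual DataCatalog API: keep the parsed config
# around and build each dataset object the first time it is requested.
from typing import Any, Callable, Dict


class LazyCatalog:
    def __init__(
        self,
        config: Dict[str, Dict[str, Any]],
        factory: Callable[[str, Dict[str, Any]], Any],
    ):
        self._config = config                # raw catalog entries, as parsed from YAML
        self._factory = factory              # stand-in for however a dataset is built from config
        self._datasets: Dict[str, Any] = {}  # materialized datasets, built on demand

    def get(self, name: str) -> Any:
        if name not in self._datasets:
            # The dataset (and any connection it might open) is only
            # constructed when a node actually needs it.
            self._datasets[name] = self._factory(name, self._config[name])
        return self._datasets[name]
```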
However... I do think the change could have a significant effect, potentially breaking things (e.g. plugins) that parse the data catalog. I'm almost certain this would be a breaking change that would have to go in 0.19 or a future minor release. Additionally, users may benefit from the eager error reporting, but perhaps the lazy loading could be enabled as an option (reminiscent of the discussion about find_pipelines() in https://github.com/kedro-org/kedro/issues/2910).
Let me request some others' views on this, especially given that a lot of people have been looking at the data catalog lately.
P.S. Re "I could hack my way around by passing a dummy uri... But this feels a bit odd / not ideal...": my first impression was, this was what we would do back when I was primarily a Kedro user myself. We'd usually define a dummy uri with the template in the base environment (e.g. postgresql://<username>:<password>@localhost/test), and team members would put the actual credentials following the template in their local environment (e.g. postgresql://deepyaman:5hhhhhh!@localhost/test). So, I don't 100% know if I find it that odd myself. That being said, that's also a specific example, and I don't think it more broadly addresses the lazy loading request, hence I'm just including it as an aside. :)
Thx for your answer. Good point regarding the potential for breaking changes.
Looking forward to what others will have to say :)
As far as I understand, the whole problem boils down to the connection being constructed (for SQL or DB-related datasets) when the DataCatalog is materialised.
This looks like a dataset issue to me, and we should fix the datasets instead of making DataCatalog load lazily. Regarding the problem of con, isn't this simply because the dataset assumes credentials always come with a con key?
Would this be enough?
self._connection_str = credentials.get("con", None)
As far as I understand, the whole problem boils down to the connection being constructed (for SQL or DB-related datasets) when the DataCatalog is materialised.
This would be addressed by my open PR. I think there are definite benefits to doing this (e.g. not unnecessarily creating connections well before they need to be used, if at all).
This looks like a dataset issue to me, and we should fix the datasets instead of making DataCatalog load lazily. Regarding the problem of con, isn't this simply because the dataset assumes credentials always come with a con key? Would this be enough?
self._connection_str = credentials.get("con", None)
Nit: There's also a block further up, to validate the con argument, that would need to be moved or turned off.
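For concreteness, a simplified sketch of what that could look like (hypothetical class name, not the actual pandas.SQLTableDataSet code), with the con check deferred to the first real use:

```python
# Simplified sketch: `con` becomes optional at construction time, and the
# check is deferred to the moment the dataset is actually used.
from typing import Any, Dict, Optional


class SQLTableDatasetSketch:
    def __init__(self, table_name: str, credentials: Optional[Dict[str, Any]] = None):
        credentials = credentials or {}
        self._table_name = table_name
        self._connection_str = credentials.get("con", None)  # may be left empty

    def load(self):
        if not self._connection_str:
            # Validation moved from __init__ to the first real use.
            raise ValueError(
                f"'con' is required to load '{self._table_name}' but was not provided."
            )
        ...  # connect and read the table here
```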
But, more broadly, I don't know how I feel about this. I feel like providing con is required (in the dataset credentials, but it's also a positional argument for pd.read_sql_table()). Why would con be a special case where we delay validating that it is provided? What if I don't know where my colleague put some local CSV file for a pandas.CSVDataSet; why does my catalog validation fail if I don't have a filepath to provide?
We are trying to isolate catalogs within Kedro based on pipelines. What this means is: pipeline A should use catalog A, pipeline B should use catalog B. I have used Kedro with multiple catalogs, but it seems like all catalog files (with the catalog prefix) get read in regardless of which pipeline is actually run.
More context/background: The DE pipeline is set up to read from an Oracle database and create data extracts. The only way to go ahead is with data extracts, as we have READ-ONLY access on the database and the DE pipeline needs to do some transformations before the data is ready for DS. Now once these transformations are done, these extracts are ready for use by DS. This means that they don't really need to use the DE pipeline again, and they do run only their specific pipeline. However, even while doing this, the entire catalog ends up being read. One of the catalog entries is the Oracle database and it requires setting up credentials in the local environment, and setting up the required dependencies for cx_Oracle/oracledb. DS doesn't need these and they do not want to invest time or effort in having to set this up, so they want to isolate their catalog.
Posting at the request of @astrojuanlu for feature prioritization (if actually determined to be a needed feature)
Thanks a lot @m-gris for opening the issue and @rishabhmulani-mck for sharing your use case.
One quick comment: is using environments a decent workaround? https://docs.kedro.org/en/stable/configuration/configuration_basics.html#how-to-specify-additional-configuration-environments One environment that contains the "problematic"/"heavy" datasets, and another one that is isolated from it. Then one could do kedro run --env=desired-env and avoid the accidental loading of the database connection entirely.
However, even while doing this, the entire catalog ends up being read. One of the catalog entries is the Oracle database and it requires setting up credentials in the local environment, and setting up the required dependencies for cx_Oracle/oracledb. DS doesn't need these and they do not want to invest time or effort in having to set this up, so they want to isolate their catalog.
@rishabhmulani-mck I think we all agree that it's a problem to connect to all of the databases upon parsing the catalog, which would be resolved by something like https://github.com/kedro-org/kedro-plugins/pull/281. I can finish this PR.
There is a second question of con being a required field. Would your DS care if a dummy con was in the base environment (thereby satisfying the constraint of having a valid config, but it won't be loaded if https://github.com/kedro-org/kedro-plugins/pull/281 is implemented)?
@astrojuanlu @noklam my opinion is to split this up, and finish fleshing out https://github.com/kedro-org/kedro-plugins/pull/281 (I believe it's an improvement regardless, and can be added for all the datasets with a DB connection), and decide in the meantime whether to support anything further. Unless it's just me, I think making con essentially optional is more contentious.
That's fine with me, and it is an improvement regardless. We can keep this ticket open but merge the PR you made.
I don't think we have spent enough time thinking about how to solve this in a generic way for all datasets beyond database connections. The reason I'm saying this is because if we come up with a generic solution (like "let's not instantiate the dataset classes until they're needed"?) then maybe https://github.com/kedro-org/kedro-plugins/pull/281 is not even needed.
I'd like to hear @merelcht's perspective when she's back in a few days.
Since my slack message was quoted above, I'll add my 2 cents for what it's worth.
I have in the end solved it also using a templated uri string, but was left slightly unsatisfied, as there is just "no need" to construct everything (or there doesn't seem to be) and it impacts both performance and ease of use.
Having the whole configuration be valid is perhaps desirable, but it is kind of annoying not to be able to run an unrelated pipeline during development. "Why would X break Y???" was my personal reaction. So in this case I would love to also see lazy loading of only the necessary config components, but I am not that knowledgeable about Kedro and what sort of things it'd break.
That's fine with me, and it is an improvement regardless. We can keep this ticket open but merge the PR you made.
Great! I will unassign myself from this issue (and its larger scope), but finish the other PR when I get a chance. :)
Linking https://github.com/kedro-org/kedro-plugins/issues/281, which is a partial solution to this issue.
Currently the catalog is lazily instantiated on a session run:
https://github.com/kedro-org/kedro/blob/0fd1cac9e9211960a893ffc3602bee3fd8df9241/kedro/framework/session/session.py#L417-L420
(not sure why this is not using context.catalog and reaching for an internal method, but this is a side question)
It should be possible to add an argument to context._get_catalog(..., needed=pipeline.data_sets()) (if it won't create any weird implications) that will filter the conf_catalog before it creates the DataCatalog object with the list of datasets from the pipeline.
https://github.com/kedro-org/kedro/blob/0fd1cac9e9211960a893ffc3602bee3fd8df9241/kedro/framework/context/context.py#L269
Something like this:
conf_catalog = {k: conf_catalog[k] for k in conf_catalog if k in needed}
Parameter names and points of interjection should be more carefully examined though, since this is just a sketch solution. The good news is that it's a non-breaking one, if we implement it.
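Expanding that sketch slightly (still illustrative only; the needed argument and helper name are hypothetical):

```python
# Still just a sketch: a possible helper for filtering the parsed catalog
# config down to the datasets the selected pipeline actually references.
from typing import Any, Dict, Optional, Set


def filter_conf_catalog(
    conf_catalog: Dict[str, Dict[str, Any]], needed: Optional[Set[str]] = None
) -> Dict[str, Dict[str, Any]]:
    if needed is None:
        return conf_catalog  # current behaviour: materialize everything
    # Entries not referenced by the pipeline are dropped, so their
    # credentials and connections are never touched.
    return {name: cfg for name, cfg in conf_catalog.items() if name in needed}


# e.g. needed = set(pipeline.data_sets()) for the pipeline being run
```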
(not sure why this is not using context.catalog and reaching for an internal method, but this is a side question)
@idanov I think that's because save_version and load_version are only available in the session but not in the context? Sure, we can pass all this information down to KedroContext and I don't see a big problem with that. The weird bit may be that save_version is also the session_id, and it makes more sense to have it in KedroSession.
Noted that @ankatiyar added save_version and load_version to DataCatalog for data factories (this is relatively new), so the catalog actually has sufficient information to infer the version now and doesn't need to rely on the context/session anymore.
https://github.com/kedro-org/kedro/pull/2635
This is really something needed when the codebase grows, or when we do a monorepo with many pipelines that need/could be orchestrated independently with different scheduling contexts (for example: training, evaluation and inference pipelines).
Currently, when our codebase grows, we create multiple packages inside the same Kedro project (src/package1, src/package2, ...), which means multiple Kedro contexts, one context per "deployable" pipeline. So we have a "space" to manage configs that belong to the same execution context. However, this increases our cognitive context switching and the duplication of boilerplate code.
I wonder if the proposed solution solves the problem entirely, as it's not just about initializing datasets lazily, but somehow about materializing the datasets lazily. The catalog will still be entirely materialized from YAML to Python objects by the config_loader. Kedro will still ask for some globals that are not filled but not really needed by the selected pipeline (from the user's perspective).
Maybe pushing the namespace concept a little further could solve this. But for now, the proposed solution below (lazily instantiating datasets) is a big step toward pipeline modularity and monorepos, and has the merit of being non-breaking. I'm looking forward to it.
It should be possible to add an argument to context._get_catalog(..., needed=pipeline.data_sets()) (if it won't create any weird implications) that will filter the conf_catalog before it creates the DataCatalog object with the list of datasets from the pipeline.
https://github.com/kedro-org/kedro/blob/0fd1cac9e9211960a893ffc3602bee3fd8df9241/kedro/framework/context/context.py#L269
Something like this:
conf_catalog = {k: conf_catalog[k] for k in conf_catalog if k in needed}
https://github.com/kedro-org/kedro-plugins/pull/281 addressed lazy loading of database connections. Should we keep this issue open for future work on lazy loading of catalog items in general? @merelcht
kedro-org/kedro-plugins#281 addressed lazy loading of database connections. Should we keep this issue open for future work on lazy loading of catalog items in general? @merelcht
Yeah, we should; I'm not really sure why this got closed in #281, as that was linked to #366 and not this...
We're hitting this issue again, as we tried to deploy two pipelines that have slightly different configs. When we run pipeline A, Kedro keeps asking for a credential that is only used in pipeline B.
Currently we're giving fake credentials to the target orchestration platform so the pipeline can run. But this introduces some debt into our credentials management in the orchestration platform.
Adding this to the "Redesign the API for IO (catalog)" for visibility.
Noting that this was also reported as #3804.
I believed the original issue was resolved by fixing the lazy connection in the SQL-related datasets already. There seems to be some other issue raised by @takikadiri.
It wasn't, generic datasets are still eagerly loaded
We're hitting this issue again, as we tried to deploy two pipelines that have slightly different configs. When we run pipeline A, Kedro keeps asking for a credential that is only used in pipeline B.
Currently we're giving fake credentials to the target orchestration platform so the pipeline can run. But this introduces some debt into our credentials management in the orchestration platform.
@takikadiri We also run a monorepo, with several overlapping projects/pipelines. We segment the projects by subfolder, both in source code and in parameters, pipelines and the data catalog. We control this via env vars that are read in settings.py (catalog and parameters) and pipeline_registry.py (pipelines). There are also universal/general catalog and parameters that are loaded for all projects/pipelines. This way, a session/context is created with only the catalog and parameters that are needed for the pipeline we want to run, and a developer working within one project cannot break another project.
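Roughly, the settings.py side of this looks something like the following (simplified and illustrative; the environment variable name and config patterns are placeholders rather than an exact setup):

```python
# settings.py sketch: an environment variable selects a project subfolder,
# and the config loader only picks up that subfolder's catalog/parameters
# plus the shared ("common") ones.
import os

from kedro.config import OmegaConfigLoader

PROJECT = os.environ.get("ACTIVE_PROJECT", "project_a")  # hypothetical variable

CONFIG_LOADER_CLASS = OmegaConfigLoader
CONFIG_LOADER_ARGS = {
    "config_patterns": {
        "catalog": ["catalog_common*", f"{PROJECT}/catalog*"],
        "parameters": ["parameters_common*", f"{PROJECT}/parameters*"],
    }
}
```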
However, we have the opposite problem when it comes to testing: we cannot iterate through different sessions to test the makeup of the catalogs, because subsequent sessions created in code (confirmed different session ids, etc.) for some reason still contain the first context. See: #4087