kedro
Lazy Loading of Catalog Items
Hi everyone,
Currently, running modular pipelines requires loading all catalog datasets before pipeline execution, which, to me, does not really seem ideal...
Let's assume a simple project, with 2 modular pipelines, A and B.
If B requires datasets created via API calls or interactions with remote databases, then, when working offline, one cannot do a simple kedro run --pipeline A, since Kedro will fail at loading B's datasets... which are not needed to run A!
A bit of a pity...
A first, impulsive reaction might be to comment out those unneeded datasets in the local conf, but since comments are not evaluated, this would simply 'do nothing'...
Granted, one could simply comment out those datasets in the base config... But "messing around" with base to handle a temporary / local need seems, to me at least, to deviate from the "software engineering best practices" that Kedro aims to foster.
To avoid this, one would therefore have to create some sort of dummy dataset in the local conf... Totally doable, but not super convenient and definitely not frictionless...
I guess that such "troubles" could be avoided if datasets were loaded lazily, only when needed, or by building the DAG first and then filtering datasets that are not actually needed...
I hope that this suggestion will both sound relevant and won't represent much of a technical challenge.
Best Regards Marc
Aleksander Jaworski 1 hour ago [Kedro version: 0.18.6 currently]
Hi, I am working on a sort of 'pipeline monorepo' where I have dozens of pipelines. I have a question: would some sort of lazy configuration validation be a useful feature for Kedro? I have 2 reasons for asking:
- It feels a bit cumbersome that even a simple hello_world.py will take several seconds to run when the configuration is large enough, as first you will see all the logs and all the setup will be done for the data catalog etc., none of which would actually end up being used in a hello_world.py.
- When setting up the project for someone, it is impossible to provide a credentials file with just the required credentials. In Kedro, all of them need to be filled in right now, as it is all validated at once. In a sort of lazy version, only the dependencies that follow from the pipeline would need to be evaluated.
Are there any solutions or modifications I could use to improve my approaches here? Thanks in advance! :)
I asked the user to comment out any SQLDataSet to see how it impacts performance. It shows that they contribute a significant amount of time:
- ~21 secs for kedro catalog list
- ~6 secs after commenting out the SQL and GBQ datasets
To scope the discussion better:
- Maybe focus on SQLDataSet (and any dataset that opens connections at initialisation).
- This old PR changes all connections to be singletons: https://github.com/kedro-org/kedro/pull/1163
- The comment above also suggests that this creates significant overhead when running Kedro projects (particularly bigger ones): from 21 seconds down to 6 seconds after removing those datasets.
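For illustration, a minimal sketch of the lazy / shared-connection idea (class and method names here are hypothetical, not the actual kedro-datasets implementation): the engine is only created, and cached per connection string, the first time data is actually loaded or saved, so parsing the catalog stays cheap.

```python
# Hypothetical, simplified dataset: the SQLAlchemy engine is created lazily
# on first load/save and shared per connection string, so constructing the
# dataset never opens a database connection. Not the actual kedro-datasets code.
from typing import Any, Dict

import pandas as pd
from sqlalchemy import create_engine


class LazySQLTableDataset:
    _engines: Dict[str, Any] = {}  # shared engines, keyed by connection string

    def __init__(self, table_name: str, credentials: Dict[str, Any]):
        self._table_name = table_name
        self._con = credentials["con"]  # only stored; no connection made yet

    @classmethod
    def _get_engine(cls, con: str):
        # Created on first use and reused afterwards (one engine per con).
        if con not in cls._engines:
            cls._engines[con] = create_engine(con)
        return cls._engines[con]

    def load(self) -> pd.DataFrame:
        return pd.read_sql_table(self._table_name, con=self._get_engine(self._con))

    def save(self, data: pd.DataFrame) -> None:
        data.to_sql(self._table_name, con=self._get_engine(self._con), if_exists="replace")
```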
Hey @m-gris! Which dataset were you facing an issue with on https://github.com/kedro-org/kedro/issues/2829? I think I've "fixed" it in https://github.com/kedro-org/kedro-plugins/pull/281 (just for pandas.SQLTableDataSet; I haven't applied it to other instances yet). If it's that one, maybe you can try it out and see if it fixes your issue. 🙂 In the meantime, I'll work on adding it across other datasets.
To test, I modified Spaceflights to make companies a pandas.SQLTableDataSet:

# conf/base/catalog.yml
companies:
  type: pandas.SQLTableDataSet
  credentials: db_credentials
  table_name: shuttles
  load_args:
    schema: dwschema
  save_args:
    schema: dwschema
    if_exists: replace

# conf/local/credentials.yml
db_credentials:
  con: postgresql://scott:tiger@localhost/test

This dataset doesn't exist, so kedro run fails, but kedro run --pipeline data_science still works (if you've previously generated model_input_table). (You still need to have the dependencies for the dataset installed, e.g. psycopg2.)
Confirmed offline with @m-gris that https://github.com/kedro-org/kedro-plugins/pull/281 works for him, so I'll get it ready for review when I get a chance! :)
Hi everyone,
Apologies for my silence / lack of responsiveness, and thanks to Deepyaman for reaching out and coming up with a fix for this issue.
I'm happy to confirm that it does work, in the sense that I can run pipeline A, even if some sql datasets needed for pipeline B can’t be loaded because I’m offline / without access to the DB.
However... With further tests I noticed something that, to me at least, seems not ideal:
The con is still needed even if not used.
I know… I'm being a bit picky.
But here is the case I have in mind:
A team member working on a new modular pipeline C created a new SQL dataset in the catalog.
However, since this is a pipeline I do not work on, I do not have a valid con to provide.
Granted, I could hack my way around it by passing a dummy uri... But this feels a bit odd / not ideal...
Instead of putting garbage in the conf, wouldn't it be better to "simply" have the ability to leave con empty until I really do need it?
Thanks in advance, Regards M
Hi again @m-gris! Based on just my evaluation, I think what you're requesting is feasible, and looks reasonably straightforward. In DataCatalog.from_config(), the parsed dataset configuration is materialized (see https://github.com/kedro-org/kedro/blob/0.18.12/kedro/io/data_catalog.py#L299-L301). Instead of this, it should be possible to just store the dataset configuration and construct the dataset instance the first time it's referenced (using a lazy-loading approach, like you suggest).
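A very rough sketch of that idea (not the real DataCatalog internals; the factory argument stands in for however Kedro builds a dataset object from its config entry):

```python
# Rough sketch only, not the actual DataCatalog API: keep the parsed config
# around and build each dataset object the first time it is requested.
from typing import Any, Callable, Dict


class LazyCatalog:
    def __init__(
        self,
        config: Dict[str, Dict[str, Any]],
        factory: Callable[[str, Dict[str, Any]], Any],
    ):
        self._config = config                # raw catalog entries, as parsed from YAML
        self._factory = factory              # stand-in for however a dataset is built from config
        self._datasets: Dict[str, Any] = {}  # materialized datasets, built on demand

    def get(self, name: str) -> Any:
        if name not in self._datasets:
            # The dataset (and any connection it might open) is only
            # constructed when a node actually needs it.
            self._datasets[name] = self._factory(name, self._config[name])
        return self._datasets[name]
```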
However... I do think the change could have a significant effect, potentially breaking things (e.g. plugins) that parse the data catalog. I'm almost certain this would be a breaking change that would have to go in 0.19 or a future minor release. Additionally, users may benefit from the eager error reporting, but perhaps the lazy loading could be enabled as an option (reminiscent of the discussion about find_pipelines() in https://github.com/kedro-org/kedro/issues/2910).
Let me request some others' views on this, especially given that a lot of people have been looking at the data catalog lately.
P.S. Re "I could hack my way around by passing a dummy uri... But this feels a bit odd / not ideal...": my first impression was, this was what we would do back when I was primarily a Kedro user myself. We'd usually define a dummy uri with the template in the base environment (e.g. postgresql://<username>:<password>@localhost/test), and team members would put the actual credentials following the template in their local environment (e.g. postgresql://deepyaman:5hhhhhh!@localhost/test). So, I don't 100% know if I find it that odd myself. That being said, that's also a specific example, and I don't think it more broadly addresses the lazy loading request, hence I'm just including it as an aside. :)
Thx for your answer. Good point regarding the potential for breaking changes.
Looking forward to what others will have to say :)
As far as I understand, the whole problem boils down to the connection being constructed (for SQL or DB-related datasets) when the DataCatalog is materialised.
This looks like a dataset issue to me, and we should fix the datasets instead of making DataCatalog load lazily. Regarding the problem of con, isn't this simply because the dataset assumes credentials always come with a con key?
Would this be enough?
self._connection_str = credentials.get("con", None)
As far as I understand, the whole problem boils down to the connection being constructed (for SQL or DB-related datasets) when the DataCatalog is materialised.
This would be addressed by my open PR. I think there are definite benefits to doing this (e.g. not unnecessarily creating connections well before they need to be used, if at all).
This looks like a dataset issue to me, and we should fix the datasets instead of making DataCatalog load lazily. Regarding the problem of con, isn't this simply because the dataset assumes credentials always come with a con key? Would this be enough?
self._connection_str = credentials.get("con", None)
Nit: There's also a block further up, to validate the con argument, that would need to be moved or turned off.
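For concreteness, a simplified sketch of what that could look like (hypothetical class name, not the actual pandas.SQLTableDataSet code), with the con check deferred to the first real use:

```python
# Simplified sketch: `con` becomes optional at construction time, and the
# check is deferred to the moment the dataset is actually used.
from typing import Any, Dict, Optional


class SQLTableDatasetSketch:
    def __init__(self, table_name: str, credentials: Optional[Dict[str, Any]] = None):
        credentials = credentials or {}
        self._table_name = table_name
        self._connection_str = credentials.get("con", None)  # may be left empty

    def load(self):
        if not self._connection_str:
            # Validation moved from __init__ to the first real use.
            raise ValueError(
                f"'con' is required to load '{self._table_name}' but was not provided."
            )
        ...  # connect and read the table here
```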
But, more broadly, I don't know how I feel about this. I feel like providing con is required (in the dataset credentials, but it's also a positional argument for pd.read_sql_table()). Why would con be a special case where we delay validating that it is provided? What if I don't know where my colleague put some local CSV file for a pandas.CSVDataSet; why does my catalog validation fail if I don't have a filepath to provide?
We are trying to isolate catalogs within Kedro based on pipelines. What this means is: pipeline A should use catalog A, pipeline B should use catalog B. I have used Kedro with multiple catalogs, but it seems like all catalog files (with the catalog prefix) get read in regardless of which pipeline is actually run.
More context/background: The DE pipeline is set up to read from an Oracle database and create data extracts. The only way to go ahead is with data extracts, as we have READ-ONLY access on the database and the DE pipeline needs to do some transformations before the data is ready for DS. Now once these transformations are done, these extracts are ready for use by DS. This means that they don't really need to use the DE pipeline again, and they do run only their specific pipeline. However, even while doing this, the entire catalog ends up being read. One of the catalog entries is the Oracle database and it requires setting up credentials in the local environment, and setting up the required dependencies for cx_Oracle/oracledb. DS doesn't need these and they do not want to invest time or effort in having to set this up, so they want to isolate their catalog.
Posting at the request of @astrojuanlu for feature prioritization (if actually determined to be a needed feature)
Thanks a lot @m-gris for opening the issue and @rishabhmulani-mck for sharing your use case.
One quick comment: is using environments a decent workaround? https://docs.kedro.org/en/stable/configuration/configuration_basics.html#how-to-specify-additional-configuration-environments One environment that contains the "problematic"/"heavy" datasets, and another one that is isolated from it. Then one could do kedro run --env=desired-env and avoid the accidental loading of the database connection entirely.
However, even while doing this, the entire catalog ends up being read. One of the catalog entries is the Oracle database and it requires setting up credentials in the local environment, and setting up the required dependencies for cx_Oracle/oracledb. DS doesn't need these and they do not want to invest time or effort in having to set this up, so they want to isolate their catalog.
@rishabhmulani-mck I think we all agree that it's a problem to connect to all of the databases upon parsing the catalog, which would be resolved by something like https://github.com/kedro-org/kedro-plugins/pull/281. I can finish this PR.
There is a second question of con being a required field. Would your DS care if a dummy con was in the base environment (thereby satisfying the constraint of having a valid config, but it won't be loaded if https://github.com/kedro-org/kedro-plugins/pull/281 is implemented)?
@astrojuanlu @noklam my opinion is to split this up, and finish fleshing out https://github.com/kedro-org/kedro-plugins/pull/281 (I believe it's an improvement regardless, and can be added for all the datasets with a DB connection), and decide in the meantime whether to support anything further. Unless it's just me, I think making con essentially optional is more contentious.
That's fine with me, and it is an improvement regardless. We can keep this ticket open but merge the PR you made.
I don't think we have spent enough time thinking about how to solve this in a generic way for all datasets beyond database connections. The reason I'm saying this is because if we come up with a generic solution (like "let's not instantiate the dataset classes until they're needed"?) then maybe https://github.com/kedro-org/kedro-plugins/pull/281 is not even needed.
I'd like to hear @merelcht's perspective when she's back in a few days.
Since my slack message was quoted above, I'll add my 2 cents for what it's worth.
I have in the end solved it also using a templated uri string, but was left slightly unsatisfied, as there is just "no need" to construct everything (or there doesn't seem to be) and it impacts both performance and ease of use.
Having the whole configuration be valid is perhaps desirable, but it is kind of annoying not to be able to run an unrelated pipeline during development. "Why would X break Y???" was my personal reaction. So in this case I would love to also see lazy loading of only the necessary config components, but I am not that knowledgeable about Kedro and what sort of things it'd break.
That's fine with me, and it is an improvement regardless. We can keep this ticket open but merge the PR you made.
Great! I will unassign myself from this issue (and its larger scope), but finish the other PR when I get a chance. :)
Linking https://github.com/kedro-org/kedro-plugins/issues/281, which is a partial solution to this issue.
Currently the catalog is lazily instantiated on a session run:
https://github.com/kedro-org/kedro/blob/0fd1cac9e9211960a893ffc3602bee3fd8df9241/kedro/framework/session/session.py#L417-L420
(not sure why this is not using context.catalog and reaching for an internal method, but this is a side question)
It should be possible to add an argument to context._get_catalog(..., needed=pipeline.data_sets()) (if it won't create any weird implications) that will filter the conf_catalog before it creates the DataCatalog object with the list of datasets from the pipeline.
https://github.com/kedro-org/kedro/blob/0fd1cac9e9211960a893ffc3602bee3fd8df9241/kedro/framework/context/context.py#L269
Something like this:
conf_catalog = {k: conf_catalog[k] for k in conf_catalog if k in needed}
Parameter names and points of interjection should be more carefully examined though, since this is just a sketch solution. The good news is that it's a non-breaking one, if we implement it.
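Expanding that sketch slightly (still illustrative only; the needed argument and helper name are hypothetical):

```python
# Still just a sketch: a possible helper for filtering the parsed catalog
# config down to the datasets the selected pipeline actually references.
from typing import Any, Dict, Optional, Set


def filter_conf_catalog(
    conf_catalog: Dict[str, Dict[str, Any]], needed: Optional[Set[str]] = None
) -> Dict[str, Dict[str, Any]]:
    if needed is None:
        return conf_catalog  # current behaviour: materialize everything
    # Entries not referenced by the pipeline are dropped, so their
    # credentials and connections are never touched.
    return {name: cfg for name, cfg in conf_catalog.items() if name in needed}


# e.g. needed = set(pipeline.data_sets()) for the pipeline being run
```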
(not sure why this is not using context.catalog and reaching for an internal method, but this is a side question)
@idanov I think that's because save_version and load_version are only available in the session but not in the context? Sure, we can pass all this information down to KedroContext and I don't see a big problem with that. The weird bit may be that save_version is also the session_id, and it makes more sense to have it in KedroSession.
Noted that @ankatiyar added save_version and load_version to DataCatalog for data factories (this is relatively new), so the catalog actually has sufficient information to infer the version now and doesn't need to rely on the context/session anymore.
https://github.com/kedro-org/kedro/pull/2635
This is really something needed when the codebase grows, or when we do a monorepo with many pipelines that need/could be orchestrated independently with different scheduling contexts (for example: training, evaluation and inference pipelines).
Currently, when our codebase grows, we create multiple packages inside the same Kedro project (src/package1, src/package2, ...), which means multiple Kedro contexts, one context per "deployable" pipeline. So we have a "space" to manage configs that belong to the same execution context. However, this increases our cognitive context switching and the duplication of boilerplate code.
I wonder if the proposed solution solves the problem entirely, as it's not just about initializing datasets lazily, but somehow about materializing the datasets lazily. The catalog will still be entirely materialized from YAML to Python objects by the config_loader. Kedro will still ask for some globals that are not filled but not really needed by the selected pipeline (from the user's perspective).
Maybe pushing the namespace concept a little further could solve this. But for now, the proposed solution below (lazily instantiating datasets) is a big step toward pipeline modularity and monorepos, and has the merit of being non-breaking. I'm looking forward to it.
It should be possible to add an argument to context._get_catalog(..., needed=pipeline.data_sets()) (if it won't create any weird implications) that will filter the conf_catalog before it creates the DataCatalog object with the list of datasets from the pipeline.
https://github.com/kedro-org/kedro/blob/0fd1cac9e9211960a893ffc3602bee3fd8df9241/kedro/framework/context/context.py#L269
Something like this:
conf_catalog = {k: conf_catalog[k] for k in conf_catalog if k in needed}
https://github.com/kedro-org/kedro-plugins/pull/281 addressed lazy loading of database connections. Should we keep this issue open for future work on lazy loading of catalog items in general? @merelcht
kedro-org/kedro-plugins#281 addressed lazy loading of database connections. Should we keep this issue open for future work on lazy loading of catalog items in general? @merelcht
Yeah, we should; I'm not really sure why this got closed in #281, as that was linked to #366 and not this...
We're hitting this issue again, as we tried to deploy two pipelines that have slightly different configs. When we run pipeline A, Kedro keeps asking for a credential that is only used in pipeline B.
Currently we're giving fake credentials to the target orchestration platform so the pipeline can run. But this introduces some debt into our credentials management in the orchestration platform.
Adding this to the "Redesign the API for IO (catalog)" for visibility.
Noting that this was also reported as #3804.
I believed the original issue was resolved by fixing the lazy connection in the SQL-related datasets already. There seems to be some other issue raised by @takikadiri.
It wasn't, generic datasets are still eagerly loaded
We're hitting this issue again, as we tried to deploy two pipelines that have slightly different configs. When we run pipeline A, Kedro keeps asking for a credential that is only used in pipeline B.
Currently we're giving fake credentials to the target orchestration platform so the pipeline can run. But this introduces some debt into our credentials management in the orchestration platform.
@takikadiri We also run a monorepo, with several overlapping projects/pipelines. We segment the projects by subfolder, both in source code and in parameters, pipelines and the data catalog. We control this via env vars that are read in settings.py (catalog and parameters) and pipeline_registry.py (pipelines). There are also universal/general catalog and parameters that are loaded for all projects/pipelines. This way, a session/context is created with only the catalog and parameters that are needed for the pipeline we want to run, and a developer working within one project cannot break another project.
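Roughly, the settings.py side of this looks something like the following (simplified and illustrative; the environment variable name and config patterns are placeholders rather than an exact setup):

```python
# settings.py sketch: an environment variable selects a project subfolder,
# and the config loader only picks up that subfolder's catalog/parameters
# plus the shared ("common") ones.
import os

from kedro.config import OmegaConfigLoader

PROJECT = os.environ.get("ACTIVE_PROJECT", "project_a")  # hypothetical variable

CONFIG_LOADER_CLASS = OmegaConfigLoader
CONFIG_LOADER_ARGS = {
    "config_patterns": {
        "catalog": ["catalog_common*", f"{PROJECT}/catalog*"],
        "parameters": ["parameters_common*", f"{PROJECT}/parameters*"],
    }
}
```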
However, we have the opposite problem when it comes to testing: we cannot iterate through different sessions to test the makeup of the catalogs, because subsequent sessions created in code (confirmed different session ids, etc.) for some reason still contain the first context. See: #4087