gusty icon indicating copy to clipboard operation
gusty copied to clipboard

raise error if task dependency does not exist

Open machow opened this issue 2 years ago • 3 comments

Currently, if a task dependency does not exist AFAICT gusty does not raise an error, but creates the task without unmatched dependencies. This has bitten us a couple times, when we rename tasks, or accidentally fat finger in the wrong name. Might be handy for Gusty to raise an error if it can't create a dependency!

Overall, gusty style dependencies has been super handy! Happy to dig into this if useful

machow avatar Dec 13 '21 21:12 machow

For context, this was a conscious decision. The thinking here was primarily around dynamic dependencies, where - for example - a SQL query is parsed for table names (which are in turn task names), but not all tables/tasks are located inside the same DAG. If gusty threw an error because tables/tasks were not found in the same DAG, the dynamic dependencies would not function.

For explicitly listed dependencies, I can see where this would be frustrating. These dependencies are differentiated when constructing the DAG, so there is opportunity to treat explicit dependencies vs. dynamic dependencies differently, e.g. throw errors for explicit, silently discard dynamic.

With this context, what do you think? Is there an approach you would suggest?

chriscardillo avatar Jan 03 '22 14:01 chriscardillo

Ah, thanks for laying out the situation! I didn't realize dynamic tasks were in the mix, so it's helpful to hear about.

I'm not sure I have the procedure gusty uses right based on your response, but this is what I'd imagine could fix (and is maybe roughly what happens now)?:

Within a DAG:

  • parse all files. Each parsed file returns a dict of task config (e.g. yaml topmatter).
  • instantiate airflow tasks from all files (but not dependencies). It sounds like there is a spec for airflow tasks to say their dependencies on instantiation
  • set task dependencies

machow avatar Jan 21 '22 17:01 machow

But looking closer, is there a bad thing that happens if you remove the check here?:

https://github.com/chriscardillo/gusty/blob/e2ed081658cae505bc6708ffb5b25266c917431f/gusty/building.py#L449

so there is opportunity to treat explicit dependencies vs. dynamic dependencies differently, e.g. throw errors for explicit, silently discard dynamic.

If I understand the code correctly, the issue isn't how they're treated, it's that both are allowed to fail silently.

explicit are allowed to fail here: https://github.com/chriscardillo/gusty/blob/e2ed081658cae505bc6708ffb5b25266c917431f/gusty/building.py#L449

both can fail silently here:

https://github.com/chriscardillo/gusty/blob/e2ed081658cae505bc6708ffb5b25266c917431f/gusty/building.py#L496

I would decouple the final declaration of dependencies, and collection of all task ids, from the setting of dependencies via task.set_upstream(). This way you have a big ledger of all final dependencies, and then can test / inspect / decide how to fail separately.

machow avatar Jan 21 '22 17:01 machow