
:tada: engineering: add prefect engine when running ETL

Open Marigold opened this issue 6 months ago • 13 comments

Adds an --engine option for specifying which scheduler to use (default is --engine etl). Using --engine prefect orchestrates the ETL with Prefect and creates a SQLite file that can be inspected with the Prefect UI.

The Prefect UI runs on http://staging-site-prefect:4200/flow-runs (there's a new link from Wizard). Here's an example of a run after changing regions.

With a single worker (the default) it runs steps concurrently; with multiple workers (flag --workers) it uses Dask.
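To make the flag semantics concrete, here's a minimal sketch of how the two options could be parsed and mapped to an execution mode. It only uses argparse; build_parser and describe_run are hypothetical names for illustration, not the actual functions in this PR, and the real CLI may be wired differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser: only the two flags discussed in this PR.
    parser = argparse.ArgumentParser(prog="etl")
    parser.add_argument(
        "--engine",
        choices=["etl", "prefect"],
        default="etl",
        help="Scheduler used to run the steps (default: etl).",
    )
    parser.add_argument(
        "--workers",
        type=int,
        default=1,
        help="Number of workers (default: 1).",
    )
    return parser


def describe_run(args: argparse.Namespace) -> str:
    # Hypothetical helper: summarises which execution mode the flags select.
    if args.engine == "etl":
        return f"etl engine with {args.workers} worker(s)"
    if args.workers > 1:
        return f"prefect engine, Dask with {args.workers} workers"
    return "prefect engine, concurrent steps on a single worker"


if __name__ == "__main__":
    print(describe_run(build_parser().parse_args()))
```

Under these assumptions, passing --engine prefect --workers 4 would select the Dask path, while --engine prefect alone keeps everything on a single worker.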

Comparison to --engine etl

  • Prefect doesn't interrupt other tasks on error and tries to complete as many tasks as possible
  • When using multiple workers, it's easy to find the failing task and the exception (it's not as trivial from Buildkite logs)
  • structlog.info adds colour to the output, but Prefect can't decode it and prints characters like [0m [[32m[1minfo. This should be fixed soon in https://github.com/PrefectHQ/prefect-ui-library/pull/2582
  • There's not much overhead from Prefect compared to multiprocessing
  • We could leverage Dask features (e.g. memory limiting), but it's also likely that a naive multiprocessing is good enough for us and we shouldn't complicate our lives (like with DVC)
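The "complete as many tasks as possible" behaviour from the first bullet can be illustrated without Prefect. The sketch below uses only concurrent.futures from the standard library; run_all and the step names are hypothetical, and it ignores dependencies between steps, so it's not how Prefect or the ETL implement it.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, List, Tuple


def run_all(
    steps: Dict[str, Callable[[], None]], workers: int = 2
) -> Tuple[List[str], Dict[str, Exception]]:
    """Run every step, collecting failures instead of stopping at the first one."""
    succeeded: List[str] = []
    failed: Dict[str, Exception] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map each future back to its step name so failures are easy to locate.
        futures = {pool.submit(fn): name for name, fn in steps.items()}
        for future in as_completed(futures):
            name = futures[future]
            try:
                future.result()
            except Exception as exc:
                # Record the exception and keep going with the remaining steps.
                failed[name] = exc
            else:
                succeeded.append(name)
    return sorted(succeeded), failed
```

Keeping the step name next to its exception is what makes the failing task easy to find afterwards, which is the part that's hard to do from interleaved logs.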

My 2 cents

I'd find it very helpful for inspecting ETL runs on staging servers and in production (where I find searching through logs really annoying). We could give it a try and see in a few weeks whether it's worth it. Prefect could also be useful for automatic dataset updates, which currently exist as bash scripts run by Buildkite. That works, but as @lucasrodes suggested, we might need more flexibility.

TODO before merging

  • [ ] Undo changes to regions.yml

Marigold, Jul 29 '24 07:07