:tada: engineering: add prefect engine when running ETL
Adds an option `--engine` for specifying which scheduler to use (default is `--engine etl`). Using `--engine prefect` orchestrates the ETL with Prefect and creates a SQLite file that can be inspected with the Prefect UI.
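To make the flag behaviour concrete, here is a minimal sketch of how the options could be parsed. It uses `argparse` purely for illustration; the parser layout, the `build_parser` and `describe_run` helpers, and the help texts are assumptions rather than the code in this PR. Only the option names and defaults come from the description.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser: the real ETL command may be wired differently.
    parser = argparse.ArgumentParser(prog="etl")
    parser.add_argument(
        "--engine",
        choices=["etl", "prefect"],
        default="etl",  # default stated in the PR description
        help="Scheduler used to run the steps.",
    )
    parser.add_argument(
        "--workers",
        type=int,
        default=1,  # a single worker is the default
        help="Number of workers used to run the steps.",
    )
    return parser


def describe_run(argv: list[str]) -> str:
    # Illustrative helper: summarise which engine and worker count were chosen.
    args = build_parser().parse_args(argv)
    return f"engine={args.engine} workers={args.workers}"


if __name__ == "__main__":
    print(describe_run(["--engine", "prefect", "--workers", "4"]))
```

Restricting `--engine` with `choices` means an unknown scheduler name fails at parse time rather than midway through a run.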
The Prefect UI runs on http://staging-site-prefect:4200/flow-runs (there's a new link from Wizard). Here's an example of a run after changing regions.
It runs steps concurrently with a single worker (default) and uses Dask with multiple workers (flag `--workers`).
Comparison to --engine etl
- Prefect doesn't interrupt other tasks on error and tries to complete as many tasks as possible
- When using multiple workers, it's easy to find the failing task and the exception (it's not as trivial from Buildkite logs)
- `structlog.info` adds colour to output, but Prefect can't decode it and prints characters like `[0m [[32m[1minfo`. This should be fixed soon in https://github.com/PrefectHQ/prefect-ui-library/pull/2582
- There's not much overhead from Prefect compared to multiprocessing
- We could leverage Dask features (e.g. memory limiting), but it's also likely that naive multiprocessing is good enough for us and we shouldn't complicate our lives (like with DVC)
My 2 cents
I'd find it very helpful for inspecting ETL runs on staging servers and in production (where I find searching through logs really annoying). We could give it a try and see in a few weeks whether it was worth it. Prefect could also be useful for automatic dataset updates, which currently exist as bash scripts run by Buildkite. That works, but as @lucasrodes suggested, we might need more flexibility.
TODO before merging
- [ ] Undo changes to regions.yml