etl
Keep better track of when the next update is due
Problem & impact
We don't have an easy, accurate way to know when a specific dataset should be updated. This information is crucial for prioritizing and planning dataset updates.
Background
We currently have update_period_days as a dataset metadata field in garden steps. But with this information alone, we don't have an unambiguous way to calculate the expected date of update:
- The ETL dashboard simply adds version + update_period_days to estimate the date of update. But we often carry out "minor updates" of datasets, so this sum is not a reliable estimate.
- We could calculate the date of update based on the publication date of a dataset's origins. But when there are multiple origins, it's unclear which one to choose. Given that we often have auxiliary datasets, like population, this is probably not a valid option.
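The dashboard estimate described above can be sketched as follows. This is only an illustration, not the dashboard's actual code: the function name estimate_update_date is hypothetical, version is assumed to be an ISO date string (YYYY-MM-DD), and the example assumes that a minor update creates a new version.

```python
from datetime import date, timedelta


def estimate_update_date(version: str, update_period_days: int) -> date:
    """Return version + update_period_days as a date.

    Assumption: version is an ISO date string such as "2023-05-01".
    """
    return date.fromisoformat(version) + timedelta(days=update_period_days)


# A dataset first published on 2023-05-01 with a yearly update period.
original = estimate_update_date("2023-05-01", 365)

# The same dataset after a hypothetical minor update that changed its version.
after_minor_update = estimate_update_date("2023-09-15", 365)

print(original, after_minor_update)
```

Under these assumptions the estimated date moves several months later after the minor update, although nothing about the source's release schedule has changed.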
Possible solution
We could have a new metadata field, date_update_due (or some other name), in all steps. By default, it could be set in the snapshot metadata and propagated automatically to other steps. In a garden step, this date could be manually updated if required.
From the ETL dashboard, it would be convenient (e.g. for Ed) to be able to update this date manually.
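A minimal sketch of how this propagation with a manual override might behave. StepMeta and propagate_due_date are hypothetical names used only for illustration; they are not part of the existing ETL metadata classes, and the step names are placeholders.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class StepMeta:
    """Hypothetical, simplified stand-in for a step's dataset metadata."""

    name: str
    date_update_due: Optional[date] = None


def propagate_due_date(parent: StepMeta, child: StepMeta) -> StepMeta:
    """Copy date_update_due from the parent unless the child sets its own."""
    if child.date_update_due is None:
        child.date_update_due = parent.date_update_due
    return child


# The date is set once in the snapshot metadata.
snapshot = StepMeta("snapshot", date(2025, 6, 1))

# A downstream step without its own date inherits the snapshot's date.
intermediate = propagate_due_date(snapshot, StepMeta("intermediate"))

# A garden step with a manually set date keeps its own value.
garden = propagate_due_date(intermediate, StepMeta("garden", date(2025, 3, 1)))
```

Treating an explicitly set value as an override keeps the default behaviour automatic while still allowing a date to be edited by hand in a single step.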
Open questions
- How should we think about data availability vs appetite to update?
  - These things are often out of sync, e.g. there is new data for a low-prio dataset, or there is no new data for a high-prio dataset when we expected it
- How should we backfill it when it's not set?
  - Propagate it from the snapshot onwards
  - Merge it using the rule "soonest" of available dates
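The "soonest" merge rule could look roughly like this for a step with several dependencies. merge_due_dates is a hypothetical helper, and representing an unset date as None is an assumption.

```python
from datetime import date
from typing import Iterable, Optional


def merge_due_dates(dates: Iterable[Optional[date]]) -> Optional[date]:
    """Return the soonest date among those that are set, or None if none is."""
    known = [d for d in dates if d is not None]
    return min(known) if known else None


# Two dependencies have a date, one does not: the earliest date wins.
merged = merge_due_dates([date(2025, 6, 1), None, date(2025, 2, 15)])

# No dependency has a date: the result stays unset.
unset = merge_due_dates([None, None])
```

One thing to watch with this rule: a frequently updated auxiliary dependency, such as population, would always supply the soonest date, so it may need to be excluded from the merge.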
Technical notes
- Since this means adding a new metadata field, there will be secondary things to do (adding it to schemas, Table, etc.)
Alternatives
- Keep what we have, but manually change update_period_days when we want to say that something is due sooner/later, and add update_period_days at the snapshot level
➡️ We agreed to discuss it in the next Monday data meeting
We said "nice to have" since we do have a system right now based on automatic GitHub issues, although it has limitations.
@edomt Feel free to bump this up in priority, according to how important improving the planning is on your end.
I agree with "nice to have"; the current system of automatic GitHub issues, and the ETL dashboard showing version + update_period_days, seems good enough for the medium term.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.