
Keep better track of when the next update is due

Open pabloarosado opened this issue 10 months ago • 4 comments

Problem & impact

We don't have an easy, accurate way to know when a specific dataset should be updated. This information is crucial for prioritizing and planning dataset updates.

Background

We currently have update_period_days as a dataset metadata field in garden steps. But this field alone doesn't give us an unambiguous way to calculate the expected update date:

  • The ETL dashboard simply adds version + update_period_days to estimate the update date. But we often carry out "minor updates" of datasets, which makes that estimate unreliable.
  • We could calculate the date of update based on the publication date of a dataset's origins. But when there are multiple origins, it's unclear which one to choose. Given that we often have auxiliary datasets, like population, this is probably not a valid option.
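For reference, the dashboard estimate described above can be sketched as follows. This is a minimal illustration, not the dashboard's actual code: the function name `estimate_update_due` is hypothetical, and it assumes the version is an ISO date string such as "2024-03-27".

```python
from datetime import date, timedelta


def estimate_update_due(version: str, update_period_days: int) -> date:
    """Estimate the due date as version date + update_period_days.

    Assumption: `version` is an ISO date string (YYYY-MM-DD).
    """
    return date.fromisoformat(version) + timedelta(days=update_period_days)


# A dataset versioned 2024-03-27 with a yearly update period.
print(estimate_update_due("2024-03-27", 365))  # 2025-03-27
```

If a minor update is published under a newer version string, this estimate moves with it, even though nothing about the source's release schedule has changed.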

Possible solution

We could have a new metadata field, date_update_due (or some other name) in all steps. By default, it could be set in the snapshot metadata, and propagated automatically to other steps. In a garden step, this date could be manually updated if required.
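A rough sketch of how that default-plus-override behaviour could look. `StepMeta` and `resolve_due_date` are hypothetical names used only for illustration; they are not existing ETL classes or functions.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class StepMeta:
    """Hypothetical, simplified stand-in for a step's metadata."""

    name: str
    date_update_due: Optional[date] = None


def resolve_due_date(step: StepMeta, upstream: Optional[StepMeta]) -> Optional[date]:
    """A date set manually on the step wins; otherwise inherit from upstream."""
    if step.date_update_due is not None:
        return step.date_update_due
    return upstream.date_update_due if upstream is not None else None


snapshot = StepMeta("snapshot", date(2025, 3, 1))
meadow = StepMeta("meadow")  # nothing set: inherits from the snapshot
garden = StepMeta("garden", date(2025, 1, 15))  # manually overridden

meadow.date_update_due = resolve_due_date(meadow, snapshot)
garden.date_update_due = resolve_due_date(garden, meadow)
```

Here the meadow step ends up with the snapshot's date, while the garden step keeps its manual override.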

From the ETL dashboard, it would be convenient (e.g. for Ed) to be able to update this date manually.

Open questions

  • How should we think about data availability vs appetite to update?
    • These things are often out of sync, e.g. there is new data for a low-prio dataset, or there is no new data for a high-prio dataset when we expected it
  • How should we backfill it when it's not set?
    • Propagate it from the snapshot onwards
    • Merge it using the rule "soonest" of available dates
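The "soonest" merge rule could look something like this when a step has several upstream dependencies. `merge_due_dates` is a hypothetical helper, and treating a missing date as "unknown" (ignored rather than propagated) is an assumption.

```python
from datetime import date
from typing import Iterable, Optional


def merge_due_dates(dates: Iterable[Optional[date]]) -> Optional[date]:
    """Return the soonest known due date, or None if none is set."""
    known = [d for d in dates if d is not None]
    return min(known) if known else None


# Two upstream steps with dates, one without.
merged = merge_due_dates([date(2025, 6, 1), None, date(2025, 2, 1)])
print(merged)  # 2025-02-01
```

One consequence of this rule is that an auxiliary dataset with a short update cycle would pull the merged date forward for everything that depends on it.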

Technical notes

  • Since it's about adding a new metadata field, there will be secondary things to do (adding to schemas, Table, etc)

Alternatives

  • Keep what we have, but manually change update_period_days when we want to say that something is due sooner/later, and add update_period_days at the snapshot level

pabloarosado avatar Mar 27 '24 14:03 pabloarosado

➡️ We agreed to discuss it in the next Monday data meeting

larsyencken avatar Mar 28 '24 10:03 larsyencken

We said "nice to have" since we do have a system right now based on automatic GitHub issues, although it has limitations.

@edomt Feel free to bump this up in priority, according to how important improving the planning is on your end.

larsyencken avatar Apr 11 '24 09:04 larsyencken

I agree with "nice to have"; the current system of automatic GitHub issues, plus the ETL dashboard showing version + update_period_days, seems good enough for the medium term.

edomt avatar Apr 11 '24 09:04 edomt

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jun 10 '24 23:06 stale[bot]

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Aug 11 '24 03:08 stale[bot]