etl icon indicating copy to clipboard operation
etl copied to clipboard

Snapshot changes are ignored by ETL

Open pabloarosado opened this issue 9 months ago • 2 comments

Following this discussion, I realised that, if you make changes to a snapshot (either the .py or the .dvc file) and re-run ETL, no changes are detected.

I have also noticed that this happens since this PR, where we changed how checksums are calculated. When I run ETL prior to that commit, snapshot changes do trigger ETL runs.

I don't know if this behavior is expected, or if it's just a bug, but I think the old behavior is preferrable. I remember we had some discussions comparing the old and new checksum procedure, and the old one seemed more robust (although more computationally costly). I'm still unsure of the benefits of the new procedure, but I'll leave this up to @Marigold .

pabloarosado avatar May 03 '24 11:05 pabloarosado

Thanks for raising this!

Some additional thoughts after looking at it.

It looks like changes in Snapshot are only detected if the outs[0].md5 field in the Snapshot metadata changes. This hash only changes if the user re-runs the snapshot step manually, right? AFAIK this hash is based on the output generated from Snapshot.

So, this leaves various changes in Snapshot undetected:

  • If a user changes a field in the dvc file of the Snapshot (e.g. citation_full), the hash won't capture this and therefore the snapshot step won't be detect as changed. I believe this changed indeed in https://github.com/owid/etl/pull/2514.
  • If the user changes the Snapshot python script and does not run it again, etl won't detect this either. I am not sure we have ever detected this changes actually, even before https://github.com/owid/etl/pull/2514.

I think that Snapshot should work similarly to Meadow, Garden, etc. ETL should be able to detect changes in snapshot scripts and metadata files (note that metadata file includes the md5 hash, which captures changes in the output snapshot file).

lucasrodes avatar May 03 '24 11:05 lucasrodes

@Marigold shall we close this issue?

lucasrodes avatar May 06 '24 08:05 lucasrodes