Tracking issue: improve `etl-datadiff`

Open Marigold opened this issue 1 year ago • 3 comments

Suggested improvements:

  • [ ] Move datadiff to the automated staging server, which would give it persistence (so a new commit wouldn't rebuild everything from scratch)
  • [ ] When using REMOTE, efficiently pick datasets to compare based on checksums from the catalog (this is essential for moving to the automated staging server; otherwise we'd download every dataset from the catalog)
  • [ ] Use the correct base branch of data, e.g. by using git-lfs and data-catalog
  • [ ] Avoid repeating the same origins diff for multiple indicators, since this tends to clutter the output
    • This could be done by collecting all unique origins for a dataset and comparing them in a special section. Then, for variables, we'd just mention the name of the changed origin.
  • [ ] Make it very easy to diff against the previous version of a single dataset
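
A rough sketch of two of the ideas above, in case it helps the discussion: picking datasets by checksum and grouping indicators by changed origin. Everything here is hypothetical. The function names, the `{path: checksum}` mapping and the `{indicator: origin}` mapping are assumptions for illustration, not the actual catalog layout or the datadiff API.

```python
from __future__ import annotations

from collections import defaultdict


def pick_changed_datasets(local: dict[str, str], remote: dict[str, str]) -> list[str]:
    """Return dataset paths whose checksum differs between two catalogs.

    Both arguments map a dataset path to its checksum (assumed layout).
    A path present on only one side also counts as changed.
    """
    paths = set(local) | set(remote)
    return sorted(p for p in paths if local.get(p) != remote.get(p))


def group_by_origin(changed: dict[str, str]) -> dict[str, list[str]]:
    """Map each changed origin name to the indicators that reference it.

    `changed` maps an indicator name to the name of its changed origin
    (assumed layout), so each origin diff can be printed once.
    """
    groups: dict[str, list[str]] = defaultdict(list)
    for indicator, origin in sorted(changed.items()):
        groups[origin].append(indicator)
    return dict(groups)
```

With something like this, only the paths returned by `pick_changed_datasets` would need to be downloaded and compared, and the origins section could list each changed origin once, followed by the indicators that point to it.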

Marigold avatar Dec 15 '23 14:12 Marigold

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Feb 16 '24 12:02 stale[bot]

Keep it open a bit longer.

Marigold avatar Feb 16 '24 13:02 Marigold

FWIW, I am deprecating the alias `etl-datadiff` in favor of the new `etlcli diff`.

See https://github.com/owid/etl/pull/2293

lucasrodes avatar Feb 16 '24 22:02 lucasrodes

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar May 07 '24 01:05 stale[bot]

Most of these got implemented.

Marigold avatar May 07 '24 07:05 Marigold