Tracking issue: improve `etl-datadiff`
Suggested improvements:
- [ ] Move datadiff to an automated staging server, which would give it persistence (so a new commit wouldn't rebuild everything from scratch)
- [ ] Efficiently pick datasets to compare when using `REMOTE`, based on checksums from the catalog (this is essential for moving it to an automated staging server, otherwise we'd download all datasets from the catalog)
- [ ] Use the correct base branch of data, e.g. by using `git-lfs` and `data-catalog`
- [ ] Avoid repeating the same origins diff for multiple indicators, as this tends to clutter the output
  - This could be done by collecting all unique origins for a dataset and comparing them in a special section. Then, for variables, we'd just mention the name of the changed origin.
- [ ] Make it very easy to diff against the previous version of a single dataset
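
A rough sketch of the checksum-based selection idea, in case it helps the discussion. It assumes we can get a mapping of dataset path to checksum for both the local and the remote catalog. The function name, the paths, and the checksum values are all made up for illustration and are not the actual `etl` API.

```python
def datasets_to_compare(local: dict[str, str], remote: dict[str, str]) -> list[str]:
    """Return dataset paths that need a diff.

    A dataset needs a diff if its checksum differs between the two catalogs,
    or if it exists on one side only. Unchanged datasets are skipped, so they
    never have to be downloaded.
    """
    changed = []
    for path in sorted(local.keys() | remote.keys()):
        # .get() returns None for a missing dataset, which never equals a checksum
        if local.get(path) != remote.get(path):
            changed.append(path)
    return changed


# Hypothetical checksums, for illustration only
local = {"garden/un/2023/population": "abc", "garden/who/2023/gho": "111"}
remote = {
    "garden/un/2023/population": "abc",
    "garden/who/2023/gho": "222",
    "garden/wb/2023/wdi": "333",
}
print(datasets_to_compare(local, remote))
# ['garden/wb/2023/wdi', 'garden/who/2023/gho']
```

Only the two datasets in the result would need to be fetched from the remote catalog.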
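
For the origins item, something along these lines might work. This is only a sketch: it assumes each variable carries a list of origin dicts and that an origin can be identified by an `id` field, which is a hypothetical key chosen for the example rather than the real metadata schema.

```python
from typing import Any

Origin = dict[str, Any]
Variables = dict[str, list[Origin]]  # variable name -> its origins


def index_origins(variables: Variables) -> dict[str, Origin]:
    """Collect the unique origins of a dataset, keyed by the assumed 'id' field."""
    unique: dict[str, Origin] = {}
    for origins in variables.values():
        for origin in origins:
            unique[origin["id"]] = origin
    return unique


def diff_origins(old: Variables, new: Variables) -> tuple[list[str], dict[str, list[str]]]:
    """Compare origins once per dataset, then reference them by name per variable."""
    old_origins = index_origins(old)
    new_origins = index_origins(new)
    changed = {
        oid
        for oid in old_origins.keys() | new_origins.keys()
        if old_origins.get(oid) != new_origins.get(oid)
    }
    # For each variable, only mention which changed origins it uses
    per_variable: dict[str, list[str]] = {}
    for name, origins in new.items():
        hits = sorted({o["id"] for o in origins if o["id"] in changed})
        if hits:
            per_variable[name] = hits
    return sorted(changed), per_variable


# Hypothetical metadata, for illustration only
old = {"gdp": [{"id": "wdi", "date_published": "2022"}], "pop": [{"id": "wdi", "date_published": "2022"}]}
new = {"gdp": [{"id": "wdi", "date_published": "2023"}], "pop": [{"id": "wdi", "date_published": "2023"}]}
changed, per_variable = diff_origins(old, new)
print(changed)       # ['wdi']
print(per_variable)  # {'gdp': ['wdi'], 'pop': ['wdi']}
```

The full diff of `wdi` would then be printed once in its own section, and each variable would only list the origin name.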
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Keep it open a bit longer.
FWIW, I am deprecating the alias `etl-datadiff` in favor of the new `etlcli diff`.
See https://github.com/owid/etl/pull/2293
Most of these got implemented.