owid-catalog-py

feature: Properly track metadata and processing of each variable

Open pabloarosado opened this issue 1 year ago • 0 comments

Very often when we do simple operations on a variable, the metadata disappears. We need to:

  1. Ensure the metadata is inherited properly (when possible), e.g. if `tb["c"] = tb["a"] + tb["b"]`, the new variable `c` should have the union of the sources and licenses of `a` and `b`.
  2. Keep a log of all processing done to a variable, e.g. "variable loaded from table ...", "variable c created as the sum of variables a and b", etc.
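
To make the two requirements concrete, here is a minimal sketch in plain Python. It does not use the owid-catalog API: the `Variable` dataclass, the `union` helper and `add_variables` are all hypothetical names introduced only for illustration, and carrying over the parents' logs into the new variable's log is an assumption rather than a decided design.

```python
from dataclasses import dataclass, field


@dataclass
class Variable:
    """Hypothetical stand-in for a table column with metadata (not the owid-catalog class)."""
    name: str
    values: list
    sources: list = field(default_factory=list)
    licenses: list = field(default_factory=list)
    processing_log: list = field(default_factory=list)


def union(first, second):
    """Order-preserving union of two lists, dropping duplicates."""
    merged = list(first)
    for item in second:
        if item not in merged:
            merged.append(item)
    return merged


def add_variables(a, b, name):
    """Sum two variables element-wise, inheriting metadata and logging the operation."""
    return Variable(
        name=name,
        values=[x + y for x, y in zip(a.values, b.values)],
        # The new variable gets the union of the parents' sources and licenses.
        sources=union(a.sources, b.sources),
        licenses=union(a.licenses, b.licenses),
        # Assumption: the parents' logs are kept, followed by one new entry.
        processing_log=union(a.processing_log, b.processing_log)
        + [f"variable {name} created as the sum of variables {a.name} and {b.name}"],
    )


a = Variable("a", [1, 2], sources=["Source X"], licenses=["CC BY 4.0"],
             processing_log=["variable a loaded from table t"])
b = Variable("b", [10, 20], sources=["Source X", "Source Y"], licenses=["CC BY 4.0"],
             processing_log=["variable b loaded from table t"])
c = add_variables(a, b, "c")
```

In this sketch `c` ends up with both sources, a single deduplicated license, and a three-entry log whose last entry records the sum. In the real library the same logic would presumably live in the arithmetic operators of the variable class, so that `tb["a"] + tb["b"]` triggers it without an explicit function call.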

I started implementing this logic in this branch (and created a PR), but more work is needed to make the changes robust and to include additional logic and features.

I also created an etl branch to test these changes on a simple dataset. We may decide to delete this etl branch in the future if things change significantly.

Once these features are implemented, we would need to ensure that all active ETL steps work without any modification, and check that they don't take much longer to run. To migrate to a workflow where we properly handle metadata and keep a processing log, we could start by adding a default processing log to each variable in ETL, with three entries: "variable loaded from table ...", "data processing", and "variable saved to table ...". Then, whenever a step is updated, its code could be refactored to build the processing log properly.
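
As a rough illustration of the default processing log, the placeholder could be built like this. The function name `default_processing_log` and the table names are hypothetical; only the three entry texts come from the proposal above.

```python
def default_processing_log(source_table, target_table):
    """Build the three-entry placeholder log for a variable in a step not yet refactored."""
    return [
        f"variable loaded from table {source_table}",
        "data processing",  # Generic entry, to be replaced by detailed entries later.
        f"variable saved to table {target_table}",
    ]


# Hypothetical table names, for illustration only.
log = default_processing_log("table_in", "table_out")
```

When a step is later refactored, the generic "data processing" entry would be replaced by the specific operations actually applied to the variable.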

pabloarosado, May 25 '23 10:05