etl icon indicating copy to clipboard operation
etl copied to clipboard

🎉 Create function to import run from a previous version of a step

Open pabloarosado opened this issue 11 months ago • 6 comments

Summary

Create a helper function to import the run function from a previous version. This could let us avoid duplicating a lot of code, every time we update a data step with no changes in the code.

Motivation

The ETL currently relies on having a recipe for each dataset. But this means that we end up having many files with the exact same content.

Having duplicated code is not only bad when refactoring, or upgrading libraries. It is also inconvenient when searching for previous code. It can be quite confusing when many files have the same content.

I think it's very valuable to have easy access to multiple versions of a dataset, and its code. But, because of the need to duplicate code, one may be less keen to keep different versions, and prefer to use "latest" (there are other reasons why "latest" is convenient, but that's out of the scope of this PR).

Solution

This PR would let us have data steps with the following content:

from etl.helpers import load_run_from_previous_version

run = load_run_from_previous_version(__file__, "2024-01-01")

This way, we would avoid duplicated code. And it would also make it more evident when different versions of a dataset have exactly the same data processing.

Possible issues

  • Would the new step be loading the correct dependencies? I've checked that, indeed, the new run module imports the correct dependencies stated in the dag (and not the old ones).
  • I've also checked that, if the old step imports other modules within the same folder (e.g. shared.py), the new step still runs well.
  • (Unlikely) If one accidentally archived the old file, the new step would fail to run. We could find a way to raise an error when this happens, instead of waiting for ETL to fail. But even if ETL fails, it would not be a big deal (we would easily find the issue).
  • If one edits the old file, the new step will be changed. But (1) one has no direct way to be aware of that, and (2) ETL will not consider the new step as "dirty".
  • What if I just want to edit the new metadata? I've checked that, using load_run_from_previous_version and having a new metadata file inside the new folder works as desired: The old code runs with the new dependencies and loads the new metadata. This works as long as the metadata is loaded with create_dataset, and not imported directly with paths.metadata_file (see next point).
  • What if I just want to edit the new country harmonization? Unlike with metadata, a new .countries.json file in the new folder is ignored. I think that the underlying reason is that the paths = PathFinder(__file__) object defined in the old file is still loaded in the new file, and therefore all paths are the old ones.

Given these caveats, with the current implementation, load_run_from_previous_version can safely be used in those cases where the code is exactly duplicated (including metadata and countries). In any other case, if someone wants to edit minor things, it may or may not work, so it's safest to duplicate code and edit it.

pabloarosado avatar Mar 13 '24 09:03 pabloarosado

I'd kill for being able to import from other versions... I think it'd be also useful to load other (helper) functions from the module. For instance

from etl.helpers import load_module_from_previous_version

last_version = load_module_from_previous_version(__file__, "2024-01-01")
last_version.my_func()

also, having an optional module name would be nice in case it differs from the actual module (ran into when trying to import dummy.py from dummy_monster.py).

This is definitely an improvement over the current status quo, but I'd try to brainstorm if we could do even better than this and use native imports, i.e. having a way to run from ..2024-01-01 import run. Some adhoc ideas:

  • Can we name folders v2024_01_01 instead of 2024-01-01?
  • Perhaps we could create symlinks from importable paths (e.g. that v2024_01_01) to our current paths
  • ?

cc @larsyencken if you have any ideas

We obviously don't have to do this now, so I'm up for merging this workaround.

Marigold avatar Mar 14 '24 11:03 Marigold

I don't think I've understood this well enough, although we did talk about it.

The problem it's solving is the inconvenience of having duplicated code and having to identify the right version when searching in VSCode, if I've understood correctly.

The problem it introduces is that you can no longer delete a step, since other steps might be using its code, and that if you're using another step's code, you now depend on that code and have to include it in your manifest for checksumming the step. That seems like quite a downside to swallow.

@Marigold What's the use case here that you'd kill for?

larsyencken avatar Mar 14 '24 16:03 larsyencken

you now depend on that code and have to include it in your manifest for checksumming the step

Oh right, I totally forgot. That changes everything then.

What's the use case here that you'd kill for?

I might have been too dramatic with that. If checksum wasn't the problem (which it is...) then I'd find it really valuable for creating new versions of data and DRYing a lot of code. I guess we can still symlink shared.py module from previous version.

Marigold avatar Mar 14 '24 17:03 Marigold

I think you hit upon my preferred approach, if we are keen on sharing more code: put all the logic in a shared.py module. If it needs to be shared even more broadly, push shared.py up the tree, and catch it in checksums. That would only be a minor change to the checksum logic.

If it's above the version folders, then you need to import it from the other side (from etl.data.steps.garden.faostat import shared).

It still has the side-effect that if you archive a step, it's no longer an isolated and standalone set of code.

Overall, I lean to feeling that DRY is less important than reproducibility, so I still lean towards just having multiple copies of code around for older versions of datasets.

larsyencken avatar Mar 15 '24 07:03 larsyencken

I have moved this discussion to the issue: https://github.com/owid/etl/issues/2419

pabloarosado avatar Mar 15 '24 10:03 pabloarosado

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar May 18 '24 03:05 stale[bot]