etl icon indicating copy to clipboard operation
etl copied to clipboard

wizard: problems and improvements for etl dashboard and step updater

Open pabloarosado opened this issue 11 months ago • 6 comments

One-liner

This tracking issue will list all problems and improvements for the ETL dashboard and StepUpdater.

Issues

  • [x] The button “Add all dependencies” in the Operations list becomes useless if one of the dependencies is population, because it suddenly adds lots of dependencies that do not need update. A possible solution would be to have a tick box to ignore population.
  • [x] The "Clear Operations list" button needs to be pressed twice to work.
  • [x] Bug: Column "step" in steps_df (from StepUpdater) contains only active steps (since they are the only needed ones). However, "direct_usages" (and probably other columns of dependencies) includes also archive steps. This means that, in the Operations list, when adding direct usages, archive steps suddenly appear (and also the dashboard raises an IndexError when trying to access archive steps in steps_df. Fixed by https://github.com/owid/etl/pull/2448
  • [x] Bug: Steps:
    1. Add the latest garden step of long_term_crop_yields to the Operations list.
    2. In the Operations list, click on "Add direct dependencies" of long_term_crop_yields.
    3. Click on "Add direct dependencies" of long_term_wheat_yields.
    4. Click on "Add direct dependencies" of faostat_qcl.
  • [x] Bug: Function that writes to dag removes the comment below a step that has been updated. For example, when updating the climate change impacts explorer latest step, it removes the comment of the next step. Fixed by https://github.com/owid/etl/pull/2482
  • [x] ~Loading and processing analytics makes VersionTracker and StepUpdater slower. And analytics are only needed for the ETL dashboard. A solution would be to add an optional flag argument to StepUpdater and VersionTracker so that analytics are loaded only optionally. Use it when using it for the dashboard, but not for updates.~ I tried doing this and the time difference was insignificant. Updating many steps (e.g. climate) is still very slow (even over 1 minute), but fetching analyltics or not doesn't make a significant difference.
  • [ ] Dependencies of the owid_co2 and owid_energy steps may appear as archivable, because those owid_* steps do not have any charts. Figure out a way to make them active (and possibly the same for other steps that are used by github repos).
  • [ ] When running step updater from the CLI in interactive mode, you can choose which .py file to update for each snapshot .dvc file. However, in wizard, that's currently not possible, and, if there are multiple .py files in the same folder of a .dvc file (and if none of the .py file names do not match the .dvc file name), then a file will be picked based on edit distance (which is prone to error). A warning is raised, but the unwanted files are duplicated anyway. We should have a way to select which .py files to update for each snapshot.

Improvements and ideas

  • [x] It would be useful to be able to interact with the new steps after they have been created. One possibility would be to have a checkbox "Replace steps in operations list with updated ones". Alternatively, add another operations list below the update button for the new steps.
  • [x] Add a button below the Operations list to remove non-updateable steps from the list.
  • [x] Add a button below the Operations list to run ETL on all steps in the Operations list.
  • [x] owid/owid-issues#1435
  • [ ] Add a button to each entry in the operations list to edit the metadata of the ETL step.
  • [ ] Instead of running updater and etl under the hood, it would be good to run it step by step, and have a progress bar (and log?) shown. If that's not possible, at least it would be good to show the logs in the terminal that runs the dashboard (otherwise, you may see something running for a long time, without knowing what's going on).
  • [ ] Pablo A suggested that updated "latest" steps should be moved to the bottom of the dag, instead of their dependencies being updated in place.
  • [ ] Add button to restart dashboard. Unselecting and selecting grapher and explorer steps is not enough. The button is useful, for example, after doing an update (without using the dashboard). Currently, the only way to update the data shown in the dashboard is to restart the wizard.

pabloarosado avatar Mar 04 '24 08:03 pabloarosado

Hi! I am starting to update World Bank PIP. I selected all the grapher/explorer steps containing world_bank_pip, I added them to the Operations list, then I added all dependencies for the three steps I selected and then I updated the 7 resulting steps.

The files were generated, but the dag was updated not the way I expected:

  • The garden dependency from the explorer step was replaced, I assume because the platform always considers a date by default and not latest. image

  • The updated steps were added at the bottom without the explorer step mentioned above: image

  • The step updater deleted the comment to distinguish another set of steps from the World Inequality Database: image

The step updater also adds the steps right next to the previous dependencies. It would be nice to have a line jump between them: image

I think the new folders and files work fine, so I only need to correct the dag. Thank you very much!

paarriagadap avatar Mar 27 '24 13:03 paarriagadap

The step updater gets rid of the wizard formatting, which might be clearer to read. image image

paarriagadap avatar Mar 27 '24 13:03 paarriagadap

Hi @paarriagadap, thanks a lot for reporting these issues! The issue of removing comments is already known (listed above). I'll try to fix that soon.

If I understand correctly, the issues are related to the formatting of the written files (either the dag or the snapshot metadata files), but the step updater is behaving as expected. In other words, the code generated by the tool is correct, although not great in terms of style. Is that right?

Currently, the step updater either (1) writes new steps in the dag, or (2) overwrites dependencies of existing steps. Steps with "latest" version correspond to case (2) because the step already exists, and its dependencies need to be updated. So you would prefer those updated "latest" steps to be moved the bottom of the dag, as if they were new steps. Is that right?

Please let me know if I misunderstood your issues. Thanks!

pabloarosado avatar Mar 27 '24 13:03 pabloarosado

Hi @pabloarosado, yes, it's mostly formatting and that is better that the latest steps go to the bottom (or it's replicated in both old and new steps). Thank you!

A tiny one I found now is that I had an additional script to extract the data from PIP in the snapshot folder (pip_api.py here), and it wasn't copied to the new step, while there are shared.py scripts in garden that were carried over to the new steps.

paarriagadap avatar Mar 27 '24 13:03 paarriagadap

Thanks for the clarification, I added the suggestion about "latest" steps to the list of improvements.

Regarding pip_api.py file, I think that's expected. The expected scenario is that there is only one code file per snapshot. In this case, that file is actually not generating snapshot, so it's a bit of a special case. Maybe we can figure out a solution for it, but I guess it's very uncommon.

pabloarosado avatar Mar 27 '24 14:03 pabloarosado

should we allocate time for this during this cycle, @pabloarosado ?

lucasrodes avatar Jun 13 '24 12:06 lucasrodes