etl icon indicating copy to clipboard operation
etl copied to clipboard

wizard: collect bugs / improvements

Open lucasrodes opened this issue 1 year ago • 41 comments

After #1539 you can now use the new tool wizard to generate templates for your ETL steps.

Use this issue to report bugs that you may encounter. If the issue is very complex, feel free to create a separate - and more extense - issue.

lucasrodes avatar Aug 30 '23 18:08 lucasrodes

I'm using wizard now for the first time, so I'll write here all issues I find along the way:

  • [x] When I open wizard, I get the warning:
2023-09-01 08:37:55.275 WARNING streamlit.runtime.caching.cache_data_api: No runtime found, using MemoryCacheStorageManager
['streamlit', 'run', '/Users/prosado/Documents/owid/repos/etl/apps/wizard/app.py', '--server.port', '8053', '--', '--phase', 'all', '--run-checks']
  • [x] I think the "Generate playground notebook" (in the meadow/garden steps) should be unselected by default (I don't think many of us are using it, but I may be wrong). When I unselect it and submit, I get the error:
FileNotFoundError: [Errno 2] No such file or directory: '/Users/prosado/Documents/owid/repos/etl/etl/steps/data/meadow/animal_welfare/2023-09-01/playground.ipynb'
Traceback:
File "/Users/prosado/Documents/owid/repos/etl/.venv/lib/python3.11/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 552, in _run_script
    exec(code, module.__dict__)
File "/Users/prosado/Documents/owid/repos/etl/apps/wizard/templating/meadow.py", line 212, in <module>
    os.remove(notebook_path)
  • [x] After submitting a meadow/garden step, the new files are mentioned on the new page (under "Generated files"). But I think the dag file should also be mentioned, if it has been modified.
  • [x] When loading the resulting snapshot, I was getting a strange error. I realised that snapshot.metadata.license was None, even if license was defined in the .dvc file. It took me a while to realise that the produced .dvc file had license indented too many spaces, hence becoming an item of of origin. I think they are meant to be separate fields in the snapshot metadata. Once I removed the indent of license (and its items), the issue was fixed. So, I suppose it's a bug, and wizard should not store license inside origin (we can discuss if that should be indeed the correct structure).
  • [x] Field citation_producer should be expandable, given that it can be very long (as dataset description fields are).

pabloarosado avatar Sep 01 '23 06:09 pabloarosado

These issues should be solved in https://github.com/owid/etl/pull/1567

lucasrodes avatar Sep 01 '23 14:09 lucasrodes

@Marigold @pabloarosado From our discussions and the documentation on Notion, I had assumed that we agreed that the field license should be under $.meta.origin.

However, from what Pablo is saying, it sounds like this is raising an error. As a solution for now, I've set wizard to export the field license both at $.meta.origin.license and $.meta.license levels.

Do you think this is fine?

lucasrodes avatar Sep 01 '23 14:09 lucasrodes

It took me a while to realise that the produced .dvc file had license indented too many spaces, hence becoming an item of of origin. I think they are meant to be separate fields in the snapshot metadata.

The original idea was to keep it under origin, not separate from it. license is there only for backward compatibility. Ideally, snapshot meta would have either source & license or origin & origin.license (we could refactor old files if this causes too much confusion).

Marigold avatar Sep 04 '23 05:09 Marigold

Thanks for clarifying, Mojmir. Yes, that's what I had in mind.

@pabloarosado could you share code to reproduce the error?

lucasrodes avatar Sep 04 '23 09:09 lucasrodes

walkthrough bug that might be occurring here too: https://github.com/owid/etl/issues/1571

lucasrodes avatar Sep 05 '23 10:09 lucasrodes

  • [x] I see that update_period_days is mandatory in the guideline, but it is not available by default in any of the YAML files

paarriagadap avatar Sep 11 '23 09:09 paarriagadap

Two related comments to the ones above:

  • [x] description_processing is not shown by default when walkthrough garden generates the yaml, and it's a recommended field
  • [x] description is automatically generated in the yaml file, when I see it is not used anymore

In general it would be good to match that file with the final version of the guidelines.

paarriagadap avatar Sep 11 '23 12:09 paarriagadap

Hey! Should I just reopen for a small issue then?

  • [x] The create playground notebook option in wizard doesn't generate a Jupyter Notebook

(copied from https://github.com/owid/etl/issues/1566#issuecomment-1723433161)

lucasrodes avatar Sep 18 '23 13:09 lucasrodes

It's not super clear to me whether I should be hitting return after filling in each section or not? It seems like a good thing to somewhat validate each entry, but also the form completed unexpectedly early when pressing it, meaning I skipped the last couple of sections.

spoonerf avatar Sep 18 '23 14:09 spoonerf

  • [x] When the YAML for garden is generated, there are two localhost links that don't work:
# NOTE: To learn more about the fields, hover over their names.


# Learn more about the available fields:
# http://localhost:8000/architecture/metadata/reference/dataset/
dataset:
  update_period_days: 365


# Learn more about the available fields:
# http://localhost:8000/architecture/metadata/reference/tables/

paarriagadap avatar Sep 18 '23 15:09 paarriagadap

  • [ ] The video explaining the harmonization tool is outdated:

Harmonize country names with the following command (assuming country field is called country). Check out a short demo of the tool

  • [ ] I think the options to add regions and population data available on walkthrough should also be available on wizard as well.

paarriagadap avatar Sep 18 '23 16:09 paarriagadap

@spoonerf, good point, but unfortunately, clicking on ENTER is equivalent to submitting the entire form. It will be submitted if the form is valid (i.e., all required fields are present).

lucasrodes avatar Sep 18 '23 20:09 lucasrodes

@pabloarosado

I think the options to add regions and population data available on walkthrough should also be available on wizard as well.

Which options do you refer to here? I just executed walkthrough and there aren't - at least on my end - any options to add regions or population datasets.

lucasrodes avatar Sep 18 '23 20:09 lucasrodes

@lucasrodes Maybe it was removed recently, but they were available as check boxes at the # end of walkthrough garden

paarriagadap avatar Sep 18 '23 22:09 paarriagadap

  • [x] Improve license selector: remove spurious options, add missing ones
  • [x] Have namespace as a selector, like the license. Should be based on the names of the folders in etl/steps/data/channel/

lucasrodes avatar Oct 02 '23 16:10 lucasrodes

  • [x] variable.description_processing should be printed in the dummy yaml in etl-wizard garden step (to not forget to use it when necessary)

paarriagadap avatar Oct 20 '23 09:10 paarriagadap

  • [x] In etl-wizard snapshot, in the License section, there is the ©producer year option, but it prints ©producer year and not ©{producer} {year} like it's done in origin.attribution

Found here https://github.com/owid/etl/pull/1825#discussion_r1367034367

paarriagadap avatar Oct 20 '23 15:10 paarriagadap

  • [ ] It would be nice to have the Select local file to import option with the option to navigate the OS and select the file instead of writing a path that can be long image

  • [x] When adding a year instead of a date to date_published, it prints year without quotes. I think it's with quotes to comply with the schema

paarriagadap avatar Oct 30 '23 12:10 paarriagadap

UPDATE: the current recommended workflow is based on chart-diff. So won't be addressing this issues.


  • [x] ~In the chart revisions baker, I tried to move one chart from here to here and after the balloons 🎈 : the links for live and staging (I don't use local) were not updated with the new chart in the chart revision tool.~
  • [x] ~I restarted and tried again: I got this error message:~
Something went wrong! (MySQLdb.IntegrityError) (1062, "Duplicate entry '1904-3-3-1' for key 'suggested_chart_revisions.chartId'") [SQL: INSERT INTO suggested_chart_revisions (chartId, createdBy, originalConfig, suggestedConfig, status, createdAt, updatedAt) VALUES (%s, %s, %s, %s, %s, %s, %s)] [parameters: (1904, 59, '{"id": 1904, "map": {"colorScale": {"baseColorScheme": "YlGn", "binningStrategy": "manual", "customNumericColors": [null, null, null, null, null, nul ... (2786 characters truncated) ... 71}], "sourceDesc": "The World Bank", "isPublished": true, "selectedEntityNames": ["Azerbaijan", "Antigua and Barbuda", "Rwanda", "Sudan", "Brazil"]}', '{"id": 1904, "map": {"colorScale": {"baseColorScheme": "YlGn", "binningStrategy": "manual", "customNumericColors": [null, null, null, null, null, nul ... (651 characters truncated) ... 47}], "sourceDesc": "The World Bank", "isPublished": true, "selectedEntityNames": ["Azerbaijan", "Antigua and Barbuda", "Rwanda", "Sudan", "Brazil"]}', 'pending', datetime.datetime(2023, 11, 1, 16, 19, 38, 280631), datetime.datetime(2023, 11, 1, 16, 19, 38, 280631))] (Background on this error at: https://sqlalche.me/e/14/gkpj)

I saved this draft chart to replicate (indicator here).

paarriagadap avatar Nov 01 '23 17:11 paarriagadap

  • [x] Enable setting update frequency to zero (= this dataset won't be updated anymore) [slack thread]

lucasrodes avatar Nov 29 '23 15:11 lucasrodes

  • [x] The links "Explore dataset" in etl-wizard charts don't actually work. They have the format http://localhost:8053/None/datasets/{dataset_id}/
image

paarriagadap avatar Nov 30 '23 14:11 paarriagadap

  • [x] I can't find one indicator in etl-wizard charts: when trying to migrate from [818136] $30 a day - Number not in poverty (Estimated) to [818156] $30 a day - Number not in poverty (Smoothed) I actually can't find the latter.I am using dataset [6341] uniquely (migrating in the same dataset).

EDIT: Also with the migrations

  • [818125] $1.90 a day - Number in poverty (Estimated) > [818144] $1.90 a day - Number in poverty (Smoothed)
  • [818142] $5-$10 - Number in poverty (Estimated) > [818162] $5-$10 - Number in poverty (Smoothed)

paarriagadap avatar Jan 22 '24 10:01 paarriagadap

  • [x] When option is selected not to export to any DAG, still we get the instruction image

lucasrodes avatar Feb 15 '24 10:02 lucasrodes

Probably there's not much to do, but

  • [x] Variables from datasets 489 and 6398 can't be compared in the Charts upgrader because of this error: image

I see that the csv here includes .. for some data points.

paarriagadap avatar Feb 28 '24 17:02 paarriagadap

@paarriagadap, thanks for reporting. This issue is, as you mention, because there are non-numeric values in the dataset. Currently, the explore mode only supports comparing numerical values. We could compare categorical ones in the future if we needed to.

However, in this case, I think the indicators should be numbers and .. should be NaN instead. That way, the indicator would be recognised as numeric. Also, why do we have .. there? I bet that's a legacy dataset, and we won't be fixing this (removing ..). Unfortunately, implementing some logic to replace .. so that we can compare the data is too specific and don't think we should do it.

lucasrodes avatar Feb 28 '24 22:02 lucasrodes

  • [x] (WON'T DO: SLOWS THE TOOL CONSIDERABLY) The relative difference column in the explore variable mappings option in the Charts upgrader is not colored anymore. I thought it was a nice addition to draw immediate attention to wildly different numbers. image

paarriagadap avatar Feb 29 '24 13:02 paarriagadap

Chart revision:

  • [ ] Enable some option to "load from mapping", where the user can load a json with the variable mapping
  • [x] If no OPENAI_API_KEY is available, the app will crash miserably if trying to get chatGPT suggestions. Instead, it should just show a warning and let the user continue! We need API KEY in staging servers.

lucasrodes avatar Mar 12 '24 00:03 lucasrodes

I see that some classic issues with the chart revision tool still happen occasionally:

  • The sorting in bar charts is not respected (e.g. this chart): Screenshot 2024-03-22 at 09 27 16
  • The min and max values in the timeline are not respected (e.g. this chart): Screenshot 2024-03-22 at 09 52 17

It's not urgent, and I have already manually fixed the affected cases. I just mentioned them here so we are aware that these minor things still happen.

pabloarosado avatar Mar 22 '24 16:03 pabloarosado

I'll be looking at chart revisions this week, so I will take a look if there's an easy fix for that. (Examples like that are super useful by the way)

Marigold avatar Mar 25 '24 11:03 Marigold