etl
etl copied to clipboard
wizard: collect bugs / improvements
After #1539 you can now use the new tool wizard
to generate templates for your ETL steps.
Use this issue to report bugs that you may encounter. If the issue is very complex, feel free to create a separate - and more extense - issue.
I'm using wizard now for the first time, so I'll write here all issues I find along the way:
- [x] When I open wizard, I get the warning:
2023-09-01 08:37:55.275 WARNING streamlit.runtime.caching.cache_data_api: No runtime found, using MemoryCacheStorageManager
['streamlit', 'run', '/Users/prosado/Documents/owid/repos/etl/apps/wizard/app.py', '--server.port', '8053', '--', '--phase', 'all', '--run-checks']
- [x] I think the "Generate playground notebook" (in the meadow/garden steps) should be unselected by default (I don't think many of us are using it, but I may be wrong). When I unselect it and submit, I get the error:
FileNotFoundError: [Errno 2] No such file or directory: '/Users/prosado/Documents/owid/repos/etl/etl/steps/data/meadow/animal_welfare/2023-09-01/playground.ipynb'
Traceback:
File "/Users/prosado/Documents/owid/repos/etl/.venv/lib/python3.11/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 552, in _run_script
exec(code, module.__dict__)
File "/Users/prosado/Documents/owid/repos/etl/apps/wizard/templating/meadow.py", line 212, in <module>
os.remove(notebook_path)
- [x] After submitting a meadow/garden step, the new files are mentioned on the new page (under "Generated files"). But I think the
dag
file should also be mentioned, if it has been modified. - [x] When loading the resulting snapshot, I was getting a strange error. I realised that
snapshot.metadata.license
was None, even iflicense
was defined in the.dvc
file. It took me a while to realise that the produced.dvc
file hadlicense
indented too many spaces, hence becoming an item of oforigin
. I think they are meant to be separate fields in the snapshot metadata. Once I removed the indent oflicense
(and its items), the issue was fixed. So, I suppose it's a bug, and wizard should not storelicense
insideorigin
(we can discuss if that should be indeed the correct structure). - [x] Field
citation_producer
should be expandable, given that it can be very long (as dataset description fields are).
These issues should be solved in https://github.com/owid/etl/pull/1567
@Marigold @pabloarosado From our discussions and the documentation on Notion, I had assumed that we agreed that the field license
should be under $.meta.origin
.
However, from what Pablo is saying, it sounds like this is raising an error. As a solution for now, I've set wizard to export the field license
both at $.meta.origin.license
and $.meta.license
levels.
Do you think this is fine?
It took me a while to realise that the produced .dvc file had license indented too many spaces, hence becoming an item of of origin. I think they are meant to be separate fields in the snapshot metadata.
The original idea was to keep it under origin, not separate from it. license
is there only for backward compatibility. Ideally, snapshot meta would have either source & license
or origin & origin.license
(we could refactor old files if this causes too much confusion).
Thanks for clarifying, Mojmir. Yes, that's what I had in mind.
@pabloarosado could you share code to reproduce the error?
walkthrough bug that might be occurring here too: https://github.com/owid/etl/issues/1571
- [x] I see that
update_period_days
is mandatory in the guideline, but it is not available by default in any of the YAML files
Two related comments to the ones above:
- [x]
description_processing
is not shown by default when walkthrough garden generates the yaml, and it's a recommended field - [x]
description
is automatically generated in the yaml file, when I see it is not used anymore
In general it would be good to match that file with the final version of the guidelines.
Hey! Should I just reopen for a small issue then?
- [x] The create playground notebook option in wizard doesn't generate a Jupyter Notebook
(copied from https://github.com/owid/etl/issues/1566#issuecomment-1723433161)
It's not super clear to me whether I should be hitting return after filling in each section or not? It seems like a good thing to somewhat validate each entry, but also the form completed unexpectedly early when pressing it, meaning I skipped the last couple of sections.
- [x] When the YAML for garden is generated, there are two localhost links that don't work:
# NOTE: To learn more about the fields, hover over their names.
# Learn more about the available fields:
# http://localhost:8000/architecture/metadata/reference/dataset/
dataset:
update_period_days: 365
# Learn more about the available fields:
# http://localhost:8000/architecture/metadata/reference/tables/
- [ ] The video explaining the harmonization tool is outdated:
Harmonize country names with the following command (assuming country field is called country). Check out a short demo of the tool
- [ ] I think the options to add regions and population data available on walkthrough should also be available on wizard as well.
@spoonerf, good point, but unfortunately, clicking on ENTER is equivalent to submitting the entire form. It will be submitted if the form is valid (i.e., all required fields are present).
@pabloarosado
I think the options to add regions and population data available on walkthrough should also be available on wizard as well.
Which options do you refer to here? I just executed walkthrough
and there aren't - at least on my end - any options to add regions or population datasets.
@lucasrodes Maybe it was removed recently, but they were available as check boxes at the # end of walkthrough garden
- [x] Improve license selector: remove spurious options, add missing ones
- [x] Have namespace as a selector, like the license. Should be based on the names of the folders in
etl/steps/data/channel/
- [x]
variable.description_processing
should be printed in the dummy yaml inetl-wizard garden
step (to not forget to use it when necessary)
- [x] In
etl-wizard snapshot
, in the License section, there is the©producer year
option, but it prints©producer year
and not©{producer} {year}
like it's done inorigin.attribution
Found here https://github.com/owid/etl/pull/1825#discussion_r1367034367
-
[ ] It would be nice to have the
Select local file to import option
with the option to navigate the OS and select the file instead of writing a path that can be long -
[x] When adding a year instead of a date to
date_published
, it printsyear
without quotes. I think it's with quotes to comply with the schema
UPDATE: the current recommended workflow is based on chart-diff. So won't be addressing this issues.
- [x] ~In the chart revisions baker, I tried to move one chart from here to here and after the balloons 🎈 : the links for live and staging (I don't use local) were not updated with the new chart in the chart revision tool.~
- [x] ~I restarted and tried again: I got this error message:~
Something went wrong! (MySQLdb.IntegrityError) (1062, "Duplicate entry '1904-3-3-1' for key 'suggested_chart_revisions.chartId'") [SQL: INSERT INTO suggested_chart_revisions (chartId, createdBy, originalConfig, suggestedConfig, status, createdAt, updatedAt) VALUES (%s, %s, %s, %s, %s, %s, %s)] [parameters: (1904, 59, '{"id": 1904, "map": {"colorScale": {"baseColorScheme": "YlGn", "binningStrategy": "manual", "customNumericColors": [null, null, null, null, null, nul ... (2786 characters truncated) ... 71}], "sourceDesc": "The World Bank", "isPublished": true, "selectedEntityNames": ["Azerbaijan", "Antigua and Barbuda", "Rwanda", "Sudan", "Brazil"]}', '{"id": 1904, "map": {"colorScale": {"baseColorScheme": "YlGn", "binningStrategy": "manual", "customNumericColors": [null, null, null, null, null, nul ... (651 characters truncated) ... 47}], "sourceDesc": "The World Bank", "isPublished": true, "selectedEntityNames": ["Azerbaijan", "Antigua and Barbuda", "Rwanda", "Sudan", "Brazil"]}', 'pending', datetime.datetime(2023, 11, 1, 16, 19, 38, 280631), datetime.datetime(2023, 11, 1, 16, 19, 38, 280631))] (Background on this error at: https://sqlalche.me/e/14/gkpj)
I saved this draft chart to replicate (indicator here).
- [x] Enable setting update frequency to zero (= this dataset won't be updated anymore) [slack thread]
- [x] The links "Explore dataset" in
etl-wizard charts
don't actually work. They have the formathttp://localhost:8053/None/datasets/{dataset_id}/
- [x] I can't find one indicator in
etl-wizard charts
: when trying to migrate from[818136] $30 a day - Number not in poverty (Estimated)
to[818156] $30 a day - Number not in poverty (Smoothed)
I actually can't find the latter.I am using dataset [6341] uniquely (migrating in the same dataset).
EDIT: Also with the migrations
-
[818125] $1.90 a day - Number in poverty (Estimated)
>[818144] $1.90 a day - Number in poverty (Smoothed)
-
[818142] $5-$10 - Number in poverty (Estimated)
>[818162] $5-$10 - Number in poverty (Smoothed)
- [x] When option is selected not to export to any DAG, still we get the instruction
Probably there's not much to do, but
- [x] Variables from datasets 489 and 6398 can't be compared in the Charts upgrader because of this error:
I see that the csv here includes ..
for some data points.
@paarriagadap, thanks for reporting. This issue is, as you mention, because there are non-numeric values in the dataset. Currently, the explore mode only supports comparing numerical values. We could compare categorical ones in the future if we needed to.
However, in this case, I think the indicators should be numbers and ..
should be NaN
instead. That way, the indicator would be recognised as numeric. Also, why do we have ..
there? I bet that's a legacy dataset, and we won't be fixing this (removing ..
). Unfortunately, implementing some logic to replace ..
so that we can compare the data is too specific and don't think we should do it.
- [x] (WON'T DO: SLOWS THE TOOL CONSIDERABLY) The relative difference column in the explore variable mappings option in the Charts upgrader is not colored anymore. I thought it was a nice addition to draw immediate attention to wildly different numbers.
Chart revision:
- [ ] Enable some option to "load from mapping", where the user can load a json with the variable mapping
- [x] If no OPENAI_API_KEY is available, the app will crash miserably if trying to get chatGPT suggestions. Instead, it should just show a warning and let the user continue! We need API KEY in staging servers.
I see that some classic issues with the chart revision tool still happen occasionally:
- The sorting in bar charts is not respected (e.g. this chart):
- The min and max values in the timeline are not respected (e.g. this chart):
It's not urgent, and I have already manually fixed the affected cases. I just mentioned them here so we are aware that these minor things still happen.
I'll be looking at chart revisions this week, so I will take a look if there's an easy fix for that. (Examples like that are super useful by the way)