Bug in `add_regions_to_table` when using `countries_that_must_have_data` (in an unusual situation)
Problem
In an unusual situation for aggregation, `World` can have a value, even though `Asia` has no value, since `China` has no value.
Specific example
I noticed this error while working on minerals, because some aggregates (e.g. High-income countries) had larger values than World.
In the following situation:
REGIONS = {**geo.REGIONS, **{"World": {}}}
tb = geo.add_regions_to_table(
    tb=tb,
    regions=REGIONS,
    ds_regions=ds_regions,
    ds_income_groups=ds_income_groups,
    countries_that_must_have_data={
        "Asia": ["China"],
        "World": ["Asia"],
    },
)
- `China` does not have data, so `Asia` does not have data
- `World` does have data, even though `Asia` does not have data
Expected behaviour
If `Asia` does not have data, then `World` should not have data.
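To make the expected behaviour concrete, here is a minimal toy sketch in plain Python. It is not the real `geo.add_regions_to_table` implementation: `MEMBERS`, `MUST_HAVE_DATA` and `aggregate` are hypothetical names, the region memberships are made up, and it assumes that a region used as a requirement is aggregated before the region that depends on it.

```python
# Toy model of the expected behaviour; NOT the real etl implementation.
# All names and memberships below are hypothetical, for illustration only.

MEMBERS = {
    "Asia": ["China", "India"],
    "World": ["China", "India", "France"],
}
MUST_HAVE_DATA = {
    "Asia": ["China"],
    "World": ["Asia"],
}


def aggregate(values, members, must_have_data):
    """Sum member values per region; a region gets None if a required member has no data.

    Regions are processed in dict order, so a region listed as a requirement
    (e.g. "Asia" for "World") must appear before the region that needs it.
    """
    result = dict(values)
    for region, countries in members.items():
        required = must_have_data.get(region, [])
        # A required member counts as missing if it is absent or None,
        # whether it is a country or an already-aggregated region.
        if any(result.get(name) is None for name in required):
            result[region] = None
            continue
        result[region] = sum(values[c] for c in countries if values.get(c) is not None)
    return result


# China has no data, so Asia has no data, and therefore World has no data.
out_missing = aggregate({"China": None, "India": 10.0, "France": 5.0}, MEMBERS, MUST_HAVE_DATA)

# With data for China, both aggregates are computed.
out_complete = aggregate({"China": 20.0, "India": 10.0, "France": 5.0}, MEMBERS, MUST_HAVE_DATA)
```

The key point is that the requirement check looks at the aggregated result (which already contains `Asia`), not only at the original country-level values.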
Technical notes
- This issue may be tricky to fix. At least, we could raise a warning.
- We should write a unit test for this, and then ideally fix it
- ...but fixing it could potentially mean changes for a large number of datasets, so we would need to increment the EPOCH and check the diffs of the output
- ...ideally we would only change behaviour for steps that use `countries_that_must_have_data`
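
As a stopgap for the "raise a warning" option above, a check along these lines could flag the risky configuration before aggregating. This is only a sketch: `warn_on_region_requirements` is a hypothetical helper that does not exist in `geo`, and it assumes `regions` is a dict keyed by region name, as in the example above.

```python
import warnings


def warn_on_region_requirements(regions, countries_that_must_have_data):
    """Warn when a required member is itself one of the regions being aggregated.

    Hypothetical helper, not part of the real `geo` module.
    Returns a dict mapping each affected region to its region-type requirements.
    """
    nested = {}
    for region, required in countries_that_must_have_data.items():
        region_requirements = [member for member in required if member in regions]
        if region_requirements:
            nested[region] = region_requirements
            warnings.warn(
                f"'{region}' requires data for region(s) {region_requirements}; "
                "this requirement may not be enforced if those regions end up without data."
            )
    return nested


example_regions = {"Asia": {}, "World": {}}
example_requirements = {"Asia": ["China"], "World": ["Asia"]}

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    flagged = warn_on_region_requirements(example_regions, example_requirements)
```

Here only `World` is flagged, because its requirement (`Asia`) is another aggregate rather than a country.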