WDI
Change `start` and `end` defaults to NULL
In PR #48 @etiennebacher writes:
Question about defaults
The observation status is particularly useful when we import a particular indicator without specifying the dates. On the World Bank API, when no dates are specified, the whole series is returned. E.g., for this indicator, the data goes until 2023, which is why knowing which observation is a forecast is helpful. However, the behavior of `WDI()` regarding dates is a bit strange:
- if I don't specify any start and end dates, the data will be downloaded from the year of creation of the indicator (1960 in most cases) to 2020:

      WDI("CHN", "NYGDPMKTPKDZ")
         iso2c country NYGDPMKTPKDZ year
      1     CN   China          2.3 2020
      2     CN   China          6.0 2019
      ...
      21    CN   China           NA 2000
      22    CN   China           NA 1999

- if I specify `end = NULL`, then it gets data up to the most recent year:

      > WDI("CHN", "NYGDPMKTPKDZ", end = NULL)
         iso2c country NYGDPMKTPKDZ year
      1     CN   China          5.3 2023
      2     CN   China          5.4 2022
      3     CN   China          8.5 2021
      4     CN   China          2.3 2020
      ...
      24    CN   China           NA 2000
      25    CN   China           NA 1999

- if I specify `start = NULL` and `end = NULL`, then it errors:

      > WDI("CHN", "NYGDPMKTPKDZ", start = NULL, end = NULL)
      Error in WDI("CHN", "NYGDPMKTPKDZ", start = NULL, end = NULL) :
        Need to specify dates or number of latest values.

So I'm wondering if it would be better to change the default values for `start` and `end`, and to change the default behavior. To me, both `start` and `end` should be `NULL` if I don't specify anything, and if they are `NULL` then it should get all years, even after 2020. I don't know if this would be a breaking change in the code of other people. What do you think?
As far as I can remember, there are two main reasons why these are the `start` and `end` defaults:
1. In older versions of the World Bank API, the dates were actually mandatory, so I had no choice but to include dates.
2. I'm a bit worried about the size of datasets for people with slow internet connections.

Then there is, as you pointed out, the possibility of:

3. Backward incompatible breakage of people's code.
Issue 1 is no longer a concern with the modern API, so we can ignore it.
Do you think Issue 2 is a real problem? My guess is maybe not.
I'm sure someone out there relies on hard-coded default dates, but that feels like bad practice. In general, `NULL` seems like a more natural and sane default, so unless we can think of a clear problematic case, and if we can convince ourselves that the slow internet is not a problem, then I'd be in favor of changing this.
- I don't know if it will be a problem for slow internet connections, and we don't really have a way to test this, right? The only thing I tried is to compare the size of the imported data with and without the latest values (i.e., when `end = 2020` and `end = NULL`):
library(WDI)
x <- WDI("all", "NYGDPMKTPKDZ", end = 2020)
y <- WDI("all", "NYGDPMKTPKDZ", end = NULL)
> dim(x)
[1] 3212 4
> dim(y)
[1] 3650 4
> object.size(x)
108696 bytes
> object.size(y)
120960 bytes
The difference here is ~12 kB, and I don't think it would add a lot of time to download, even with slow connections, but I'm no expert on that. Of course this is not an exhaustive test; maybe the difference is much bigger if one imports a lot of indicators.
- Yeah, basically the new version would return unexpected data if someone didn't specify `start` and `end`. I cannot think of a situation where someone would do that without wanting to get the latest years, or without filtering the data afterwards (especially because you change the default value of `end` every year, right?). Still, it might be a good idea to include a startup message to mention this change, something similar to `lrberge/fixest` maybe (see the message when loading the package)?
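
For illustration, a startup message of the kind suggested here could be sketched with base R's `.onAttach()` hook and `packageStartupMessage()`. The wording of the message below is an assumption made up for this example, not text from WDI or from fixest.

```r
# Hypothetical startup hook for a package. The message wording is an
# assumption for illustration, not text taken from the WDI package.
.onAttach <- function(libname, pkgname) {
  packageStartupMessage(
    "Note: the default `end` of WDI() has changed. ",
    "Calls without explicit dates may now return more recent years."
  )
}

# Users who do not want to see the message can load the package quietly:
# suppressPackageStartupMessages(library(WDI))
```

Using `packageStartupMessage()` rather than `message()` is what lets users silence it with `suppressPackageStartupMessages()`.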
One more thing we have to think about is the place of `status` (whether an observation is a forecast or not). Suppose that we change the default behavior of `WDI()` so that it also imports values after 2020. Then someone who is not aware of the inclusion of `status` in `extra = TRUE` might be a bit surprised that there are values going up to 2023 or more. Maybe it could be a good thing to include a message such as "Some years in the data are after the current year. You can check the status of these observations by adding `extra = TRUE` and looking at the column `status`". I'm just throwing the idea out there, maybe it is not good at all.
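
A minimal sketch of what such a check might look like, written as a standalone helper. The function name `note_future_years()` and the toy data frame are hypothetical and only mirror the shape of the `WDI()` output quoted earlier in the thread; nothing here is taken from the package source.

```r
# Hypothetical helper, not part of WDI: emit a message when a data frame
# contains years later than the current calendar year.
note_future_years <- function(dat,
                              current_year = as.integer(format(Sys.Date(), "%Y"))) {
  future <- dat$year > current_year
  if (any(future, na.rm = TRUE)) {
    message(
      "Some years in the data are after the current year. ",
      "You can check the status of these observations by adding ",
      "`extra = TRUE` and looking at the column `status`."
    )
  }
  invisible(dat)
}

# Toy data shaped like the WDI() output quoted earlier (values are from that output).
toy <- data.frame(
  iso2c = "CN", country = "China",
  NYGDPMKTPKDZ = c(5.3, 2.3), year = c(2023, 2020)
)

note_future_years(toy, current_year = 2020)  # emits the message
note_future_years(toy, current_year = 2023)  # silent
```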
I hope it is clear. Anyway, there's no rush to do this change, so maybe give it some time to see what other people think about it. We could also find a strong counter-argument later.
Sounds good. I'm convinced by the small object size difference: this is very unlikely to be a major problem for download speed.
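
As a rough back-of-the-envelope check, the reported sizes can be turned into a transfer-time estimate. The 56 kbit/s connection speed is an assumption chosen purely for illustration, and `object.size()` measures the in-memory R object, not the bytes sent over the wire, so this is only an order-of-magnitude figure.

```r
# Figures reported above by object.size() and dim().
size_2020 <- 108696                    # bytes, end = 2020
size_null <- 120960                    # bytes, end = NULL
extra_bytes <- size_null - size_2020   # 12264 bytes, i.e. ~12 kB
extra_rows  <- 3650 - 3212             # 438 additional rows

# Assumed connection speed, for illustration only: 56 kbit/s.
speed_bits_per_sec <- 56000
extra_seconds <- extra_bytes * 8 / speed_bits_per_sec
round(extra_seconds, 2)                # 1.75
```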
Frankly, I'm not sure there's a need for a warning. If people see 2023 as `year`, it should be obvious to them that the data has not been measured yet. Besides, there's a ton of data from the World Bank that is not actually "measured", but rather estimated or the result of some modelling, and I don't think that `WDI` should be responsible for issuing a warning every time users request weird data.
Let's not add warnings.
(Of course, I remain open to counter-arguments, like you are. And I don't have much time to invest in this right now, so we can take as much time as needed to think about it. I'm just signalling that I'm basically fine to go ahead if you want to.)
Okay, I'm also not going to do this right now anyway ;)
Changed to `end=NULL` by default, which sets the end date 5 years into the future to make sure we get all the projections. I think it still makes sense to keep `start=1960` by default, since we would just swap that in anyway.
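
The rule described here can be sketched as follows. The helper `resolve_end()` and its `today` argument are hypothetical names used for illustration; this is not the package's actual source.

```r
# Hypothetical helper illustrating the rule described above; the function
# name and structure are assumptions, not the package's source.
resolve_end <- function(end = NULL, today = Sys.Date()) {
  if (is.null(end)) {
    # Look 5 years ahead so that projections are included.
    end <- as.integer(format(today, "%Y")) + 5L
  }
  as.integer(end)
}

resolve_end(today = as.Date("2022-06-01"))        # 2027
resolve_end(2020, today = as.Date("2022-06-01"))  # 2020
```

An explicit `end` is passed through unchanged, so code that already sets its own dates behaves as before.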
Also note that we have a cool new `latest` argument.
Will release to CRAN very soon.