Change `start` and `end` defaults to NULL

Open vincentarelbundock opened this issue 3 years ago • 5 comments

In PR #48 @etiennebacher writes:

Question about defaults

The observation status is particularly useful when we import a particular indicator without specifying the dates. On the World Bank API, when no dates are specified, the whole series is returned. E.g., for this indicator, the data goes until 2023, which is why knowing which observation is a forecast is helpful. However, the behavior of WDI() regarding dates is a bit strange:

  • if I don't specify any start and end dates, the data will be downloaded from the year of creation of the indicator (1960 in most cases) to 2020
> WDI("CHN", "NYGDPMKTPKDZ")
   iso2c country NYGDPMKTPKDZ year
1     CN   China          2.3 2020
2     CN   China          6.0 2019
...
21    CN   China           NA 2000
22    CN   China           NA 1999
  • if I specify end = NULL then it returns data up to the most recent year
> WDI("CHN", "NYGDPMKTPKDZ", end = NULL)
   iso2c country NYGDPMKTPKDZ year
1     CN   China          5.3 2023
2     CN   China          5.4 2022
3     CN   China          8.5 2021
4     CN   China          2.3 2020
...
24    CN   China           NA 2000
25    CN   China           NA 1999
  • if I specify start = NULL and end = NULL, then it errors
> WDI("CHN", "NYGDPMKTPKDZ", start = NULL, end = NULL)
Error in WDI("CHN", "NYGDPMKTPKDZ", start = NULL, end = NULL) : 
  Need to specify dates or number of latest values.

So I'm wondering if it would be better to change the default values for start and end, and to change the default behavior. To me, both start and end should be NULL if I don't specify anything, and if they are NULL then it should get all years, even after 2020. I don't know if this would be a breaking change in the code of other people. What do you think?
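A minimal sketch of the proposed semantics, for illustration only. `build_date_query()` is a hypothetical helper, not a function in WDI, and the choice to reject a single missing bound is an assumption of this sketch. It relies on the World Bank API accepting a `date=start:end` query parameter and returning the whole series when that parameter is omitted, as described above.

```r
# Hypothetical helper, not part of the WDI package: builds the `date`
# query fragment for a World Bank API request.
build_date_query <- function(start = NULL, end = NULL) {
  # Both NULL: omit the date filter so the API returns the whole series.
  if (is.null(start) && is.null(end)) {
    return("")
  }
  # Assumption of this sketch: when one bound is given, both are required.
  if (is.null(start) || is.null(end)) {
    stop("Specify both `start` and `end`, or neither.")
  }
  stopifnot(start <= end)
  sprintf("date=%d:%d", as.integer(start), as.integer(end))
}

build_date_query()            # ""
build_date_query(1999, 2020)  # "date=1999:2020"
```

With this shape, not specifying dates and passing NULL explicitly behave the same way, which removes the error shown in the third bullet.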

vincentarelbundock avatar Jul 22 '21 12:07 vincentarelbundock

As far as I can remember, there are two main reasons why these are the start and end defaults:

  1. In older versions of the World Bank API, the dates were actually mandatory, so I had no choice but to include dates.
  2. I'm a bit worried about the size of datasets for people with slow internet connections.

Then there is, as you pointed out, the possibility of:

  3. Backward incompatible breakage of people's code.

Issue 1 is no longer a concern with the modern API, so we can ignore it.

Do you think Issue 2 is a real problem? My guess is maybe not.

I'm sure someone out there relies on hard-coded default dates, but that feels like bad practice. In general, NULL seems like a more natural and sane default, so unless we can think of a clear problematic case, and if we can convince ourselves that the slow internet is not a problem, then I'd be in favor of changing this.

vincentarelbundock avatar Jul 22 '21 12:07 vincentarelbundock

  2. I don't know if it will be a problem for slow internet connections, and we don't really have a way to test this, right? The only thing I tried is to compare the size of the imported data with and without the latest values (i.e., when end = 2020 and end = NULL):
library(WDI)

x <- WDI("all", "NYGDPMKTPKDZ", end = 2020)
y <- WDI("all", "NYGDPMKTPKDZ", end = NULL)

> dim(x)
[1] 3212    4
> dim(y)
[1] 3650    4
> object.size(x)
108696 bytes
> object.size(y)
120960 bytes

The difference here is ~12 kB, and I don't think it would add a lot of time to download, even with slow connections, but I'm no expert about that. Of course this is not an exhaustive test; the difference may be much bigger if one imports a lot of indicators.

  3. Yeah, basically the new version would return unexpected data if someone didn't specify start and end. I cannot think of a situation where someone would do that without wanting to get the latest years, or without filtering the data afterwards (especially because you change the default value of end every year, right?). Still, it might be a good idea to include a startup message to mention this change, something similar to lrberge/fixest maybe (see the message when loading the package)?

One more thing we have to think about is the place of status (whether an obs is a forecast or not). Suppose that we change the default behavior of WDI() so that it also imports values after 2020. Then someone who is not aware of the inclusion of status in extra = TRUE might be a bit surprised that there are values going up to 2023 or more. Maybe it could be a good thing to include a message such as "Some years in the data are after the current year. You can check the status of these observations by adding extra = TRUE and looking at the column status". I'm just throwing the idea out there, maybe it is not good at all.
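A rough sketch of what such a message could look like. `note_future_years()` is a hypothetical helper, not part of WDI, and the toy data frame only mimics the shape of a WDI() result using values from the China example above.

```r
# Hypothetical helper, not part of WDI: emits a message when the data
# contain years later than the current calendar year.
note_future_years <- function(dat,
                              current_year = as.integer(format(Sys.Date(), "%Y"))) {
  future <- dat$year > current_year
  if (any(future, na.rm = TRUE)) {
    message(
      "Some years in the data are after the current year. ",
      "You can check the status of these observations by adding ",
      "extra = TRUE and looking at the column status."
    )
  }
  # Return the data unchanged so the helper can sit at the end of a pipeline.
  invisible(dat)
}

# Toy data mimicking the shape of a WDI() result.
toy <- data.frame(
  iso2c = "CN", country = "China",
  NYGDPMKTPKDZ = c(5.3, 2.3), year = c(2023, 2020)
)

# `current_year` is fixed here only to make the example reproducible.
note_future_years(toy, current_year = 2021)
```

Using message() rather than warning() keeps the note informational, and users can silence it with suppressMessages().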

I hope it is clear. Anyway, there's no rush to make this change, so maybe give it some time to see what other people think about it. We could also find a strong counter-argument later.

etiennebacher avatar Jul 22 '21 14:07 etiennebacher

Sounds good. I'm convinced by the small object size difference: this is very unlikely to be a major problem for download speed.

Frankly, I'm not sure there's a need for a warning. If people see 2023 as year, it should be obvious to them that the data has not been measured yet. Besides, there's a ton of data from the World Bank that is not actually "measured", but rather estimated or the result of some modelling, and I don't think that WDI should be responsible for issuing a warning every time users request weird data.

Let's not add warnings.

vincentarelbundock avatar Jul 22 '21 14:07 vincentarelbundock

(Of course, I remain open to counter-arguments, like you are. And I don't have much time to invest in this right now, so we can take as much time as needed to think about it. I'm just signalling that I'm basically fine to go ahead if you want to.)

vincentarelbundock avatar Jul 22 '21 14:07 vincentarelbundock

Okay, I'm also not going to do this right now anyway ;)

etiennebacher avatar Jul 22 '21 14:07 etiennebacher

Changed the default to end=NULL, which now sets the end year 5 years into the future to make sure we get all the projections. I think it still makes sense to keep start=1960 by default, since we would just swap that in anyway.
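For readers skimming the thread, the behavior described above can be sketched roughly as follows. `resolve_years()` is a hypothetical name and this is not the package's actual code; the `today` argument exists only to make the example reproducible.

```r
# Hypothetical helper, not the package's actual implementation:
# a NULL `end` is replaced by the current year plus 5 so that
# projections are included; `start` keeps its 1960 default.
resolve_years <- function(start = 1960, end = NULL, today = Sys.Date()) {
  if (is.null(end)) {
    end <- as.integer(format(today, "%Y")) + 5L
  }
  c(start = as.integer(start), end = as.integer(end))
}

# On the date of this comment, the resolved range would be 1960 to 2027.
resolve_years(today = as.Date("2022-08-24"))

# An explicit end year is passed through unchanged.
resolve_years(end = 2020)
```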

Also note that we have a cool new latest argument.

Will release to CRAN very soon.

vincentarelbundock avatar Aug 24 '22 21:08 vincentarelbundock