eurostat
Problem with week time code
It seems eurostat (more specifically, eurotime2date) can't handle weekly data:
temp <- eurostat::get_eurostat("demo_r_mweek3")
#> Warning in eurotime2date(x, last = FALSE): Unknown time code, W. No date conversion was made.
#>
#> Please fill bug report at https://github.com/rOpenGov/eurostat/issues.
#> Table demo_r_mweek3 cached at C:\Users\FERENC~1\AppData\Local\Temp\RtmpsNaBz8/eurostat/demo_r_mweek3_date_code_FF.rds
No, it doesn't. I think weekly data is a relatively new addition in Eurostat.
I thought that it would be easily fixed, but
- it seems as.Date does not support ISO weeks (%V), at least not on Windows (https://stackoverflow.com/questions/45549449/transform-year-week-to-date-object/45587644#45587644),
- nor does lubridate (https://github.com/tidyverse/lubridate/issues/506)
However, there seems to be an ISOweek package: https://cran.r-project.org/web/packages/ISOweek/, which I guess gives the right dates. Or we could use the UK week definition (there is some difference in the starting week).
There also seems to be a week W99. How is that supposed to be treated?
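For readers who want to see what the ISO 8601 week-to-date conversion involves, here is a minimal base R sketch. It relies on the rule that 4 January always falls in ISO week 1. The helper name iso_week_to_date is made up for illustration and is not part of eurostat or ISOweek; the ISOweek package's ISOweek2date() does the same job from strings such as "2020-W01-1".

```r
# iso_week_to_date() is a hypothetical helper, not part of any package.
# It returns the Date of the given ISO weekday (1 = Monday, 7 = Sunday)
# in the given ISO year and week. It does not check that the week exists.
iso_week_to_date <- function(year, week, weekday = 1L) {
  # 4 January always belongs to ISO week 1 of its year.
  jan4 <- as.Date(sprintf("%04d-01-04", as.integer(year)), format = "%Y-%m-%d")
  # POSIXlt$wday runs 0 (Sunday) to 6 (Saturday); this is the number of
  # days from the Monday of week 1 to 4 January.
  days_since_monday <- (as.POSIXlt(jan4)$wday + 6L) %% 7L
  week1_monday <- jan4 - days_since_monday
  week1_monday + (as.integer(week) - 1L) * 7L + (as.integer(weekday) - 1L)
}

iso_week_to_date(2020, 1)   # Monday of 2020-W01, which is 2019-12-30
iso_week_to_date(2015, 53)  # Monday of 2015-W53, which is 2015-12-28
```

Note that the first example lands in the previous calendar year, which is exactly where ISO weeks differ from naive "week of the year" arithmetic.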
Yes, I personally decided to use the ISOweek package too in a similar situation. You definitely need the ISO 8601 standard; the metadata says - for my particular example - that "the definition of ‘week’ is given by ISO8601 week number" (https://ec.europa.eu/eurostat/cache/metadata/en/demomwk_esms.htm).
99 means that the week is not known (to cite the same source: "W99 means ‘unknown week’.").
As it is converted to a Date, what date should W99 be converted to? The last day of the last week?
Very good question. Definitely not the last week, as it'd imply that all people with an unknown death date died in the last week, i.e. they'd be pooled together with those who indeed died in the last week. I don't know whether it breaks any consistency within eurostat, but perhaps the clearest solution would be to set their date to NA...
But then we would lose year information.
I thought that the last week would have information on two dates: dated information on the first day, as normal, and unknown on the last day.
Ah, I forgot that, you're completely correct.
I am no expert in designing such things, but what you outlined seems to be a possible solution. Although the user has to be very clearly informed in this case what those dates exactly mean (and also generally, that while there is a concrete date, the data pertains to a week).
FWIW, {ISOweek} is now the correct solution, I think - I just ended up using it on the same data (national, not Eurostat, but produced to the same standard). Perhaps https://github.com/tidyverse/lubridate/issues/506#issuecomment-770310175 may also be helpful.
And thanks for {eurostat}, very helpful!
My solution is to filter out the data from W99, which is definitely not a clean solution, but given that it only affects Hungary, Latvia and Sweden, it's a workaround.
W99 values by geo and year:
geo/year | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
HU | 5 | 4 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
LV | 90 | 72 | 63 | 41 | 19 | 33 | 29 | 33 | 33 | 19 | 20 | 13 | 18 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
SE | 1493 | 1534 | 1520 | 1437 | 1137 | 822 | 749 | 650 | 515 | 538 | 402 | 428 | 439 | 464 | 486 | 960 | 1963 | 2230 | 2513 | 2616 | 2663 | 713 |
So right now I have this code working fine:
library(dplyr)

df <- demo_r_mwk_ts %>%
  # extract year and week number (characters 1-4 and 6-7 of the time code)
  mutate(year = substr(time, 1, 4),
         week = substr(time, 6, 7)) %>%
  # filter out week 99 (unknown week)
  filter(week != "99") %>%
  # create the date of the Monday of each week using the ISOweek package
  mutate(date = ISOweek::ISOweek2date(paste0(year, "-W", week, "-1")))
The best way would be if Eurostat divided the W99 values and assigned them to each week of the year according to the "weights" of the weeks with known values. If anybody works with countries that have W99 data, then I would suggest doing this manually.
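A minimal base R sketch of such a manual redistribution, for a single series (one geo, one year, and whatever other breakdowns the table has). The helper name redistribute_unknown_week and the numbers are made up for illustration. Sharing the W99 total in proportion to the known weekly values is the same as multiplying every known week by one common factor, which is what the code does.

```r
# redistribute_unknown_week() is a hypothetical helper for illustration.
# `values` are the weekly counts of one series, `week` the matching
# two-character week codes ("01", ..., "53", and "99" for unknown).
redistribute_unknown_week <- function(values, week) {
  known <- week != "99"
  unknown_total <- sum(values[!known], na.rm = TRUE)
  known_total <- sum(values[known], na.rm = TRUE)
  if (known_total == 0) {
    stop("No known weekly values to derive weights from.")
  }
  # One common scaling factor is equivalent to splitting the W99 total
  # in proportion to each week's known value.
  scaled <- values[known] * (1 + unknown_total / known_total)
  names(scaled) <- week[known]
  scaled
}

deaths <- c(100, 300, 600, 50)        # made-up numbers, not Eurostat data
weeks  <- c("01", "02", "03", "99")
adjusted <- redistribute_unknown_week(deaths, weeks)
adjusted
all.equal(sum(adjusted), sum(deaths))  # the yearly total is preserved
```

This only makes sense if the unknown-week cases can be assumed to follow the same seasonal pattern as the known ones, and it is least risky when the W99 share is small.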
If this is a common need, would it be feasible to have an additional enrichment function that could be run after data retrieval?
"The best way would be if Eurostat would divide the W99 values and assign them to each week of the year accordingly to known values week "weights". If anybody works with countries, that have W99 data, then I would suggest to do this manually." I completely agree. As a minimum solution, proportionally increasing all values would work in my opinion. (At least if the proportion of values reported for W99 is small compared to the total.)
I did some testing with the dataset mentioned here and I have to say fixing this weekly data issue was easier than figuring out how to efficiently handle this dataset with 110 million rows (after pivot_longer). Apparently 16 GB of RAM wasn't enough the way it was done before. The results are in commit cfdaf37 of the v4-dev branch (version 4.0.0.9002).
Based on the discussion here I couldn't figure out a sensible solution to W99 values. Drop them? Assign them to the last day of the year? Distribute the values evenly over the whole year? In my solution I coerced W99 to the first day of the first week of the year and the function prints a warning message for the user, suggesting to use get_eurostat(time_format = "raw") if they wish to wrangle the data manually. This might not be optimal and I'd love to hear your thoughts on the matter.
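For those who go the manual route, here is one possible sketch of wrangling raw week codes in base R. It keeps the year, gives each known week a start and end date, and marks W99 with a flag and NA dates instead of a made-up date. The helper name parse_week_codes is hypothetical, and the accepted code shapes ("2020W01" or "2020-W01") are an assumption about what the raw time column contains, so check them against your own data.

```r
# parse_week_codes() is a hypothetical helper, not part of eurostat.
# Assumption: raw codes look like "2020W01" or "2020-W01".
# It does not check that a week number is valid for its year.
parse_week_codes <- function(time) {
  pattern <- "^([0-9]{4})-?W([0-9]{2})$"
  ok <- grepl(pattern, time)
  year <- rep(NA_integer_, length(time))
  week <- rep(NA_integer_, length(time))
  year[ok] <- as.integer(sub(pattern, "\\1", time[ok]))
  week[ok] <- as.integer(sub(pattern, "\\2", time[ok]))
  unknown_week <- ok & week == 99L
  # The Monday of ISO week 1 is the Monday on or before 4 January.
  jan4 <- as.Date(sprintf("%04d-01-04", year), format = "%Y-%m-%d")
  week1_monday <- jan4 - (as.POSIXlt(jan4)$wday + 6L) %% 7L
  week_start <- week1_monday + (week - 1L) * 7L
  # W99 has no real position in the calendar, so its dates stay missing.
  week_start[unknown_week] <- as.Date(NA)
  data.frame(time, year, week_start, week_end = week_start + 6L,
             unknown_week, stringsAsFactors = FALSE)
}

parse_week_codes(c("2020W01", "2020-W53", "2020W99"))
```

Keeping a separate year column addresses the concern raised earlier in the thread that setting the date to NA would lose the year information.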
Closed with the CRAN release of package version 4.0.0