linelist
Fix spurious warnings in guess_dates
Sorry that I cannot share data to reproduce this, but I get the following on the latest version of linelist:
> x <- x %>%
+ mutate_at(.vars = vars(contains("date")),
+ .funs = guess_dates)
Warning messages:
1: In linelist::guess_dates(x, error_tolerance = error_tolerance, ...) :
The following 1 dates were not in the correct timeframe (1969-05-19 -- 2019-05-19):
original | parsed
-------- | ------
2019-12-16 | 2019-12-16
2: In linelist::guess_dates(x, error_tolerance = error_tolerance, ...) :
The following 2 dates were not in the correct timeframe (1969-05-19 -- 2019-05-19):
original | parsed
-------- | ------
2019-07-04 | 2019-07-04
2019-10-21 | 2019-10-21
3: In linelist::guess_dates(x, error_tolerance = error_tolerance, ...) :
The following 9 dates were not in the correct timeframe (1969-05-19 -- 2019-05-19):
original | parsed
-------- | ------
2019-08-08 | 2019-08-08
2019-08-22 | 2019-08-22
2019-09-02 | 2019-09-02
2019-09-03 | 2019-09-03
2019-09-11 | 2019-09-11
2019-10-20 | 2019-10-20
2019-11-02 | 2019-11-02
2019-11-20 | 2019-11-20
2019-12-12 | 2019-12-12
4: In linelist::guess_dates(x, error_tolerance = error_tolerance, ...) :
The following 27 dates were not in the correct timeframe (1969-05-19 -- 2019-05-19):
original | parsed
-------- | ------
2019-08-08 | 2019-08-08
2019-08-15 | 2019-08-15
2019-08-16 | 2019-08-16
2019-08-18 | 2019-08-18
2019-08-19 | 2019-08-19
2019-08-22 | 2019-08-22
2019-08-30 | 2019-08-30
2019-09-06 | 2019-09-06
2019-09-14 | 2019-09-14
2019-09-16 | 2019-09-16
2019-09-17 | 2019-09-17
2019-09-19 | 2019-09-19
2019-09-21 | 2019-09-21
2019-09-27 | 2019-09-27
2019-10-04 | 2019-10-04
2019-10-08 | 2019-10-08
2019-10-10 | 2019-10-10
2019-10-12 | 2019-10-12
2019-10-13 | 2019-10-13
2019-10-24 | 2019-10-24
2019-10-25 | 2019-10-25
2019-10-30 | 2019-10-30
2019-10-31 | 2019-10-31
2019-11-02 | 2019-11-02
2019-11-09 | 2019-11-09
2019-11-13 | 2019-11-13
2019-12-14 | 2019-12-14
5: In linelist::guess_dates(x, error_tolerance = error_tolerance, ...) :
The following 16 dates were not in the correct timeframe (1969-05-19 -- 2019-05-19):
original | parsed
-------- | ------
2019-08-17 | 2019-08-17
2019-08-19 | 2019-08-19
2019-09-04 | 2019-09-04
2019-09-19 | 2019-09-19
2019-09-20 | 2019-09-20
2019-09-22 | 2019-09-22
2019-09-28 | 2019-09-28
2019-09-29 | 2019-09-29
2019-10-13 | 2019-10-13
2019-10-14 | 2019-10-14
2019-10-16 | 2019-10-16
2019-10-30 | 2019-10-30
2019-11-03 | 2019-11-03
2019-11-15 | 2019-11-15
2019-11-17 | 2019-11-17
2019-12-18 | 2019-12-18
>
I wouldn’t call this spurious. These dates are all beyond last_date. What behavior do you expect?
If you want to get rid of the warning, then set last_date = Sys.Date() + 365
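To make the effect of widening the window concrete, here is a minimal base R sketch. It does not call linelist: `out_of_timeframe()` is a hypothetical helper written only for this illustration, and `today` is a fixed stand-in for `Sys.Date()` so the result is reproducible.

```r
# Hypothetical helper (not part of linelist): TRUE for dates outside the window.
out_of_timeframe <- function(dates, first_date, last_date) {
  dates < first_date | dates > last_date
}

today      <- as.Date("2019-05-19")  # fixed stand-in for Sys.Date()
first_date <- as.Date("1969-05-19")
dates      <- as.Date(c("2019-07-04", "2019-10-21", "2019-12-16"))

# Window ends today: every future date is flagged.
out_of_timeframe(dates, first_date, last_date = today)
# TRUE TRUE TRUE

# Window extended by a year: nothing is flagged.
out_of_timeframe(dates, first_date, last_date = today + 365)
# FALSE FALSE FALSE
```

In the original pipeline, the extra argument can presumably be passed through `mutate_at()`, which forwards `...` to the function: `mutate_at(.vars = vars(contains("date")), .funs = guess_dates, last_date = Sys.Date() + 365)`.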
On May 19, 2019, at 12:34, Thibaut Jombart wrote the report quoted above.
The problem is having long lists of original / parsed dates that are identical.
Is the problem the length or the fact that they appear to be identical?
To give a bit of background as to what is happening:
Because guess_dates() attempts to convert YMD, DMY, and MDY in that specific order, it's possible for some dates to fail because they were parsed incorrectly (e.g. the DMY date 11/02/2019 is interpreted as 2019-11-02 under the MDY system). These results are collected as they are parsed and then presented in a table as you saw. Usually it looks something like this:
library("linelist")
x <- c("04 Feb 1982", "19 Sep 2018", "2001-01-01", "2011.12.13",
"ba;abb;a: 03:11:2012!", "haha... 2013-12-13..",
"that's a NA", "gender", "not a date", "01__Feb__1999___",
"19/09/18", "09/08/18", "2018-08-09")
last_date <- as.Date("2012-11-05")
first_date <- as.Date("1962-11-05")
res <- guess_dates(x, error_tolerance = 1, last_date = last_date)
#> Warning in guess_dates(x, error_tolerance = 1, last_date = last_date):
#> The following 5 dates were not in the correct timeframe (1962-11-05 -- 2012-11-05):
#>
#> original | parsed
#> -------- | ------
#> 09/08/18 | 2018-08-09
#> 09/08/18 | 2018-09-08
#> 19 Sep 2018 | 2018-09-19
#> 19/09/18 | 2018-09-19
#> 2018-08-09 | 2018-08-09
#> haha... 2013-12-13.. | 2013-12-13
res
#> [1] "1982-02-04" NA "2001-01-01" "2011-12-13" "2012-11-03"
#> [6] NA NA NA NA "1999-02-01"
#> [11] NA NA NA
Created on 2019-05-20 by the reprex package (v0.3.0)
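As a rough illustration of the mechanism described above, here is a base R sketch. It is not the actual guess_dates() code: `try_orders()`, its arguments, and the rule "keep the first in-range candidate" are assumptions made for this example.

```r
# Hypothetical sketch, not linelist's implementation.
# Parse one string under several formats and separate in-range from rejected candidates.
try_orders <- function(x, formats, first_date, last_date) {
  # One candidate per format; a format that does not match yields NA
  candidates <- do.call(c, lapply(formats, function(f) as.Date(x, format = f)))
  names(candidates) <- names(formats)
  in_range <- !is.na(candidates) &
    candidates >= first_date & candidates <= last_date
  list(
    # Assumption: the first in-range candidate wins
    result   = if (any(in_range)) candidates[in_range][1] else as.Date(NA),
    # Parsed but outside the window: these would end up in the warning table
    rejected = candidates[!is.na(candidates) & !in_range]
  )
}

formats <- c(ymd = "%Y-%m-%d", dmy = "%d/%m/%Y", mdy = "%m/%d/%Y")
res <- try_orders("11/02/2019", formats,
                  first_date = as.Date("1969-05-19"),
                  last_date  = as.Date("2019-05-19"))
res$result    # the DMY reading, 2019-02-11
res$rejected  # the MDY reading, 2019-11-02, which is beyond last_date
```

Under these assumptions the final result is correct, yet the rejected MDY reading is still reported, which is why a warning can appear even when the conversion went fine.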
Do you want me to get rid of this warning altogether?
I think it would already be a step forward if the warning could say which column is concerned. If guess_dates is called by clean_data you don't necessarily know which part of the warning comes from which column.
Thank you for adding this clarification, @ffinger, and I agree with you. Collecting warnings in a loop is not a straightforward problem, but luckily, I've already written some code to handle this situation in clean_variable_spelling() (see below) and can implement it in clean_dates() if you want.
I think adopting the warning pattern that readr::parse_date() uses will be helpful: https://readr.tidyverse.org/reference/parse_datetime.html
library("linelist")
my_data_frame <- data.frame(
raboof = c(letters[1:5], "foubar", "foobr", "fubar", "", "unknown", "fumar"),
treatment = c(letters[5:1], "Y", "Yes", "N", NA, "No", "yes"),
region = state.name[1:11]
)
corrections <- data.frame(
bad = c("foubar", "foobr", "fubar", ".missing", "unknown", "Yes", "Y", "No", "N", ".missing"),
good = c("foobar", "foobar", "foobar", "missing", "missing", "yes", "yes", "no", "no", "missing"),
column = c(rep("raboof", 5), rep("treatment", 5)),
orders = c(1:5, 5:1),
stringsAsFactors = FALSE
)
corr <- data.frame(bad = c(".default", ".default"),
good = c("check data", "check data"),
column = c("raboof", "treatment"),
orders = Inf,
stringsAsFactors = FALSE
)
corr <- rbind(corrections, corr)
clean_variable_spelling(my_data_frame, corr, warn = TRUE)
#> Warning in clean_variable_spelling(my_data_frame, corr, warn = TRUE): The following warnings were found...
#> raboof_____:
#> .... 'a', 'b', 'c', 'd', 'e', 'fumar' were changed to the default value ('check data')
#> treatment__:
#> .... 'a', 'b', 'c', 'd', 'e' were changed to the default value ('check data')
#> raboof treatment region
#> 1 check data check data Alabama
#> 2 check data check data Alaska
#> 3 check data check data Arizona
#> 4 check data check data Arkansas
#> 5 check data check data California
#> 6 foobar yes Colorado
#> 7 foobar yes Connecticut
#> 8 foobar no Delaware
#> 9 missing missing Florida
#> 10 missing no Georgia
#> 11 check data yes Hawaii
Created on 2019-10-28 by the reprex package (v0.3.0)
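For reference, one way to collect warnings per column in a loop is `withCallingHandlers()`. The sketch below is base R only and is not the code used in clean_variable_spelling(): `collect_warnings()` is a hypothetical helper, and `as.numeric()` merely stands in for any column-wise function that may warn.

```r
# Hypothetical helper (not linelist code): apply `f` to each column,
# muffle its warnings, and re-issue them once, grouped by column name.
collect_warnings <- function(df, f, ...) {
  warned <- list()
  for (col in names(df)) {
    df[[col]] <- withCallingHandlers(
      f(df[[col]], ...),
      warning = function(w) {
        # Store the message under the current column, then silence it
        warned[[col]] <<- c(warned[[col]], conditionMessage(w))
        invokeRestart("muffleWarning")
      }
    )
  }
  if (length(warned) > 0) {
    msg <- vapply(names(warned), function(n) {
      paste0(n, ":\n  ", paste(warned[[n]], collapse = "\n  "))
    }, character(1))
    warning("The following warnings were found...\n",
            paste(msg, collapse = "\n"), call. = FALSE)
  }
  df
}

df <- data.frame(a = c("1", "x"), b = c("2", "3"), stringsAsFactors = FALSE)
# Only column `a` triggers a coercion warning, and the combined warning says so.
res <- collect_warnings(df, as.numeric)
```

The same wrapper could in principle be put around a date-guessing function so that each block of the warning is labelled with its column.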
I am getting warnings which look like they may not be appropriate. Example below:
dates <- c("18_03_2020", "19_03_2020", "20_03_2020", "21_03_2020", "22_03_2020",
"23_03_2020", "24_03_2020", "25_03_2020", "26_03_2020", "27_03_2020",
"28_03_2020", "29_03_2020", "30_03_2020", "31_03_2020", "01_04_2020",
"02_04_2020", "03_04_2020", "04_04_2020", "05_04_2020", "06_04_2020",
"07_04_2020", "08_04_2020")
res <- linelist::guess_dates(dates)
gives the following warning:
Warning message:
In linelist::guess_dates(dates) :
The following 4 dates were not in the correct timeframe (1970-04-10 -- 2020-04-10):
original | parsed
-------- | ------
05_04_2020 | 2020-05-04
06_04_2020 | 2020-06-04
07_04_2020 | 2020-07-04
08_04_2020 | 2020-08-04
This would suggest that the conversion did not go as planned, but that is actually not the case:
> res
[1] "2020-03-18" "2020-03-19" "2020-03-20" "2020-03-21" "2020-03-22"
[6] "2020-03-23" "2020-03-24" "2020-03-25" "2020-03-26" "2020-03-27"
[11] "2020-03-28" "2020-03-29" "2020-03-30" "2020-03-31" "2020-04-01"
[16] "2020-04-02" "2020-04-03" "2020-04-04" "2020-04-05" "2020-04-06"
[21] "2020-04-07" "2020-04-08"
> range(res)
[1] "2020-03-18" "2020-04-08"
The warnings come from the fact that it's trying out both the "mdy" and "dmy" versions of the dates. If you only expect dmy versions of dates, then set orders = "dmy"
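The difference between the two readings can be reproduced in base R. This sketch uses `as.Date()` with explicit format strings as a stand-in for the "dmy" and "mdy" orders (it does not call guess_dates()), and `last_date` is fixed to the upper bound shown in the warning above.

```r
dates     <- c("18_03_2020", "05_04_2020", "08_04_2020")
last_date <- as.Date("2020-04-10")  # upper bound taken from the warning above

# Day-month-year reading: what the data actually means
dmy <- as.Date(dates, format = "%d_%m_%Y")
dmy
# "2020-03-18" "2020-04-05" "2020-04-08"

# Month-day-year reading: "18" is not a valid month, so the first entry is NA
mdy <- as.Date(dates, format = "%m_%d_%Y")
mdy
# NA "2020-05-04" "2020-08-04"

# The valid MDY readings fall after last_date, which is what the warning reports
mdy > last_date
# NA TRUE TRUE
```

Restricting the parser to a single order means the MDY candidates are never generated, so there is nothing left to warn about.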