Fix spurious warnings in guess_dates

Open thibautjombart opened this issue 5 years ago • 8 comments

Sorry, I cannot share data to reproduce this, but I get the following on the latest version of linelist:

> x <- x %>%
+   mutate_at(.vars = vars(contains("date")),
+             .funs = guess_dates)
Warning messages:
1: In linelist::guess_dates(x, error_tolerance = error_tolerance, ...) : 
The following 1 dates were not in the correct timeframe (1969-05-19 -- 2019-05-19):

  original    |  parsed    
  --------    |  ------    
  2019-12-16  |  2019-12-16
2: In linelist::guess_dates(x, error_tolerance = error_tolerance, ...) : 
The following 2 dates were not in the correct timeframe (1969-05-19 -- 2019-05-19):

  original    |  parsed    
  --------    |  ------    
  2019-07-04  |  2019-07-04
  2019-10-21  |  2019-10-21
3: In linelist::guess_dates(x, error_tolerance = error_tolerance, ...) : 
The following 9 dates were not in the correct timeframe (1969-05-19 -- 2019-05-19):

  original    |  parsed    
  --------    |  ------    
  2019-08-08  |  2019-08-08
  2019-08-22  |  2019-08-22
  2019-09-02  |  2019-09-02
  2019-09-03  |  2019-09-03
  2019-09-11  |  2019-09-11
  2019-10-20  |  2019-10-20
  2019-11-02  |  2019-11-02
  2019-11-20  |  2019-11-20
  2019-12-12  |  2019-12-12
4: In linelist::guess_dates(x, error_tolerance = error_tolerance, ...) : 
The following 27 dates were not in the correct timeframe (1969-05-19 -- 2019-05-19):

  original    |  parsed    
  --------    |  ------    
  2019-08-08  |  2019-08-08
  2019-08-15  |  2019-08-15
  2019-08-16  |  2019-08-16
  2019-08-18  |  2019-08-18
  2019-08-19  |  2019-08-19
  2019-08-22  |  2019-08-22
  2019-08-30  |  2019-08-30
  2019-09-06  |  2019-09-06
  2019-09-14  |  2019-09-14
  2019-09-16  |  2019-09-16
  2019-09-17  |  2019-09-17
  2019-09-19  |  2019-09-19
  2019-09-21  |  2019-09-21
  2019-09-27  |  2019-09-27
  2019-10-04  |  2019-10-04
  2019-10-08  |  2019-10-08
  2019-10-10  |  2019-10-10
  2019-10-12  |  2019-10-12
  2019-10-13  |  2019-10-13
  2019-10-24  |  2019-10-24
  2019-10-25  |  2019-10-25
  2019-10-30  |  2019-10-30
  2019-10-31  |  2019-10-31
  2019-11-02  |  2019-11-02
  2019-11-09  |  2019-11-09
  2019-11-13  |  2019-11-13
  2019-12-14  |  2019-12-14
5: In linelist::guess_dates(x, error_tolerance = error_tolerance, ...) : 
The following 16 dates were not in the correct timeframe (1969-05-19 -- 2019-05-19):

  original    |  parsed    
  --------    |  ------    
  2019-08-17  |  2019-08-17
  2019-08-19  |  2019-08-19
  2019-09-04  |  2019-09-04
  2019-09-19  |  2019-09-19
  2019-09-20  |  2019-09-20
  2019-09-22  |  2019-09-22
  2019-09-28  |  2019-09-28
  2019-09-29  |  2019-09-29
  2019-10-13  |  2019-10-13
  2019-10-14  |  2019-10-14
  2019-10-16  |  2019-10-16
  2019-10-30  |  2019-10-30
  2019-11-03  |  2019-11-03
  2019-11-15  |  2019-11-15
  2019-11-17  |  2019-11-17
  2019-12-18  |  2019-12-18
> 

thibautjombart avatar May 19 '19 11:05 thibautjombart

I wouldn’t call this spurious. These dates are all beyond last_date. What behavior do you expect?

If you want to get rid of the warning, then set last_date = Sys.Date() + 365.
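
To illustrate why widening the window silences the warning, here is a minimal base R sketch. It is not linelist's code: `outside_window()` is a hypothetical helper, and the three dates are a small sample chosen for illustration.

```r
# Hypothetical helper, not part of linelist: TRUE where a date falls
# outside the closed window [first_date, last_date].
outside_window <- function(dates, first_date, last_date) {
  dates < first_date | dates > last_date
}

parsed     <- as.Date(c("2019-07-04", "2019-12-16", "2018-01-15"))
first_date <- as.Date("1969-05-19")
today      <- as.Date("2019-05-19")  # the day the warnings above were produced

# Window ending today: the two future dates are flagged.
flag_default <- outside_window(parsed, first_date, today)

# Window extended by a year: nothing is flagged.
flag_wide <- outside_window(parsed, first_date, today + 365)
```

With linelist itself, the equivalent is to pass `last_date = Sys.Date() + 365` to `guess_dates()`.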

zkamvar avatar May 19 '19 12:05 zkamvar

Problem is having long lists of dates original / parsed that are identical.

thibautjombart avatar May 20 '19 12:05 thibautjombart

> Problem is having long lists of dates original / parsed that are identical.

Is the problem the length or the fact that they appear to be identical?

zkamvar avatar May 20 '19 12:05 zkamvar

To give a bit of background as to what is happening:

Because guess_dates() attempts to convert YMD, DMY, and MDY in that specific order, it's possible for some dates to fail because they were parsed incorrectly (e.g. the DMY date 11/02/2019 is interpreted as 2019-11-02 under the MDY system). These results are collected as they are parsed and then presented in a table as you saw. Usually it looks something like this:

library("linelist")
x <- c("04 Feb 1982", "19 Sep 2018", "2001-01-01", "2011.12.13",
       "ba;abb;a: 03:11:2012!", "haha... 2013-12-13..",
       "that's a NA", "gender", "not a date", "01__Feb__1999___", 
       "19/09/18", "09/08/18", "2018-08-09")
last_date <- as.Date("2012-11-05")
first_date <- as.Date("1962-11-05")
res <- guess_dates(x, error_tolerance = 1, last_date = last_date)
#> Warning in guess_dates(x, error_tolerance = 1, last_date = last_date): 
#> The following 5 dates were not in the correct timeframe (1962-11-05 -- 2012-11-05):
#> 
#>   original              |  parsed    
#>   --------              |  ------    
#>   09/08/18              |  2018-08-09
#>   09/08/18              |  2018-09-08
#>   19 Sep 2018           |  2018-09-19
#>   19/09/18              |  2018-09-19
#>   2018-08-09            |  2018-08-09
#>   haha... 2013-12-13..  |  2013-12-13
res
#>  [1] "1982-02-04" NA           "2001-01-01" "2011-12-13" "2012-11-03"
#>  [6] NA           NA           NA           NA           "1999-02-01"
#> [11] NA           NA           NA

Created on 2019-05-20 by the reprex package (v0.3.0)
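
The ambiguity itself can be reproduced without linelist. The following is a rough base R sketch, assuming only two candidate orders and a fixed `/` separator; `candidates()` is a hypothetical helper and does not reflect how guess_dates() is implemented internally.

```r
# Hypothetical helper, not linelist's implementation: parse one string under
# each candidate format and return every result as "YYYY-MM-DD" text.
candidates <- function(x, formats = c(dmy = "%d/%m/%Y", mdy = "%m/%d/%Y")) {
  vapply(formats, function(f) format(as.Date(x, format = f)), character(1))
}

# The same string yields two different dates depending on the order tried.
cand <- candidates("11/02/2019")

# Only the MDY reading falls after the end of the allowed window.
last_date <- as.Date("2019-05-19")
too_late  <- as.Date(unname(cand)) > last_date
names(too_late) <- names(cand)
```

Here the DMY reading (11 February) is inside the window while the MDY reading (2 November) is not, which is the kind of candidate that ends up in the warning table.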

Do you want me to get rid of this warning altogether?

zkamvar avatar May 20 '19 13:05 zkamvar

I think it would already be a step forward if the warning could say which column is concerned. If guess_dates is called by clean_data you don't necessarily know which part of the warning comes from which column.

ffinger avatar Oct 27 '19 17:10 ffinger

> I think it would already be a step forward if the warning could say which column is concerned. If guess_dates is called by clean_data you don't necessarily know which part of the warning comes from which column.

Thank you for adding this clarification, @ffinger, and I agree with you. Collecting warnings in a loop is not a straightforward problem, but luckily, I've already written some code to handle this situation in clean_variable_spelling() (see below) and can implement it in clean_dates() if you want.

I think adopting the warning pattern that readr::parse_date() uses will be helpful: https://readr.tidyverse.org/reference/parse_datetime.html

  library("linelist")
  my_data_frame <- data.frame(
    raboof    = c(letters[1:5], "foubar", "foobr", "fubar", "", "unknown", "fumar"),
    treatment = c(letters[5:1], "Y", "Yes", "N", NA, "No", "yes"),
    region    = state.name[1:11]
  )
  corrections <- data.frame(
    bad = c("foubar", "foobr", "fubar", ".missing", "unknown", "Yes", "Y", "No", "N", ".missing"),
    good = c("foobar", "foobar", "foobar", "missing", "missing", "yes", "yes", "no", "no", "missing"),
    column = c(rep("raboof", 5), rep("treatment", 5)),
    orders = c(1:5, 5:1),
    stringsAsFactors = FALSE
  )
  corr <- data.frame(bad = c(".default", ".default"),
                     good = c("check data", "check data"),
                     column = c("raboof", "treatment"),
                     orders = Inf,
                     stringsAsFactors = FALSE
  )
  corr <- rbind(corrections, corr)
  clean_variable_spelling(my_data_frame, corr, warn = TRUE)
#> Warning in clean_variable_spelling(my_data_frame, corr, warn = TRUE): The following warnings were found...
#>   raboof_____:
#>   .... 'a', 'b', 'c', 'd', 'e', 'fumar' were changed to the default value ('check data')
#>   treatment__:
#>   .... 'a', 'b', 'c', 'd', 'e' were changed to the default value ('check data')
#>        raboof  treatment      region
#> 1  check data check data     Alabama
#> 2  check data check data      Alaska
#> 3  check data check data     Arizona
#> 4  check data check data    Arkansas
#> 5  check data check data  California
#> 6      foobar        yes    Colorado
#> 7      foobar        yes Connecticut
#> 8      foobar         no    Delaware
#> 9     missing    missing     Florida
#> 10    missing         no     Georgia
#> 11 check data        yes      Hawaii

Created on 2019-10-28 by the reprex package (v0.3.0)
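
For date columns, the same idea of grouping warnings by column could look roughly like this in base R. This is only a sketch: `with_column_warnings()` and `parse_iso()` are hypothetical names, not linelist functions, and the toy parser stands in for guess_dates().

```r
# Hypothetical wrapper: apply `fun` to every column, capture the warnings it
# raises, and re-issue them as a single warning labelled by column name.
with_column_warnings <- function(df, fun) {
  msgs <- list()
  for (col in names(df)) {
    df[[col]] <- withCallingHandlers(
      fun(df[[col]]),
      warning = function(w) {
        # Store the message under the current column, then silence it.
        msgs[[col]] <<- c(msgs[[col]], conditionMessage(w))
        invokeRestart("muffleWarning")
      }
    )
  }
  if (length(msgs) > 0) {
    txt <- vapply(names(msgs), function(n) {
      paste0("  ", n, ":\n  .... ", paste(msgs[[n]], collapse = "; "))
    }, character(1))
    warning("The following warnings were found...\n",
            paste(txt, collapse = "\n"), call. = FALSE)
  }
  df
}

# Toy parser standing in for guess_dates(): warns when a value is not ISO.
parse_iso <- function(x) {
  d <- as.Date(x, format = "%Y-%m-%d")
  if (anyNA(d)) warning(sum(is.na(d)), " value(s) could not be parsed")
  d
}

df <- data.frame(
  date_onset  = c("2019-08-08", "not a date"),
  date_report = c("2019-08-09", "2019-08-10"),
  stringsAsFactors = FALSE
)

# Capture the combined warning text so it can be inspected.
msg <- tryCatch(with_column_warnings(df, parse_iso),
                warning = function(w) conditionMessage(w))
cat(msg, "\n")
```

Only `date_onset` appears in the combined message, because `date_report` parses cleanly. Note that `tryCatch()` is used here just to show the message; it discards the cleaned data frame, so normal use would call the wrapper directly.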

zkamvar avatar Oct 28 '19 10:10 zkamvar

I am getting warnings that look like they may not be appropriate. Example below:

dates <- c("18_03_2020", "19_03_2020", "20_03_2020", "21_03_2020", "22_03_2020", 
"23_03_2020", "24_03_2020", "25_03_2020", "26_03_2020", "27_03_2020", 
"28_03_2020", "29_03_2020", "30_03_2020", "31_03_2020", "01_04_2020", 
"02_04_2020", "03_04_2020", "04_04_2020", "05_04_2020", "06_04_2020", 
"07_04_2020", "08_04_2020")

res <- linelist::guess_dates(dates)

This gives the following warning:

Warning message:
In linelist::guess_dates(dates) : 
The following 4 dates were not in the correct timeframe (1970-04-10 -- 2020-04-10):

  original    |  parsed    
  --------    |  ------    
  05_04_2020  |  2020-05-04
  06_04_2020  |  2020-06-04
  07_04_2020  |  2020-07-04
  08_04_2020  |  2020-08-04

This would suggest the conversion did not go as planned, but that is not actually the case:

> res
 [1] "2020-03-18" "2020-03-19" "2020-03-20" "2020-03-21" "2020-03-22"
 [6] "2020-03-23" "2020-03-24" "2020-03-25" "2020-03-26" "2020-03-27"
[11] "2020-03-28" "2020-03-29" "2020-03-30" "2020-03-31" "2020-04-01"
[16] "2020-04-02" "2020-04-03" "2020-04-04" "2020-04-05" "2020-04-06"
[21] "2020-04-07" "2020-04-08"
> range(res)
[1] "2020-03-18" "2020-04-08"

thibautjombart avatar Apr 10 '20 10:04 thibautjombart

The warnings come from the fact that guess_dates() tries out both the "mdy" and "dmy" versions of the dates. If you only expect dmy versions of dates, then set orders = "dmy".
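
As a rough base R illustration of the difference (not linelist's own code, and assuming the `_` separator used in the example above), the four flagged strings parse to in-range dates under DMY and to the out-of-range dates from the warning table under MDY:

```r
dates <- c("05_04_2020", "06_04_2020", "07_04_2020", "08_04_2020")

# Day-month-year reading: 5 to 8 April 2020.
as_dmy <- as.Date(dates, format = "%d_%m_%Y")

# Month-day-year reading: the 4th of May, June, July and August 2020.
as_mdy <- as.Date(dates, format = "%m_%d_%Y")

# End of the window reported in the warning.
last_date <- as.Date("2020-04-10")

all(as_dmy <= last_date)  # every DMY reading is inside the window
all(as_mdy > last_date)   # every MDY reading is beyond it
```

Restricting the parser to a single order removes the second set of candidates, so there is nothing left to warn about.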

zkamvar avatar Apr 12 '20 22:04 zkamvar