lubridate
lubridate copied to clipboard
mdy seems to behave incorrectly when passed spelled out numbers.
mdy behaves incorrectly when passed spelled out numbers.
example: this works
mdy("June 3rd, 1972") [1] "1972-06-03"
this breaks
mdy("June Third, 1972") [1] "1972-06-19"
sessionInfo() R version 3.5.0 (2018-04-23) Platform: x86_64-apple-darwin15.6.0 (64-bit) Running under: macOS High Sierra 10.13.4
Matrix products: default BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages: [1] stats graphics grDevices utils datasets methods base
other attached packages: [1] lubridate_1.7.4 bindrcpp_0.2.2 dbplyr_1.2.1 forcats_0.3.0 stringr_1.3.1 dplyr_0.7.5
[7] purrr_0.2.5 readr_1.1.1 tidyr_0.8.1 tibble_1.4.2 ggplot2_2.2.1 tidyverse_1.2.1
loaded via a namespace (and not attached): [1] tidyselect_0.2.4 reshape2_1.4.3 haven_1.1.1 lattice_0.20-35 colorspace_1.3-2 [6] htmltools_0.3.6 yaml_2.1.19 utf8_1.1.4 rlang_0.2.1 pillar_1.2.3
[11] foreign_0.8-70 glue_1.2.0 DBI_1.0.0 modelr_0.1.2 readxl_1.1.0
[16] bindr_0.1.1 plyr_1.8.4 munsell_0.5.0 gtable_0.2.0 cellranger_1.1.0 [21] rvest_0.3.2 evaluate_0.10.1 psych_1.8.4 labeling_0.3 knitr_1.20
[26] parallel_3.5.0 broom_0.4.4 Rcpp_0.12.17 backports_1.1.2 scales_0.5.0
[31] jsonlite_1.5 mnormt_1.5-5 hms_0.4.2 digest_0.6.15 stringi_1.2.2
[36] rprojroot_1.3-2 grid_3.5.0 cli_1.0.0 tools_3.5.0 magrittr_1.5
[41] lazyeval_0.2.1 crayon_1.3.4 pkgconfig_2.0.1 xml2_1.2.0 assertthat_0.2.0 [46] rmarkdown_1.10 httr_1.3.1 rstudioapi_0.7 R6_2.2.2 nlme_3.1-137
[51] compiler_3.5.0
Indeed, it's not supported and should create NAs.
Further description of this issue: It appears that parse_date_time parses "Third" as a string which is ignored by sending the format "%B Third %d%y" to strptime; this results in the first two digits of 1972 being interpreted as the day and the last two as a two digit year. This causes completely wrong output when the two digit year is different than the four digit year as in:
mdy("June Third 1918") 1 parsed with %B Third %d%y [1] "2018-06-19"
When passed a two digit year it correctly returns NA:
mdy("June 3rd, 72") 1 parsed with %B %drd, %y [1] "1972-06-03" mdy("June third, 72") [1] NA Warning message: All formats failed to parse. No formats found.
The same behavior is what allows the following correct result:
mdy("foo June 4th bar 1975") 1 parsed with foo %B %dth bar %Y [1] "1975-06-04"
The lack of NA being passed here seems to stem from the fact that the four digit year can be interpreted ambiguously as either a four digit year or a set of two digit day and year. When the string "Third" is ignored the remaining input is parsed incorrectly as a valid input containing month day and year.
All solutions I can think of cause some loss of functionality -- if mixing delimiters was disallowed the input would be properly categorized as invalid but valid inputs such as
mdy("06/03 1969") 1 parsed with %Om/%d %Y [1] "1969-06-03"
would no longer function. Similarly creating NAs when there are elements which cannot be parsed by strptime would disable using things like 3rd and 4th as in
mdy("june 3rd 1969") 1 parsed with %B %drd %Y [1] "1969-06-03"
@vspinu any thoughts on this? I think @spencerlangerman and I are both willing to put in a little time to clean this up a bit, but don't want to start doing things unless there is consensus about what should be done.
Does one of @spencerlangerman 's solutions seem preferable to you? Do you see a third way?
(unless people think this is too trivial of an edge case in which case maybe it is worth a note in the docs about this behavior?)
The lack of NA being passed here seems to stem from the fact that the four digit year can be interpreted ambiguously as either a four digit year or a set of two digit day and year.
Indeed. There is hardly anything that could be done here without losing functionality.
On the lubridate rewrite (extraction of the parser in particular) I will tighten some of the restrictions to avoid such issues.
For now you will have to explicitly use parse_date_time
in order to avoid such issues:
> parse_date_time("June Third, 1972", "BdY")
[1] NA
Warning message:
All formats failed to parse. No formats found.
> parse_date_time2("June Third, 1972", "BdY")
[1] NA
if mixing delimiters was disallowed the input
Indeed. Most likely I will restrict same type delimiter within YMD and HMS parts (including the "empty delimiter" which is essentially the cause of this issue).
I think the basic problem is this:
library(lubridate, warn.conflicts = FALSE)
guess_formats("June Third, 1972", "mdy")
#> Omdy mdy
#> "%Om Third, %d%y" "%B Third, %d%y"
Created on 2019-11-19 by the reprex package (v0.3.0)
My suspicion is that allowing %d%y
without delimiter (except for %m%d%y
) is probably a bit too dangerous in general, but it's hard to know how adding restrictions here would affect code in the wild.