lubridate icon indicating copy to clipboard operation
lubridate copied to clipboard

mdy seems to behave incorrectly when passed spelled out numbers.

Open JackLangerman opened this issue 6 years ago • 5 comments

mdy behaves incorrectly when passed spelled out numbers.

example: this works

mdy("June 3rd, 1972") [1] "1972-06-03"

this breaks

mdy("June Third, 1972") [1] "1972-06-19"

If this is not supported it seems like it should probably throw a warning unless I'm missing something.

sessionInfo() R version 3.5.0 (2018-04-23) Platform: x86_64-apple-darwin15.6.0 (64-bit) Running under: macOS High Sierra 10.13.4

Matrix products: default BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages: [1] stats graphics grDevices utils datasets methods base

other attached packages: [1] lubridate_1.7.4 bindrcpp_0.2.2 dbplyr_1.2.1 forcats_0.3.0 stringr_1.3.1 dplyr_0.7.5
[7] purrr_0.2.5 readr_1.1.1 tidyr_0.8.1 tibble_1.4.2 ggplot2_2.2.1 tidyverse_1.2.1

loaded via a namespace (and not attached): [1] tidyselect_0.2.4 reshape2_1.4.3 haven_1.1.1 lattice_0.20-35 colorspace_1.3-2 [6] htmltools_0.3.6 yaml_2.1.19 utf8_1.1.4 rlang_0.2.1 pillar_1.2.3
[11] foreign_0.8-70 glue_1.2.0 DBI_1.0.0 modelr_0.1.2 readxl_1.1.0
[16] bindr_0.1.1 plyr_1.8.4 munsell_0.5.0 gtable_0.2.0 cellranger_1.1.0 [21] rvest_0.3.2 evaluate_0.10.1 psych_1.8.4 labeling_0.3 knitr_1.20
[26] parallel_3.5.0 broom_0.4.4 Rcpp_0.12.17 backports_1.1.2 scales_0.5.0
[31] jsonlite_1.5 mnormt_1.5-5 hms_0.4.2 digest_0.6.15 stringi_1.2.2
[36] rprojroot_1.3-2 grid_3.5.0 cli_1.0.0 tools_3.5.0 magrittr_1.5
[41] lazyeval_0.2.1 crayon_1.3.4 pkgconfig_2.0.1 xml2_1.2.0 assertthat_0.2.0 [46] rmarkdown_1.10 httr_1.3.1 rstudioapi_0.7 R6_2.2.2 nlme_3.1-137
[51] compiler_3.5.0

JackLangerman avatar Jun 14 '18 19:06 JackLangerman

Indeed, it's not supported and should create NAs.

vspinu avatar Jun 14 '18 21:06 vspinu

Further description of this issue: It appears that parse_date_time parses "Third" as a string which is ignored by sending the format "%B Third %d%y" to strptime; this results in the first two digits of 1972 being interpreted as the day and the last two as a two digit year. This causes completely wrong output when the two digit year is different than the four digit year as in:

mdy("June Third 1918") 1 parsed with %B Third %d%y [1] "2018-06-19"

When passed a two digit year it correctly returns NA:

mdy("June 3rd, 72") 1 parsed with %B %drd, %y [1] "1972-06-03" mdy("June third, 72") [1] NA Warning message: All formats failed to parse. No formats found.

The same behavior is what allows the following correct result:

mdy("foo June 4th bar 1975") 1 parsed with foo %B %dth bar %Y [1] "1975-06-04"

The lack of NA being passed here seems to stem from the fact that the four digit year can be interpreted ambiguously as either a four digit year or a set of two digit day and year. When the string "Third" is ignored the remaining input is parsed incorrectly as a valid input containing month day and year.

All solutions I can think of cause some loss of functionality -- if mixing delimiters was disallowed the input would be properly categorized as invalid but valid inputs such as

mdy("06/03 1969") 1 parsed with %Om/%d %Y [1] "1969-06-03"

would no longer function. Similarly creating NAs when there are elements which cannot be parsed by strptime would disable using things like 3rd and 4th as in

mdy("june 3rd 1969") 1 parsed with %B %drd %Y [1] "1969-06-03"

spencerlangerman avatar Jun 15 '18 01:06 spencerlangerman

@vspinu any thoughts on this? I think @spencerlangerman and I are both willing to put in a little time to clean this up a bit, but don't want to start doing things unless there is consensus about what should be done.

Does one of @spencerlangerman 's solutions seem preferable to you? Do you see a third way?

(unless people think this is too trivial of an edge case in which case maybe it is worth a note in the docs about this behavior?)

JackLangerman avatar Jun 20 '18 18:06 JackLangerman

The lack of NA being passed here seems to stem from the fact that the four digit year can be interpreted ambiguously as either a four digit year or a set of two digit day and year.

Indeed. There is hardly anything that could be done here without losing functionality.

On the lubridate rewrite (extraction of the parser in particular) I will tighten some of the restrictions to avoid such issues.

For now you will have to explicitly use parse_date_time in order to avoid such issues:

> parse_date_time("June Third, 1972", "BdY")
[1] NA
Warning message:
All formats failed to parse. No formats found. 
> parse_date_time2("June Third, 1972", "BdY")
[1] NA

if mixing delimiters was disallowed the input

Indeed. Most likely I will restrict same type delimiter within YMD and HMS parts (including the "empty delimiter" which is essentially the cause of this issue).

vspinu avatar Jun 20 '18 19:06 vspinu

I think the basic problem is this:

library(lubridate, warn.conflicts = FALSE)
guess_formats("June Third, 1972", "mdy")
#>              Omdy               mdy 
#> "%Om Third, %d%y"  "%B Third, %d%y"

Created on 2019-11-19 by the reprex package (v0.3.0)

My suspicion is that allowing %d%y without delimiter (except for %m%d%y) is probably a bit too dangerous in general, but it's hard to know how adding restrictions here would affect code in the wild.

hadley avatar Nov 19 '19 22:11 hadley