parttime does not handle single digit month and year

Open AmyMikhail opened this issue 1 year ago • 1 comments

I have a data set in which there are several date columns, and I need to use one (should be complete) for my analysis. Where the dates are completely missing in column 1, I try column 2 - if missing in column 2 I try column 3 etc until at the end I have a column of complete dates, using the first date available in order of preference.

I am using your package because one of my date columns (second last in order of preference) sometimes has incomplete dates, e.g. just a year or year and month. I need to impute a complete date (year, month, day) from these incomplete dates and then convert them to date, before considering them along with the others in order of preference.

To make matters even more complicated, some complete and incomplete dates have been written with a single digit for the month or day (leading 0 is missing). The one thing in my favour is that the order is always the same (i.e. always year, then month, then day, separated by a dash - so if it is the number or set of numbers after the first dash I know it is the month).

Suppose I have the following vector of dates:

# Character vector of dates including some incomplete and some single digit dates:
x <- c("2019", "2023-10", "2024-09-15", "2024-7-3", "2022-5")

Now if I try to impute the date with parttime::impute_date_min() it will parse the first three but not the last two:

# Impute date for partial dates:
y <- parttime::impute_date_min(x)
Warning message:
In vec_cast.partial_time.character(x, pttm, ..., format = format,  :
  Values could not be parsed (2 of 5 (40.0%)). Examples of unique failing formats:

    '2024-7-3'  '2022-5'    

# Check what has happened:
> y
<partial_time<YMDhms+tz>[5]> 
[1] "2019-01-01" "2023-10-01" "2024-09-15" NA           NA   

If I use lubridate::ymd() it will correctly parse the complete dates with single digits but not the dates missing a whole element:

# Convert to date format with lubridate:
z <- lubridate::ymd(x)
Warning message:
 3 failed to parse. 

# Check what has happened:
> z
[1] NA           NA           "2024-09-15" "2024-07-03" NA  

I know that I could split the problematic date column into three (one each for year, month and day) then pad the single digit values with leading 0s, paste them back together as shown in this Posit post and only then run them through parttime::impute_date_min().
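A minimal sketch of that padding workaround in base R, assuming the fields are always numeric, in year-month-day order, and separated by dashes. The helper name `pad_ymd` is hypothetical and not part of parttime; it uses a single regular expression instead of splitting the column into three.

```r
# Hypothetical helper (not part of parttime): zero-pad single-digit month
# and day fields in dash-separated year-month-day strings.
pad_ymd <- function(x) {
  # Match a dash followed by exactly one digit, where that digit is followed
  # by another dash or the end of the string, and insert a leading zero.
  gsub("-(\\d)(?=-|$)", "-0\\1", x, perl = TRUE)
}

x <- c("2019", "2023-10", "2024-09-15", "2024-7-3", "2022-5")
padded <- pad_ymd(x)
print(padded)
# [1] "2019"       "2023-10"    "2024-09-15" "2024-07-03" "2022-05"

# The padded strings can then be passed on, for example:
# y <- parttime::impute_date_min(padded)
```

Year-only and already padded values pass through unchanged, so the helper can be applied to the whole column before imputation.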

My question is: would it be possible to update the parttime package to handle these single digit cases, where the order of the date elements is already known?

AmyMikhail, Dec 03 '24 17:12

Thanks for taking the time to report this @AmyMikhail!

There are a couple of things going on here. To break down the steps a bit, the call amounts to impute_time(as.parttime(x)), where the parsing of the string is governed by the format argument of as.parttime, which defaults to a parser for the ISO8601 standard.

Unfortunately, YYYY-M(M?)-D(D?) is not part of the ISO8601 standard so this would require a more relaxed parsing format. For the immediate future, I think the solution you linked to will probably be the most actionable way of working around the input data.

In the future, if this is repeatedly a sticking point, I think this is a compelling use case for a more relaxed parser, though I'd want to keep ISO8601 as the default. It could look something like: impute_time_min(as.parttime(x, format = parse_ymd))
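To illustrate what "more relaxed" could mean here, the following base R sketch extracts year, month and day fields from strings shaped like YYYY, YYYY-M(M) or YYYY-M(M)-D(D). It is only an assumption about the matching logic: the function name `relaxed_ymd_fields` is hypothetical, and it does not implement whatever interface the format argument of as.parttime actually expects.

```r
# Hypothetical sketch (not parttime's API): pull year, month and day out of
# dash-separated strings, allowing one- or two-digit months and days and
# allowing the month and/or day to be missing entirely.
relaxed_ymd_fields <- function(x) {
  pattern <- "^([0-9]{4})(-([0-9]{1,2}))?(-([0-9]{1,2}))?$"
  matches <- regmatches(x, regexec(pattern, x))
  fields <- vapply(matches, function(m) {
    # Strings that do not match the pattern yield no capture groups.
    if (length(m) < 6) return(rep(NA_integer_, 3))
    # Groups 1, 3 and 5 hold year, month and day; a missing field is ""
    # and becomes NA when converted to integer.
    as.integer(m[c(2, 4, 6)])
  }, integer(3))
  out <- t(fields)
  colnames(out) <- c("year", "month", "day")
  out
}

x <- c("2019", "2023-10", "2024-09-15", "2024-7-3", "2022-5")
print(relaxed_ymd_fields(x))
```

Missing fields come back as NA, which is the information an imputation step would need in order to fill them in.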

dgkf, Dec 03 '24 20:12