validate icon indicating copy to clipboard operation
validate copied to clipboard

Checking date format while accounting for missing values

Open luenhchang opened this issue 5 years ago • 1 comments

Hi Mark, Thank you so much for creating this package. I am really loving it.

I'd like to validate my date columns and check if they are all in a consistent date format, say %Y-%m-%d, while checking for missingness. I think it is probably more understandable to explain what I want with three if statements: (1) If a value is in that format, it will add 1 to the passes column. (2) If a value is in another format, it will add 1 to the fails column. (3) If a missing value is found, it will add 1 to the nNA column.

I've tried is_complete(as.Date(date), format= "%Y-%m-%d") but unfortunately I don't know how to get the validator() to account for NA. I think zero in nNA is because of the effect of is_complete() checking only those that are not missing values. I got a syntax error when I removed the function is_complete(). How do I tweak my rule and get something in the nNA column?

Here is my code and output

# Create some data
t <- data.frame( ID=c("0010","0011","001",NA,"11115")
                ,date1=c("2008-11-01", "2008-11-02",NA,"2008-11-03","04-11-2008")
                ,date2=c("2009-11-01", "2009-11-02","2009-11-03",NA,"04-11-2009")
                ,DOB=c("2000-11-01", "2001-11-02",NA,"2002-11-03","04-11-2007")
                ,stringsAsFactors = FALSE)

# Change character columns to datetime
t$date1 <- as.POSIXct(t$date1, format= "%Y-%m-%d")
t$date2 <- as.POSIXct(t$date2, format= "%Y-%m-%d")
t$DOB<- as.POSIXct(t$DOB, format= "%Y-%m-%d")

# Write validation rules
rules <- validator(check.field.length.ID= nchar(ID)==4
                   ,check.format.date1=is_complete(as.Date(date1), format= "%Y-%m-%d")
                   ,check.date2.GTE.date1= date2 > date1)
# Check data against the validation rules
out <- confront(dat=t, x=rules)
summary(out)
#                    name items passes fails nNA error warning                                       expression
# 1 check.field.length.ID     5      2     2   1 FALSE   FALSE                                   nchar(ID) == 4
# 2    check.format.date1     5      4     1   0 FALSE   FALSE is_complete(as.Date(date1), format = "%Y-%m-%d")
# 3 check.date2.GTE.date1     5      2     1   2 FALSE   FALSE                                    date2 > date1

Cheers, Chang

luenhchang avatar Nov 19 '20 03:11 luenhchang

Dear Chang,

Thank you for your interest in validate. The question you ask is a bit different from how R 'thinks'. You can either ask: is a column formatted according to "%Y-%m-%d", or you can ask R where a column has NA's.

If you check for a certain format, a field with NA will always return FALSE, because it is not formatted correctly. So there is no simple rule to do this. That's the bad news. The good news is, that you can choose to write your own function that does have this behavior.

check_format <- function(x){ ifelse(is.na(x), NA, !is.na(as.Date(x,"%Y-%m-%d"))) }

rules <- validator(check_format(date1) == TRUE) out <- confront(t, rules)

summary(out) name items passes fails nNA error warning expression 1 V1 5 4 0 1 FALSE FALSE check_format(date1) == TRUE

Note that you have to do check_format(date1) == TRUE, because validate does not know that check_format is a function that returns TRUE/FALSE/NA. The == TRUE part solves this.

Hope this helps, Mark

ps: thank you for creating a simple example using only base R. This makes it a lot easier for me to answer questions!

On Thu, Nov 19, 2020 at 4:20 AM Lun-Hsien Chang [email protected] wrote:

Hi Mark, Thank you so much for creating this package. I am really loving it.

I'd like to validate my date columns and check if they are all in a consistent date format, say %Y-%m-%d, while checking for missingness. I think it is probably more understandable to explain what I want with three if statements: (1) If a value is in that format, it will add 1 to the passes column. (2) If a value is in another format, it will add 1 to the fails column. (3) If a missing value is found, it will add 1 to the nNA column.

I've tried is_complete(as.Date(date), format= "%Y-%m-%d") but unfortunately I don't know how to get the validator() to account for NA. I think zero in nNA is because of the effect of is_complete() checking only those that are not missing values. I got a syntax error when I removed the function is_complete(). How do I tweak my rule and get something in the nNA column?

Here is my code and output

Create some data

t <- data.frame( ID=c("0010","0011","001",NA,"11115") ,date1=c("2008-11-01", "2008-11-02",NA,"2008-11-03","04-11-2008") ,date2=c("2009-11-01", "2009-11-02","2009-11-03",NA,"04-11-2009") ,DOB=c("2000-11-01", "2001-11-02",NA,"2002-11-03","04-11-2007") ,stringsAsFactors = FALSE)

Change character columns to datetime

t$date1 <- as.POSIXct(t$date1, format= "%Y-%m-%d") t$date2 <- as.POSIXct(t$date2, format= "%Y-%m-%d") t$DOB<- as.POSIXct(t$DOB, format= "%Y-%m-%d")

Write validation rules

rules <- validator(check.field.length.ID= nchar(ID)==4 ,check.format.date1=is_complete(as.Date(date1), format= "%Y-%m-%d") ,check.date2.GTE.date1= date2 > date1)

Check data against the validation rules

out <- confront(dat=t, x=rules) summary(out)

name items passes fails nNA error warning expression

1 check.field.length.ID 5 2 2 1 FALSE FALSE nchar(ID) == 4

2 check.format.date1 5 4 1 0 FALSE FALSE is_complete(as.Date(date1), format = "%Y-%m-%d")

3 check.date2.GTE.date1 5 2 1 2 FALSE FALSE date2 > date1

Cheers, Chang

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/data-cleaning/validate/issues/118, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAOWSAXACVAW6DAE7QZBHT3SQSFHNANCNFSM4T22XGKA .

markvanderloo avatar Nov 19 '20 21:11 markvanderloo