Datetime parsing randomly (but rarely) fails when using multiple threads and latin1 encoding
Thank you for the lovely package. When vroom parses a file containing datetime values using the latin1 encoding and more than one thread, it occasionally (at random, and very rarely) reports that certain times are not formatted as expected.
I have tried to make this example minimal, but because it isn't deterministic, I have had to guess at the size of data and number of replications needed to consistently generate at least one error. Below is code for the bug reproduction.
# Create test file.
times <-
c("31JAN2015:18:47:49", "31JAN2015:19:35:09", "31JAN2015:21:10:28",
"31JAN2015:20:02:19", "31JAN2015:18:04:39", "31JAN2015:19:58:32",
"31JAN2015:18:07:25", "31JAN2015:18:30:29", "31JAN2015:19:54:57",
"31JAN2015:20:17:13", "31JAN2015:19:44:46", "31JAN2015:20:30:18",
"31JAN2015:20:01:47", "31JAN2015:20:35:36", "31JAN2015:20:21:47",
"31JAN2015:18:39:52", "31JAN2015:20:51:51", "31JAN2015:21:26:30",
"31JAN2015:21:27:06", "31JAN2015:20:07:45", "31JAN2015:22:02:21",
"31JAN2015:20:35:48", "31JAN2015:20:23:30", "31JAN2015:21:10:12",
"31JAN2015:22:05:21", "31JAN2015:20:26:31", "31JAN2015:22:16:10",
"31JAN2015:22:11:14", "01FEB2015:01:08:45")
file <- tempfile()
write.csv(data.frame(a = times),
          file,
          row.names = FALSE,
          fileEncoding = "latin1")
library(vroom)
probs <- function() {
  test <-
    vroom::vroom(file,
                 delim = ";", # Can be anything not in times.
                 progress = FALSE,
                 num_threads = 2, # Anything greater than 1
                 locale = locale(
                   encoding = "latin1" # Necessary for bug repro
                 ),
                 col_types = cols(
                   a = col_datetime(format = "%d%b%Y:%H:%M:%OS")
                 )
    )
  problems(test)
}
# Read test file 5000 times.
first <- replicate(5000, probs(), simplify = FALSE)
# Display all reads with problems.
first[sapply(first, nrow) > 0]
I would expect this code not to fail on any read. Even if there were an error, I would expect it to be the same error every time. But on every machine I have tested, some reads fail on random rows, like:
[[1]]
# A tibble: 1 x 5
row col expected actual file
<int> <int> <chr> <chr> <chr>
1 6 1 date like %d%b%Y:%H:%M:%OS 31JAN2015:18:04:39 -
[[2]]
# A tibble: 1 x 5
row col expected actual file
<int> <int> <chr> <chr> <chr>
1 8 1 date like %d%b%Y:%H:%M:%OS 31JAN2015:18:07:25 -
[[3]]
# A tibble: 1 x 5
row col expected actual file
<int> <int> <chr> <chr> <chr>
1 9 1 date like %d%b%Y:%H:%M:%OS 31JAN2015:18:30:29 -
I have recreated this issue on Windows and Linux with vroom 1.5.7, with R version 4.1.3. I have also recreated this issue with the development version of vroom (1.6.0.9000). I also tested on R 3.6.3 on Linux.
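To help isolate the thread count as the trigger, here is a minimal sketch that counts how many reads report at least one problem for a given `num_threads`. The helper name `count_problem_reads` is hypothetical, the shortened `times` vector is a stand-in for the full data above, and the idea that single-threaded reads stay clean is an assumption taken from this report, not something verified here.

```r
# Hypothetical helper: number of reads (out of n) that report a parsing problem.
count_problem_reads <- function(file, threads, n = 1000) {
  one_read <- function() {
    test <- vroom::vroom(
      file,
      delim = ",",
      progress = FALSE,
      num_threads = threads,
      locale = vroom::locale(encoding = "latin1"),
      col_types = vroom::cols(
        a = vroom::col_datetime(format = "%d%b%Y:%H:%M:%OS")
      )
    )
    nrow(vroom::problems(test))
  }
  sum(suppressWarnings(replicate(n, one_read())) > 0)
}

# Stand-in data (assumption): 29 rows, like the reprex, but fewer distinct values.
times <- rep(c("31JAN2015:18:47:49", "01FEB2015:01:08:45"), length.out = 29)
file <- tempfile()
write.csv(data.frame(a = times), file, row.names = FALSE, fileEncoding = "latin1")

# Only run the comparison when vroom is installed.
counts <- if (requireNamespace("vroom", quietly = TRUE)) {
  c(single = count_problem_reads(file, threads = 1, n = 200),
    multi  = count_problem_reads(file, threads = 2, n = 200))
} else {
  NULL
}
counts
```

Because the failure is non-deterministic, a zero count for `multi` on a given run does not show the bug is absent; increasing `n` makes a hit more likely.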
I see this too. Slightly tweaked reprex below:
library(vroom)
times <- c("31JAN2015:18:47:49", "31JAN2015:19:35:09", "31JAN2015:21:10:28", "31JAN2015:20:02:19", "31JAN2015:18:04:39", "31JAN2015:19:58:32", "31JAN2015:18:07:25", "31JAN2015:18:30:29", "31JAN2015:19:54:57", "31JAN2015:20:17:13", "31JAN2015:19:44:46", "31JAN2015:20:30:18", "31JAN2015:20:01:47", "31JAN2015:20:35:36", "31JAN2015:20:21:47", "31JAN2015:18:39:52", "31JAN2015:20:51:51", "31JAN2015:21:26:30", "31JAN2015:21:27:06", "31JAN2015:20:07:45", "31JAN2015:22:02:21", "31JAN2015:20:35:48", "31JAN2015:20:23:30", "31JAN2015:21:10:12", "31JAN2015:22:05:21", "31JAN2015:20:26:31", "31JAN2015:22:16:10", "31JAN2015:22:11:14", "01FEB2015:01:08:45")
file <- tempfile()
write.csv(data.frame(a = times), file, row.names = FALSE, fileEncoding = "latin1")
probs <- function() {
  test <- vroom(
    file,
    delim = ",",
    progress = FALSE,
    num_threads = 2,
    locale = locale(encoding = "latin1"),
    col_types = cols(a = col_datetime(format = "%d%b%Y:%H:%M:%OS"))
  )
  problems(test)
}
first <- suppressWarnings(replicate(1000, probs(), simplify = FALSE))
dplyr::bind_rows(first, .id = "id")
#> # A tibble: 14 × 6
#> id row col expected actual file
#> <chr> <int> <int> <chr> <chr> <chr>
#> 1 79 30 1 date like %d%b%Y:%H:%M:%OS 01FEB2015:01:08:45 /private/tmp…
#> 2 85 30 1 date like %d%b%Y:%H:%M:%OS 01FEB2015:01:08:45 /private/tmp…
#> 3 132 30 1 date like %d%b%Y:%H:%M:%OS 01FEB2015:01:08:45 /private/tmp…
#> 4 133 30 1 date like %d%b%Y:%H:%M:%OS 01FEB2015:01:08:45 /private/tmp…
#> 5 243 30 1 date like %d%b%Y:%H:%M:%OS 01FEB2015:01:08:45 /private/tmp…
#> 6 459 30 1 date like %d%b%Y:%H:%M:%OS 01FEB2015:01:08:45 /private/tmp…
#> 7 470 13 1 date like %d%b%Y:%H:%M:%OS 31JAN2015:20:30:18 /private/tmp…
#> 8 552 30 1 date like %d%b%Y:%H:%M:%OS 01FEB2015:01:08:45 /private/tmp…
#> 9 592 30 1 date like %d%b%Y:%H:%M:%OS 01FEB2015:01:08:45 /private/tmp…
#> 10 680 30 1 date like %d%b%Y:%H:%M:%OS 01FEB2015:01:08:45 /private/tmp…
#> 11 706 30 1 date like %d%b%Y:%H:%M:%OS 01FEB2015:01:08:45 /private/tmp…
#> 12 747 30 1 date like %d%b%Y:%H:%M:%OS 01FEB2015:01:08:45 /private/tmp…
#> 13 866 30 1 date like %d%b%Y:%H:%M:%OS 01FEB2015:01:08:45 /private/tmp…
#> 14 881 30 1 date like %d%b%Y:%H:%M:%OS 01FEB2015:01:08:45 /private/tmp…
Created on 2023-08-01 with reprex v2.0.2
It's weird that the encoding is important for the reprex, given that it's a pure ASCII file.
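The pure-ASCII claim can be checked directly with base R by reading the file's raw bytes and confirming that none is 0x80 or above. This is only a sketch: the two-element `times` vector is a shortened stand-in for the data in the reprex, and the variable names `bytes` and `is_ascii` are illustrative.

```r
# Stand-in data (assumption): two of the values from the reprex.
times <- c("31JAN2015:18:47:49", "01FEB2015:01:08:45")
file <- tempfile()
write.csv(data.frame(a = times), file, row.names = FALSE, fileEncoding = "latin1")

# Read the whole file as raw bytes and test that every byte is below 128.
bytes <- readBin(file, what = "raw", n = file.size(file))
is_ascii <- all(as.integer(bytes) < 128L)
is_ascii
```

If `is_ascii` is `TRUE`, the bytes on disk are identical under latin1 and UTF-8, so the file contents alone would not explain the encoding dependence.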