Lines read and skip lines use different evaluation in read_lines

Open pepijn-devries opened this issue 1 year ago • 8 comments

Thanks for your work on readr! It's most helpful, but I came across the following problem.

I have an ASCII file that is too large to load entirely into memory. I therefore use read_lines to read it in chunks via the skip and n_max arguments, process each chunk, and write the results to a file. One specific line turned up twice in the output. At first I assumed this was an error in the ASCII file, but after some testing it turned out that read_lines had read the same line twice.

It turns out that the skip argument counts the lines to be skipped differently from the way the actual reading algorithm counts lines. I've prepared the following reprex by simplifying my case:

First, prepare a text file with some nasty characters (one line for every byte value from 1 to 255):

library(readr)

dummy_text <-
  data.frame(
    a = sprintf("%03i", 1:255),                   ## add as identifier at the beginning of the line
    b = unlist(lapply(as.raw(1:255), rawToChar)), ## generate some nasty UTF8 characters
    c = "\n"                                      ## add line end
  )

## paste all lines together:
dummy_text <- paste(apply(as.matrix(dummy_text), 1, paste, collapse = ""), collapse = "")

## save as a temp file
dummy_file <- tempfile(fileext = ".txt")
writeLines(dummy_text, dummy_file)
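
As a side check that is not part of the original reprex, the sketch below uses base R only to list which identifiers in the generated text carry a byte that is itself a line terminator. The variable names (`bytes`, `ids`, `terminator_ids`) are introduced here purely for illustration.

```r
# Same byte range as the "b" column in the reprex above.
bytes <- as.raw(1:255)

# Positions of the two bytes that can act as line terminators.
lf <- which(bytes == as.raw(0x0a))  # line feed, "\n"
cr <- which(bytes == as.raw(0x0d))  # carriage return, "\r"

# Identifiers (the "a" column) of the lines that contain those bytes.
ids <- sprintf("%03i", 1:255)
terminator_ids <- c(LF = ids[lf], CR = ids[cr])
print(terminator_ids)
```

So the line with identifier 010 contains an extra line feed and the line with identifier 013 contains a carriage return, which is worth keeping in mind when interpreting the line counts below.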

Next, let's read from the file, 5 lines at a time:

chunk_size <- 5
lines_read <- 0
result <- character(0)

repeat {
  lines <- read_lines(dummy_file, skip = lines_read, n_max = chunk_size)
  print(problems(lines))
  if (length(lines) == 0) break
  lines_read <- lines_read + length(lines)
  result <- c(result, lines)
}

Next, try to extract the first three characters from each line, which I had added above as a unique identifier for each line:

result_check <- unlist(lapply(result, function(x) tryCatch({substr(x, 1, 3)}, error = function(e) NULL)))
duplicated(result_check[result_check != ""])

It turns out that the line starting with 014 is read twice. I suspect that "\U000d" (carriage return) is treated as a line break while reading the file, but not when counting the number of lines to be skipped. This causes the same line to be read twice. Is this intended (in which case it should be documented), or not (in which case, can it be fixed)?
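
For anyone hitting the same problem, one possible way to sidestep the skip count altogether is to read through an open connection in base R, so each call continues where the previous one stopped. This is only a sketch: `read_in_chunks` is a hypothetical helper, not part of readr, and the demo file is an assumption made for illustration. Note that `readLines` has its own rules for line endings, so line boundaries may not match those of read_lines on files with unusual bytes.

```r
# Hypothetical helper (not part of readr): read a file in chunks through an
# open connection, so no lines ever have to be skipped by count.
read_in_chunks <- function(path, chunk_size = 5) {
  con <- file(path, open = "r")
  on.exit(close(con))
  result <- character(0)
  repeat {
    # Each call resumes at the connection's current position.
    lines <- readLines(con, n = chunk_size, warn = FALSE)
    if (length(lines) == 0) break
    result <- c(result, lines)  # replace with real per-chunk processing
  }
  result
}

# Usage on a small, well-behaved demo file (an assumption for illustration).
demo_file <- tempfile(fileext = ".txt")
writeLines(sprintf("%03i", 1:12), demo_file)
chunked <- read_in_chunks(demo_file, chunk_size = 5)
anyDuplicated(chunked)  # 0 means no line was returned twice
```

Collecting everything in `result` is only done here to make the check possible; for a file that does not fit in memory, each chunk would be processed and written out inside the loop instead.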

This is my sessionInfo()

R version 4.1.1 (2021-08-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=Dutch_Netherlands.1252  LC_CTYPE=Dutch_Netherlands.1252    LC_MONETARY=Dutch_Netherlands.1252
[4] LC_NUMERIC=C                       LC_TIME=Dutch_Netherlands.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] readr_2.1.3

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.7       pillar_1.9.0     dbplyr_2.3.2     cellranger_1.1.0 compiler_4.1.1   tools_4.1.1      digest_0.6.31   
 [8] bit_4.0.4        tibble_3.2.1     jsonlite_1.8.4   evaluate_0.21    RSQLite_2.2.8    memoise_2.0.1    lifecycle_1.0.3 
[15] lattice_0.20-44  pkgconfig_2.0.3  rlang_1.1.0      DBI_1.1.3        cli_3.4.1        rstudioapi_0.13  parallel_4.1.1  
[22] fastmap_1.1.0    withr_2.5.0      dplyr_1.1.2      httr_1.4.6       stringr_1.5.0    xml2_1.3.2       hms_1.1.2       
[29] generics_0.1.3   vctrs_0.6.2      rappdirs_0.3.3   tidyselect_1.2.0 bit64_4.0.5      grid_4.1.1       glue_1.6.2      
[36] R6_2.5.1         fansi_1.0.3      readxl_1.3.1     vroom_1.6.0      tzdb_0.1.2       blob_1.2.3       magrittr_2.0.3  
[43] ellipsis_0.3.2   leaps_3.1        rvest_1.0.1      utf8_1.2.2       stringi_1.7.6    cachem_1.0.6     crayon_1.5.2    

pepijn-devries, Jun 21 '23 07:06