warc Too many open files when mapping more than 509 pages

It has been a great pleasure working with warc; however, I get an error when mapping over a larger number of files. It seems that the connections to the files are not being closed. Please find a minimal reproducible example below:

library(warc)
library(tidyverse)

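# download the Common Crawl example file if it does not exist
# (mustWork = FALSE avoids a warning while the file is still missing)
warc_big <- normalizePath("~/cc.warc.gz", mustWork = FALSE)
if(!file.exists(warc_big)){
  download.file(
    "https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/warc/CC-MAIN-20161202170900-00000-ip-10-31-129-80.ec2.internal.warc.gz",
    warc_big,
    mode = "wb"  # binary transfer, so the .gz is not altered on Windows
  )
}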

# create the index if it does not exist
warc_cdx <- normalizePath("~/cc.cdx", mustWork = FALSE)
if(!file.exists(warc_cdx)){
  create_cdx(
    warc_big,
    cdx_path = warc_cdx
  )
}
  
# read the index and map over the data
cdx <- read_cdx(warc_cdx)

# this works
sites <- map(1:100,
             ~read_warc_entry(file.path(cdx$warc_path[.],
                                        cdx$file_name[.]), 
                              cdx$compressed_arc_file_offset[.]))

# this crashes
sites_large <- map(1:1000,
             ~read_warc_entry(file.path(cdx$warc_path[.],
                                        cdx$file_name[.]), 
                              cdx$compressed_arc_file_offset[.]))

The error I'm receiving is the following:

Using the hard way
7593104
Error in gz_open(wf, "read") : object 'wf' not found

And when I then try to perform other operations in the same session, I get:

> ?read_cdx
Error in gzfile(file, "rb") : cannot open the connection
In addition: Warning message:
  In gzfile(file, "rb") :
  cannot open compressed file 'C:/Program Files/R/R-3.4.1/library/reshape2/Meta/package.rds', probable reason 'Too many open files'
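
In case it helps narrow this down, here is a minimal sketch for finding the first index at which a call starts failing. `first_failure` is a hypothetical helper written for this report, not part of warc. If the failure shows up around entry 510, that would be consistent with the Windows C runtime's default cap of 512 open streams (509 files plus stdin, stdout and stderr), though that is an assumption I have not verified against the package source.

```r
# Hypothetical helper (not part of warc): call f(i) for each index and
# stop at the first one that raises an error.
first_failure <- function(indices, f) {
  for (i in indices) {
    # NULL on success, the error message on failure
    msg <- tryCatch({
      f(i)
      NULL
    }, error = function(e) conditionMessage(e))
    if (!is.null(msg)) {
      return(list(index = i, message = msg))
    }
  }
  NULL  # no call failed
}

# Intended usage with the objects from the example above:
# first_failure(seq_len(1000), function(i) {
#   read_warc_entry(file.path(cdx$warc_path[i], cdx$file_name[i]),
#                   cdx$compressed_arc_file_offset[i])
# })
```

Note that the probe itself leaves the session in the same broken state once the limit is hit, so it is best run in a fresh R session.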

Session info:

> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server >= 2012 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] bindrcpp_0.2    dplyr_0.7.2     purrr_0.2.2.2   readr_1.1.1     tidyr_0.6.3     tibble_1.3.3    ggplot2_2.2.1   tidyverse_1.1.1
[9] warc_0.1.0     

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.12     cellranger_1.1.0 compiler_3.4.1   plyr_1.8.4       bindr_0.1        forcats_0.2.0    tools_3.4.1     
 [8] uuid_0.1-2       lubridate_1.6.0  jsonlite_1.5     nlme_3.1-131     gtable_0.2.0     lattice_0.20-35  pkgconfig_2.0.1 
[15] rlang_0.1.1      psych_1.7.5      parallel_3.4.1   haven_1.1.0      xml2_1.1.1       httr_1.2.1       stringr_1.2.0   
[22] hms_0.3          grid_3.4.1       glue_1.1.1       R6_2.2.2         readxl_1.0.0     foreign_0.8-69   modelr_0.1.0    
[29] reshape2_1.4.2   magrittr_1.5     scales_0.4.1     rvest_0.3.2      assertthat_0.2.0 mnormt_1.5-5     colorspace_1.3-2
[36] stringi_1.1.5    lazyeval_0.2.0   munsell_0.4.3    broom_0.4.2

Thanks in advance

Jul 25 '17 21:07 trotsiuk

I'm having a similar issue -- did you ever find a fix?

Thanks!

Mar 13 '18 15:03 rcitrone

@rcitrone No, and there was no reply from the developers.

Mar 14 '18 13:03 trotsiuk

@trotsiuk @rcitrone @hrbrmstr I also want to get a WARC parser going for R, mostly to use it with Apache Spark. I have a draft extension here https://github.com/javierluraschi/sparkwarc which is more-or-less usable; however, I do like the idea of having a simpler warc package that just parses the gzipped files. For me, I need to use Rcpp to parse files faster than Scala, so the new jwarc project wouldn't work for me.

The only thing I personally need is a read_warc function that loads the WARC file into a data frame. Something as simple as the following would work for me:

entry contents
1     WARC/1.0\nWARC-Type: metadata\nWARC-Date: 2016-12-11T13:54:37Z...
2     WARC/1.0\nWARC-Type: metadata\nWARC-Date: 2016-12-11T14:54:37Z...

So mostly read_warc(path), but ideally I would also like to perform basic filtering, as in read_warc(path, entry_filter, line_filter), so that only the matching WARC entries or the matching lines are returned in the data frame.
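
To make the idea concrete, here is a rough sketch of that interface in plain R. Everything in it is an assumption for illustration: the filter arguments are treated as regular expressions, records are split wherever a line starts with the WARC version marker (a real parser would honour Content-Length instead), and the whole file is read as text, so binary payloads will not survive. I have not checked it against multi-member Common Crawl files, and it is not the Rcpp implementation.

```r
# Hypothetical sketch of read_warc(); not the API of any existing package.
read_warc <- function(path, entry_filter = NULL, line_filter = NULL) {
  con <- gzfile(path, open = "rt")
  on.exit(close(con), add = TRUE)  # always release the file handle
  lines <- readLines(con, warn = FALSE)

  # Assumption: a record starts at each line beginning with "WARC/1.x".
  entry <- cumsum(grepl("^WARC/1\\.[0-9]", lines))

  # Ignore anything that precedes the first record header.
  keep <- entry > 0
  records <- split(lines[keep], entry[keep])

  # Keep only entries in which at least one line matches entry_filter.
  if (!is.null(entry_filter)) {
    hit <- vapply(records, function(r) any(grepl(entry_filter, r)), logical(1))
    records <- records[hit]
  }

  # Within the remaining entries, keep only lines matching line_filter.
  if (!is.null(line_filter)) {
    records <- lapply(records, function(r) r[grepl(line_filter, r)])
  }

  data.frame(
    entry = as.integer(names(records)),
    contents = unname(vapply(records, paste, character(1), collapse = "\n")),
    stringsAsFactors = FALSE
  )
}
```

For example, read_warc(path, entry_filter = "WARC-Type: response") would return only response records, keeping their original entry numbers. Closing the connection in on.exit() is deliberate: it keeps repeated calls from accumulating open file handles.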

If that's all you need as well, I'll get this cleaned up under https://github.com/javierluraschi/warc. If you want/need more functionality than that, then we can work together or think of something else.

Mar 15 '18 19:03 javierluraschi