How do I remove a fileSystemDataset object without calling garbage collection from R?
I would like to create a FileSystemDataset object in a temp folder, process it in batches, and then remove it. This works fine on Linux and Mac, but on Windows a file lock prevents removal of the temp folder. I think the lock is created by arrow and is not released until I manually call the garbage collector. I don't think I should (or am allowed to) call the garbage collector from inside a function that is part of a CRAN-hosted R package. So how do I release the file lock on Windows so that I can delete the FileSystemDataset?
Reproducible example below.
library(arrow)
# create a FileSystemDataset object
filename <- here::here("tmp")
write_dataset(cars, filename, format = "feather")
ds <- open_dataset(filename, format = "feather")
ds
#> FileSystemDataset with 1 Feather file
#> speed: double
#> dist: double
#>
#> See $metadata for additional Schema metadata
# process the file in batches
scanner <- ScannerBuilder$create(ds)$BatchSize(batch_size = 4)$Finish()
reader <- scanner$ToRecordBatchReader()
batch_num <- 1
while (!is.null(batch <- reader$read_next_batch())) {
  print(paste("Reading batch", batch_num, "with", nrow(batch), "rows"))
  batch_num <- batch_num + 1
}
#> [1] "Reading batch 1 with 4 rows"
#> [1] "Reading batch 2 with 4 rows"
#> [1] "Reading batch 3 with 4 rows"
#> [1] "Reading batch 4 with 4 rows"
#> [1] "Reading batch 5 with 4 rows"
#> [1] "Reading batch 6 with 4 rows"
#> [1] "Reading batch 7 with 4 rows"
#> [1] "Reading batch 8 with 4 rows"
#> [1] "Reading batch 9 with 4 rows"
#> [1] "Reading batch 10 with 4 rows"
#> [1] "Reading batch 11 with 4 rows"
#> [1] "Reading batch 12 with 4 rows"
#> [1] "Reading batch 13 with 2 rows"
rm(reader)
rm(scanner)
rm(ds)
# remove the file
rc <- unlink(filename, recursive = TRUE)
if(rc == 1) print("removal of file failed")
#> [1] "removal of file failed"
file.exists(filename)
#> [1] TRUE
# call gc()
gc()
#> used (Mb) gc trigger (Mb) max used (Mb)
#> Ncells 1115105 59.6 2401181 128.3 1234217 66.0
#> Vcells 1937727 14.8 8388608 64.0 3294370 25.2
# remove the file
rc <- unlink(filename, recursive = TRUE)
if(rc == 1) print("removal of file failed")
file.exists(filename)
#> [1] FALSE
Created on 2022-10-19 by the reprex package (v2.0.1)
Session info
sessioninfo::session_info()
#> - Session info ---------------------------------------------------------------
#> setting value
#> version R version 4.0.5 (2021-03-31)
#> os Windows 10 x64
#> system x86_64, mingw32
#> ui RTerm
#> language (EN)
#> collate English_United States.1252
#> ctype English_United States.1252
#> tz America/New_York
#> date 2022-10-19
#>
#> - Packages -------------------------------------------------------------------
#> package * version date lib source
#> arrow * 9.0.0.2 2022-10-02 [1] CRAN (R 4.0.5)
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.0.5)
#> backports 1.4.0 2021-11-23 [1] CRAN (R 4.0.5)
#> bit 4.0.4 2020-08-04 [1] CRAN (R 4.0.5)
#> bit64 4.0.5 2020-08-30 [1] CRAN (R 4.0.5)
#> cli 3.0.1 2021-07-17 [1] CRAN (R 4.0.5)
#> crayon 1.5.1 2022-03-26 [1] CRAN (R 4.0.5)
#> DBI 1.1.2 2021-12-20 [1] CRAN (R 4.0.5)
#> digest 0.6.27 2020-10-24 [1] CRAN (R 4.0.5)
#> dplyr 1.0.8 2022-02-08 [1] CRAN (R 4.0.5)
#> ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.0.5)
#> evaluate 0.14 2019-05-28 [1] CRAN (R 4.0.5)
#> fansi 0.5.0 2021-05-25 [1] CRAN (R 4.0.5)
#> fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.0.5)
#> fs 1.5.0 2020-07-31 [1] CRAN (R 4.0.5)
#> generics 0.1.2 2022-01-31 [1] CRAN (R 4.0.5)
#> glue 1.6.2 2022-02-24 [1] CRAN (R 4.0.5)
#> here 1.0.1 2020-12-13 [1] CRAN (R 4.0.5)
#> highr 0.9 2021-04-16 [1] CRAN (R 4.0.5)
#> htmltools 0.5.2 2021-08-25 [1] CRAN (R 4.0.5)
#> knitr 1.36 2021-09-29 [1] CRAN (R 4.0.5)
#> lifecycle 1.0.1 2021-09-24 [1] CRAN (R 4.0.5)
#> magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.0.5)
#> pillar 1.7.0 2022-02-01 [1] CRAN (R 4.0.5)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.0.5)
#> purrr 0.3.4 2020-04-17 [1] CRAN (R 4.0.5)
#> R6 2.5.1 2021-08-19 [1] CRAN (R 4.0.5)
#> reprex 2.0.1 2021-08-05 [1] CRAN (R 4.0.5)
#> rlang 1.0.2 2022-03-04 [1] CRAN (R 4.0.5)
#> rmarkdown 2.10 2021-08-06 [1] CRAN (R 4.0.5)
#> rprojroot 2.0.2 2020-11-15 [1] CRAN (R 4.0.5)
#> rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.0.5)
#> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.5)
#> stringi 1.7.5 2021-10-04 [1] CRAN (R 4.0.5)
#> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.0.5)
#> styler 1.5.1 2021-07-13 [1] CRAN (R 4.0.5)
#> tibble 3.1.2 2021-05-16 [1] CRAN (R 4.0.5)
#> tidyselect 1.1.2 2022-02-21 [1] CRAN (R 4.0.5)
#> tzdb 0.2.0 2021-10-27 [1] CRAN (R 4.0.5)
#> utf8 1.2.1 2021-03-12 [1] CRAN (R 4.0.5)
#> vctrs 0.3.8 2021-04-29 [1] CRAN (R 4.0.5)
#> withr 2.5.0 2022-03-03 [1] CRAN (R 4.0.5)
#> xfun 0.25 2021-08-06 [1] CRAN (R 4.0.5)
#> yaml 2.2.1 2020-02-01 [1] CRAN (R 4.0.5)
#>
#> [1] C:/Users/adam.DESKTOP-D3KQQA1/Documents/R/win-library/4.0
#> [2] C:/Program Files/R/R-4.0.5/library
This looks like a similar problem to https://issues.apache.org/jira/browse/ARROW-16421. @wjones127 - that JIRA ticket is currently assigned to you - any ideas on possible workarounds etc?
@thisisnic, @wjones127 - Any progress on a workaround for this?
I've asked around, but I'm not sure there is a workaround; calling gc() might be the best solution for the moment.
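For anyone who wants to limit how often gc() actually runs, here is a minimal base-R sketch of the workaround. The helper name `remove_dir_with_gc_fallback()` is hypothetical and not part of arrow. It assumes, as in the reprex above, that the lock goes away once the garbage collector has run, and it only calls gc() when the first removal attempt leaves the path in place.

```r
# Hypothetical helper (not an arrow function): try to delete `path`, and only
# fall back to gc() if the first attempt leaves it on disk.
remove_dir_with_gc_fallback <- function(path) {
  unlink(path, recursive = TRUE)
  if (file.exists(path)) {
    # Assumption taken from the reprex: a garbage collection releases the
    # lingering file handle, so a second unlink() can succeed.
    gc(verbose = FALSE)
    unlink(path, recursive = TRUE)
  }
  # TRUE if the path is gone, FALSE if removal still failed
  !file.exists(path)
}

# Usage with a plain temp directory (no arrow objects involved)
example_dir <- tempfile("dataset_")
dir.create(example_dir)
writeLines("x", file.path(example_dir, "part-0.txt"))
removed <- remove_dir_with_gc_fallback(example_dir)
```

On Linux and Mac the first `unlink()` should already succeed, so the gc() branch would only be reached where the lock actually occurs.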
It might be good to revisit this after 11.0.0. The reader created from a scanner should have a close method that can be called. It should abort the plan and wait for the remaining tasks to finish up. I don't know that this fully removes all objects from memory (R is still doing something weird here), but it would ensure the file is closed, and this test case should pass.
Also, that is "should have a close method" in the sense of "we should do this", not "it should already exist". (RecordBatchReader does have Close, but I don't think the record batch reader that R uses today has an implemented close method.)
The R record batch reader does implement Close(), but adding it to the repro doesn't seem to fix the issue.
Hmm...I may take a look. This may be something the newer scanner fixes.
I'm going to try to rig a solution for this for the upcoming release, since we have a lot of open issues about this one (ARROW-18313, ARROW-17208, ARROW-17002, ARROW-16421, ARROW-16452).
We can discuss on the PR, but basically, we create many temporary R6 objects in the process of creating an ExecPlan. Those R objects keep shared pointers alive until the garbage collector runs. There are some cases where we can clean up some of those references by resetting the shared pointer when the function exits (which is predictable) rather than when the garbage collector runs (which is not). In the case of a dplyr::collect() we don't surface any R6 objects to the user so there shouldn't be any need for any lingering shared_ptr references to exist (at least because of R).
I'd propose that we add a $unsafe_delete() method to ArrowObject - or at least to a few types of objects - and see to what extent cleaning up those temporary references can avoid open files by the time collect() returns.
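To illustrate the difference between the two cleanup times described above, here is a base-R sketch. It is not arrow's implementation: `open_resource()`, `close_resource()`, and the `log` environment are hypothetical stand-ins for an object that holds a shared pointer to an open file. One function leaves cleanup to a finalizer (which runs whenever the garbage collector gets to it), the other releases the resource with `on.exit()` as the function returns.

```r
# Hypothetical stand-in for an object that keeps a file open.
open_resource <- function(log) {
  res <- new.env()
  res$open <- TRUE
  # The finalizer only runs when the garbage collector reclaims `res`,
  # which happens at an unpredictable time.
  reg.finalizer(res, function(e) {
    if (isTRUE(e$open)) {
      e$open <- FALSE
      log$events <- c(log$events, "closed by finalizer")
    }
  })
  res
}

# Hypothetical explicit release, analogous to resetting a shared pointer.
close_resource <- function(res, log) {
  if (isTRUE(res$open)) {
    res$open <- FALSE
    log$events <- c(log$events, "closed explicitly")
  }
  invisible(NULL)
}

# Cleanup is left to the garbage collector: the resource may still be
# open after this function returns.
use_resource_lazily <- function(log) {
  res <- open_resource(log)
  invisible(NULL)
}

# Cleanup happens when the function exits, before the caller continues.
use_resource_eagerly <- function(log) {
  res <- open_resource(log)
  on.exit(close_resource(res, log), add = TRUE)
  invisible(NULL)
}

log <- new.env()
log$events <- character()
use_resource_eagerly(log)
log$events
```

With the eager variant, the release is already recorded in `log$events` by the time the function returns. With `use_resource_lazily()`, nothing is recorded until some later garbage collection, which matches the behaviour seen in the reprex.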
I'm fairly sure that the linked PR works (or at least helps). I created a binary package from crossbow that you can install to try it without building Arrow from source, and I'd be grateful for testing! See https://github.com/apache/arrow/pull/15278#issuecomment-1380863503 for instructions on how to install the binary fix (and the PR comments for the various things I tried to verify that the fix worked).