
_files directory not removed when cache is active for knitr

Open jwhendy opened this issue 1 year ago • 8 comments

I was trying to figure out why I ended up with a fname_files directory, despite using self_contained: yes in my document. That led me to this issue which suggested this should be fixed, but I was still experiencing this, and I traced it to cache=T.

Here's a test Rmd.

With the ggplot chunk as-is, all is well. If I add cache=T to the options, I get a test_files directory which is not removed after rendering. The file really is self-contained. I can move it (after renaming and bypassing the annoying "this file will no longer be owned by the directory test_files" message) and open it fine, and the page source shows the png image embedded.

Apologies if this is known/expected; I'm pretty new to RStudio/knitr and am not very familiar with caching behavior.


Update pre-submit: aaannnd like most things, after I write everything up, I realized I missed something. I'm going to submit anyway; if nothing else it might resolve someone else's confusion/curiosity down the road.

In this comment, I noted the condition ... & !dir_exists(cache_dir). Does this mean that even with standalone documents, if one is using cache=T, one should always expect the linked _files directory?


R version 4.2.1 (2022-06-23 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042), RStudio 2022.7.1.554

Locale:
  LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United States.utf8   
  LC_MONETARY=English_United States.utf8 LC_NUMERIC=C                          
  LC_TIME=English_United States.utf8    

Package version:
  base64enc_0.1.3 bslib_0.4.0     cachem_1.0.6    digest_0.6.29   evaluate_0.16   fastmap_1.1.0   fs_1.5.2       
  glue_1.6.2      graphics_4.2.1  grDevices_4.2.1 highr_0.9       htmltools_0.5.3 jquerylib_0.1.4 jsonlite_1.8.0 
  knitr_1.40      magrittr_2.0.3  memoise_2.0.1   methods_4.2.1   R6_2.5.1        rappdirs_0.3.3  rlang_1.0.5    
  rmarkdown_2.16  sass_0.4.2      stats_4.2.1     stringi_1.7.8   stringr_1.4.1   tinytex_0.41    tools_4.2.1    
  utils_4.2.1     xfun_0.32       yaml_2.3.5     

Pandoc version: 2.18

jwhendy avatar Sep 03 '22 15:09 jwhendy

Thanks for the report.

Does this mean that even with standalone documents, if one is using cache=T, one should always expect the linked _files directory?

Yes. When cache = TRUE, the plot chunk is not re-evaluated, so the plot is not saved to a file again. Instead, the file written the first time is reused. Since those files are written to the figure dir, the path to the figure is what gets saved.

What is saved from the plot chunk is in fact the whole generated HTML, like this:

"<img src=\"test_files/figure-html/unnamed-chunk-2-1.png\" width=\"672\" />"

This is because knitr options allow tweaking some HTML attributes, so the generated HTML needs to be saved.

This is why we need to keep the _files folder. self_contained = TRUE only takes effect after the knitr step, when Pandoc converts to HTML. This is where the files are encoded so that the HTML depends only on itself. The Pandoc processing is not part of the cache, so we don't save the encoded plot, only the file from the knitr step.
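To make the two steps concrete, here is a rough sketch of how one could inspect a rendered file. The helper name `check_embedded` is made up for illustration and is not part of rmarkdown or knitr. It assumes that embedded PNGs show up in the page source as `data:image/png;base64,` URIs, while non-embedded ones still point into the figure directory as in the cached `<img>` tag above.

```r
# Hypothetical helper (not part of rmarkdown): report whether an HTML file
# embeds its PNG images or still links into an external figure directory.
check_embedded <- function(html_file, files_dir) {
  html <- paste(readLines(html_file, warn = FALSE), collapse = "\n")
  list(
    # TRUE if at least one image is stored inline as a base64 data URI
    has_data_uri = grepl("src=\"data:image/png;base64,", html, fixed = TRUE),
    # TRUE if at least one image still points at files_dir
    links_to_files = grepl(paste0("src=\"", files_dir, "/"), html, fixed = TRUE)
  )
}
```

Something like `check_embedded("test.html", "test_files")` after a render would then show whether the final HTML still depends on the directory, independently of what knitr stored in its cache.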

I hope this helps understand the behavior. I can understand the confusion though.

Note: for more advanced users, the cache setting can be controlled so that only the plot result in R is cached, not the file, and the plot is rewritten to file on each render. However, we don't recommend that. More about cache: https://yihui.org/knitr/demo/cache/, https://bookdown.org/yihui/rmarkdown-cookbook/cache.html
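For readers wondering what that setting might look like: the note does not name it, so this is an assumption on my part. The knitr chunk option documentation describes numeric cache levels, and `cache = 1` appears to be the one that matches the description. A minimal sketch, to be placed in a setup chunk:

```r
# Assumption: cache = 1 is the cache level the note alludes to.
# At this level knitr loads the evaluated results from the cache, but still
# re-runs the output hooks and writes the recorded plots to files again.
knitr::opts_chunk$set(cache = 1)
```

This only changes what knitr recreates on each run. It makes no claim about whether rmarkdown then removes the _files directory, and as said above it is not the recommended setup.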


Idea @yihui if we want to adapt this:

knitr could also save the file in the _cache folder and copy it to the figure dir during the knit when the cache is used. This way, if self_contained = TRUE, the figure dir would be removed and only the _cache directory would need to be saved.

Currently, if someone wants to save the cache (on CI for example), they need to save the *_cache dir AND the _files dir. Not that complex, but it could be confusing.
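As an illustration of what "saving both" means in practice, here is a small base R sketch. The function name `save_render_state` and the destination layout are made up for this example. It only assumes the default `<name>_cache` and `<name>_files` directory names next to the Rmd file.

```r
# Hypothetical helper: copy both directories a cached render depends on.
save_render_state <- function(rmd, dest) {
  stem <- tools::file_path_sans_ext(basename(rmd))
  dirs <- file.path(dirname(rmd), paste0(stem, c("_cache", "_files")))
  dirs <- dirs[dir.exists(dirs)]           # skip whichever one is missing
  dir.create(dest, recursive = TRUE, showWarnings = FALSE)
  ok <- file.copy(dirs, dest, recursive = TRUE)
  stats::setNames(ok, basename(dirs))      # one TRUE/FALSE per copied dir
}
```

Restoring would be the same copy in the other direction before calling render, so that the cached `<img>` paths still resolve.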

Is it something you tried in the past?


Full example to reproduce with the above file:
dir.create(tmp_dir <- tempfile())
owd <- setwd(tmp_dir)
url <- "https://gist.githubusercontent.com/jwhendy/f2f8023f6a520dec938c544b828aa440/raw/bd1369bc17c1096c958ab9b640f814dfca96f049/test.Rmd"
xfun::download_file(url)
#> [1] 0
rmd <- basename(url)
content <- xfun::read_utf8(rmd)
content[15] <- gsub("\\}$", " , cache = TRUE}", content[15])
xfun::write_utf8(content, rmd)
xfun::file_string(rmd)
#> ---
#> title: "test"
#> output:
#>   html_document:
#>     self_contained: yes
#> editor_options:
#>   chunk_output_type: console
#> ---
#> 
#> ```{r, echo=F, message=F}
#> library(dplyr)
#> library(ggplot2)
#> ```
#> 
#> ```{r, echo=F, message=F , cache = TRUE}
#> ggplot(mtcars, aes(x = mpg)) +
#>   geom_histogram(binwidth = 5, color="white")
#> ```
rmarkdown::render(rmd, quiet = TRUE)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
fs::dir_tree()
#> .
#> ├── test.html
#> ├── test.Rmd
#> ├── test_cache
#> │   └── html
#> │       ├── unnamed-chunk-4_8a7807240fa030296e9595005d997eb3.RData
#> │       ├── unnamed-chunk-4_8a7807240fa030296e9595005d997eb3.rdb
#> │       ├── unnamed-chunk-4_8a7807240fa030296e9595005d997eb3.rdx
#> │       └── __packages
#> └── test_files
#>     └── figure-html
#>         └── unnamed-chunk-4-1.png
setwd(owd)
unlink(tmp_dir, recursive = TRUE)

cderv avatar Sep 05 '22 09:09 cderv

I haven't tried that idea before, but it sounds like a good idea. Users can do it by themselves, though, by setting knitr::opts_chunk$set(fig.path = knitr::opts_chunk$get('cache.path')) inside the Rmd document.
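Spelled out as it might appear in a setup chunk, this is a sketch only. I have not verified how rmarkdown's cleanup treats the result, and the path mentioned in the comment is just the default layout seen in the directory tree above.

```r
# Put this in a setup chunk at the top of the Rmd, before any plotting chunk.
# fig.path and cache.path are both path prefixes, so figures are then written
# under the cache location instead of <name>_files/figure-html/.
knitr::opts_chunk$set(fig.path = knitr::opts_chunk$get("cache.path"))
```

With this in place, the plot files and the cache databases would live side by side, so only one directory needs to be kept around between renders.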

You have explained the problem clearly, and I don't have anything to add. In this case, the *_files directory is indeed no longer needed for self_contained = TRUE, but only if the *.html output file doesn't need to be regenerated in the future. If we delete the *_files directory and regenerate *.html, we will run into an error (plot files not found).
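One way to sidestep that error when cleaning up by hand is to delete the cache together with the figures, so the next render re-evaluates the chunks instead of looking for the missing PNGs. A minimal sketch, with a made-up function name and the default directory names assumed:

```r
# Hypothetical helper: remove <name>_files and <name>_cache in one go.
reset_render_state <- function(rmd) {
  stem <- tools::file_path_sans_ext(basename(rmd))
  dirs <- file.path(dirname(rmd), paste0(stem, c("_files", "_cache")))
  unlink(dirs, recursive = TRUE)   # silently ignores missing directories
  invisible(!dir.exists(dirs))     # TRUE for each directory that is now gone
}
```

The price is that the cached computation is lost, so this only makes sense when the self-contained HTML is the thing worth keeping.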

yihui avatar Sep 06 '22 16:09 yihui

I hope this helps understand the behavior. I can understand the confusion though.

I tried... though I admit the nuances between what happens with cache=T and standalone are not so clear. That said, I don't necessarily need to understand :)

The practical issue I found bothersome was the whole "this file is owned by this directory." I ran into this as I wanted to upload my output .html to my team site and Windows brought the directory along with it. If we could break that link, the need to copy files is covered, and I could just delete the _files directories when I happened to see them (if I wanted).

If there's some additional improvement, I'm of course also all for it.

jwhendy avatar Sep 06 '22 17:09 jwhendy

Another nuance to this I wanted to bring up. I was using this trick so that I could maintain a notebooks directory separate from my generated output. In my header, I have:

knit: (function(input, ...) {rmarkdown::render(input, output_dir = "../output")})

Even with no caching, this results in a persisting _files directory. Not sure it's trivial to pass a different output directory around to the internals for cleanup, but wanted to mention it. My current idea for a workflow is not working due to this, as my hope was to keep my notebooks directory clean, but either (a) it gets littered with the output files and I have to move them manually or (b) my output directory has these htmls with inherited ownership I can't share easily as they want to bring the directory with them.
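A possible workaround, sketched under several assumptions: the function names below are invented, the leftover directory is assumed to follow the `<name>_files` pattern, and removing it is only safe when the HTML is self-contained and no cache is in use. `rmarkdown::render()` returns the path of the output file, which the sketch relies on.

```r
# Hypothetical helper: remove the "<name>_files" directory next to a file.
drop_files_dir <- function(path) {
  files_dir <- paste0(tools::file_path_sans_ext(path), "_files")
  unlink(files_dir, recursive = TRUE)
  invisible(files_dir)
}

# Hypothetical knit function: render into ../output, then clean up next to
# both the output file and the source file.
knit_to_output <- function(input, ...) {
  out_dir <- file.path(dirname(input), "..", "output")
  out <- rmarkdown::render(input, output_dir = out_dir)
  drop_files_dir(out)
  drop_files_dir(input)
  invisible(out)
}
```

To use it from the knit: header field, the function body would presumably have to be inlined there or live in a package, since the field is evaluated before any code in the document runs.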

jwhendy avatar Sep 13 '22 14:09 jwhendy