# knitr engine API and cache compatibility with reticulate engine
I think whether or not this is a knitr or reticulate bug depends on the knitr engine API, which I do not completely understand. I carefully searched for this bug in the knitr and reticulate issues and didn't see anything, so I apologize if this is already known.
## The bug
Suppose we have the following file, called `python_test.Rmd`:
````
---
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE,
                      results='show', cache=TRUE, autodep=FALSE)
knitr::opts_knit$set(progress = TRUE, verbose = TRUE)
knitr::knit_engines$set(python = reticulate::eng_python)
```

```{python chunk1}
x = 1
print(x)
```

```{python chunk2}
print(x + 9)
```
````
When you first press the Knit button, the document compiles successfully. Now suppose I change chunk2 so it has `print(x + 10)` and I save the file. If I try clicking the Knit button, I get the following error:
```
Quitting from lines 19-20 (python_test.Rmd)
Error in py_run_string_impl(code, local, convert) :
  NameError: name 'x' is not defined

Detailed traceback:
  File "<string>", line 1, in <module>
Calls: <Anonymous> ... force -> py_run_string -> py_run_string_impl -> .Call
Execution halted
```
## My efforts to debug
Here's what I've learned about the error:
- It reliably happens when I call `knitr::knit('python_test.Rmd', envir = new.env())` after a session restart, so I don't think it is an `rmarkdown::render` error.
- The error message points to `py_run_string_impl`, which is a reticulate function. But I believe the problem arises before knitr reaches the python engine.
- When you call `knitr::knit('python_test.Rmd', envir = new.env())`, chunk1 eventually enters the `call_block` function in `knitr/R/block.R`. It passes `if (params$cache > 0)`, and the hash comes up with the same value. Then `cache$load` tries to bring the saved data into `knit_global()`.
- At this point, if you run `ls(knit_global())` you'll see `character(0)`, so it isn't clear whether the python object `x = 1` was even saved.
- However, whether it was saved doesn't even matter, because nothing gets a chance to use it. When chunk2 starts down the same path, its hash has changed, so it moves on to `block_exec`. If this were R code, it would have access to the `cache$load`ed objects from chunk1 via `env = knit_global()`, but non-R engines go down a separate branch.
- `block_exec` tells the reticulate engine to execute `print(x + 9)`, but it fails because it doesn't know `x`. You can verify this in `eng_python_synchronize_before` by checking whether `main` contains `x`, which it doesn't (assuming you called `knit` after a session restart; see the sketch after this list). The only thing passed to `eng_python` is `options`, which as far as I can tell doesn't include any environment information such as `x = 1`.
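A quick way to see this mismatch from a fresh R session, assuming `python_test.Rmd` and its cache directory already exist from an earlier successful knit and chunk2 has since been edited (this only inspects state; it is not part of any engine API):

```r
library(reticulate)

# Re-knitting fails at chunk2 with the NameError shown above
try(knitr::knit('python_test.Rmd', envir = new.env()))

# chunk1 was skipped because its hash matched the cache, so the embedded
# Python session never actually ran `x = 1`:
main <- import_main(convert = FALSE)
py_has_attr(main, "x")  # FALSE
```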
## What is not clear to me
Despite crawling through the reticulate source code, I'm not actually sure how the state is saved between chunks when `cache=FALSE`. Each time a python chunk is executed by `eng_python` in `reticulate/R/knitr-engine.R`, it calls `import_main`, which provides an object `main` that has the previous chunk's variables (`x = 1`), but I don't see where this data is saved from chunk to chunk.
I know the `main` data has to be saved somewhere, because if you don't restart the session after calling `knitr::knit('python_test.Rmd', envir = new.env())`, the `main` variable will still contain `x = 1`, even though the `knit(..)` call never runs `x = 1` or loads it into memory, since it was cached.
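For what it's worth, a quick interactive check suggests the state simply lives in the embedded Python session for the lifetime of the R process, rather than in anything knitr manages; this is only an observation from the console, not a reading of the engine internals:

```r
library(reticulate)

py_run_string("x = 1")            # roughly what chunk1 does
main <- import_main(convert = FALSE)
py_has_attr(main, "x")            # TRUE: x lives in Python's __main__ module

# x survives for as long as this R session (and its embedded Python) does,
# which is why a second knit() in the same session still sees it even though
# the cached chunk1 is never re-run.
py_run_string("print(x + 9)")     # prints 10
```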
## What to do?
I know everyone's busy, so I'm happy to help by making a PR, but it is not clear to me how to fix this.
Ideas:
1. Factor out the chunk hashing. If `autodep=FALSE`, then if any cache fails you have to `block_exec` every chunk.
2. Rethink the engine API, so the language can provide loading/saving for its chunks. In fact, reticulate already supports pickle, a Python package that can save and load Python objects. As far as I can tell, this would still require refactoring some knitr code, since engines don't touch caching at all at the moment. (A rough sketch of this idea follows the list.)
3. Suppose `.RData` is able to save the python object (I don't know if that's possible). Similar to number 2, you could slightly change the engine API so you pass that object to the engine to process.
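To make idea 2 a bit more concrete, here is a rough sketch of what engine-level saving and loading could look like for Python chunks, using `pickle` through reticulate. The helper names `save_python_chunk_state()` and `load_python_chunk_state()` are made up for illustration, and this only handles picklable objects (modules, open files, etc. would need special treatment):

```r
library(reticulate)

# Hypothetical helpers: serialize/restore the picklable, non-module globals of
# Python's __main__, which is roughly what an engine-level cache hook would do.
save_python_chunk_state <- function(path) {
  py_run_string(sprintf("
import pickle, types
_state = {k: v for k, v in globals().items()
          if not k.startswith('_') and not isinstance(v, types.ModuleType)}
with open(r'%s', 'wb') as _f:
    pickle.dump(_state, _f)
", path))
}

load_python_chunk_state <- function(path) {
  py_run_string(sprintf("
import pickle
with open(r'%s', 'rb') as _f:
    globals().update(pickle.load(_f))
", path))
}

# Simulate chunk1 being cached and later restored without re-running it:
py_run_string("x = 1")
save_python_chunk_state("chunk1.pkl")
py_run_string("del x")
load_python_chunk_state("chunk1.pkl")
py_run_string("print(x + 9)")  # x is back, prints 10
```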
## Session data
```
> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: OS X El Capitan 10.11.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached):
[1] compiler_3.4.1 backports_1.1.2 magrittr_1.5.0 rsconnect_0.8.5
[5] rprojroot_1.3-2 htmltools_0.3.6 tools_3.4.1 yaml_2.1.16
[9] Rcpp_0.12.15 stringi_1.1.6 rmarkdown_1.8.10 knitr_1.19.2
[13] stringr_1.2.0 digest_0.6.15 evaluate_0.10.1
```
I'm afraid I'll have to defer this issue to @kevinushey (author of the python engine in reticulate).
For what it's worth, I never tried wiring cache support into the reticulate engine, as I wasn't exactly sure what that would entail, but it sounds like we'd need:

- Serialization of the `main` module (probably with `pickle`?)
- A list of Python modules to be imported (what about state within a particular module?)
Thanks @kevinushey. Something else to look into is `dill`, which extends `pickle`. From the readme:

> Hence, it would be feasible to save an interpreter session, close the interpreter, ship the pickled file to another computer, open a new interpreter, unpickle the session and thus continue from the 'saved' state of the original interpreter session.
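For example, via reticulate (assuming the `dill` Python package is installed in the active Python environment; this is just a sketch of dill's session API, not something the engine does today):

```r
library(reticulate)
dill <- import("dill")

py_run_string("x = 1")
dill$dump_session("session.pkl")   # serialize all of Python's __main__
py_run_string("del x")
dill$load_session("session.pkl")   # restore it
py_run_string("print(x + 9)")      # prints 10 again
```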
And the reason I opened the issue here is that I think `knitr` will also require some refactoring to allow specific engines to handle caching. Right now all caching is handled by the `cache` methods in `R/block.R`: `call_block` for loading and `block_cache` for saving.
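To make that refactoring a bit more concrete, here is one possible shape for a per-engine cache hook. The `cache_engines` registry and the `save`/`load` signatures below are purely hypothetical, not an existing knitr API; the bodies just delegate to `dill` as in the sketch above:

```r
# Hypothetical registry that knitr's call_block()/block_cache() could consult,
# so non-R engines serialize their own state instead of relying on .RData.
cache_engines <- new.env()

cache_engines$python <- list(
  save = function(cache_path) {
    reticulate::py_run_string(
      sprintf("import dill; dill.dump_session(r'%s.pkl')", cache_path))
  },
  load = function(cache_path) {
    reticulate::py_run_string(
      sprintf("import dill; dill.load_session(r'%s.pkl')", cache_path))
  }
)

# knitr would then call cache_engines$python$save(<chunk hash path>) after
# running a Python chunk, and cache_engines$python$load(<chunk hash path>)
# instead of trying to load R objects when a cached Python chunk is hit.
```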
If this is something you'd like for reticulate, I'd be interested in helping out with PRs. I'd like to be able to write Python with rmarkdown using all the knitr features.
I'd definitely be open to reviewing a PR, but it seems like this will be tough to get right and I unfortunately won't have that much time to help with the actual implementation in the coming months.
Hi! I'm working on it (by sheer necessity). There are some serious problems with the `dill` package at the moment, but I'm also updating some currently broken logic in the code proposed by @tmastny, and it is mostly working now, with some edge cases requiring monkey patches. As soon as the bugs in `dill` are fixed, I'll send a new pull request based on his code.
Beyond basic usage, I find that a Python cache engine for `knitr` is essential. We need this, folks! :rocket: