# knitr engine API and cache compatibility with reticulate engine
I think whether or not this is a knitr or reticulate bug depends on the knitr engine API, which I do not completely understand. I carefully searched for this bug in the knitr and reticulate issues and didn't see anything, so I apologize if this is already known.
## The bug
Suppose we have the following file, called `python_test.Rmd`:
````
---
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE,
                      results='show', cache=TRUE, autodep=FALSE)
knitr::opts_knit$set(progress = TRUE, verbose = TRUE)
knitr::knit_engines$set(python = reticulate::eng_python)
```

```{python chunk1}
x = 1
print(x)
```

```{python chunk2}
print(x + 9)
```
````
When you first press the Knit button, the document compiles successfully. Now suppose I change chunk2 so it has `print(x + 10)` and I save the file. If I try clicking the Knit button, I get the following error:
```
Quitting from lines 19-20 (python_test.Rmd)
Error in py_run_string_impl(code, local, convert) :
  NameError: name 'x' is not defined

Detailed traceback:
  File "<string>", line 1, in <module>
Calls: <Anonymous> ... force -> py_run_string -> py_run_string_impl -> .Call
Execution halted
```
## My efforts to debug
Here's what I've learned about the error:
- It reliably happens when I call `knitr::knit('python_test.Rmd', envir = new.env())` after a session restart, so I don't think it is an `rmarkdown::render` error.
- The error message points to `py_run_string_impl`, which is a reticulate function. But I believe the problem arises before knitr reaches the python engine.
- When you call `knitr::knit('python_test.Rmd', envir = new.env())`, chunk1 eventually enters the `call_block` function in `knitr/R/block.R`. It passes `if (params$cache > 0)`, and the hash comes up with the same value. Then `cache$load` tries to bring the saved data into `knit_global()`.
- At this point, if you run `ls(knit_global())` you'll see `character(0)`, so it isn't clear whether the python object `x = 1` was even saved.
- However, whether it was saved doesn't even matter, because nothing gets a chance to use it. When chunk2 starts down the same path, its hash has changed, so it moves on to `block_exec`. If this were R code, it would have access to the `cache$load`ed objects from chunk1 via `env = knit_global()`, but non-R engines go down a separate branch.
- `block_exec` tells the reticulate engine to execute `print(x + 9)`, but it fails because it doesn't know `x`. You can verify this in `eng_python_synchronize_before` by checking whether `main` contains `x`, which it doesn't (assuming you called `knit` after a session restart; see the sketch after this list). The only thing passed to `eng_python` is `options`, which as far as I can tell doesn't include any environment information such as `x = 1`.
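A quick way to see this mismatch from a fresh R session, assuming `python_test.Rmd` and its cache directory already exist from an earlier successful knit and chunk2 has since been edited (this only inspects state; it is not part of any engine API):

```r
library(reticulate)

# Re-knitting fails at chunk2 with the NameError shown above
try(knitr::knit('python_test.Rmd', envir = new.env()))

# chunk1 was skipped because its hash matched the cache, so the embedded
# Python session never actually ran `x = 1`:
main <- import_main(convert = FALSE)
py_has_attr(main, "x")  # FALSE
```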
## What is not clear to me
Despite crawling through the reticulate source code, I'm not actually sure how the state is saved between chunks when `cache=FALSE`. Each time a python chunk is executed by `eng_python` in `reticulate/R/knitr-engine.R`, it calls `import_main`, which provides an object `main` that has the previous chunk's variables (`x = 1`), but I don't see where this data is saved from chunk to chunk.
I know the `main` data has to be saved somewhere, because if you don't restart the session after calling `knitr::knit('python_test.Rmd', envir = new.env())`, the `main` variable will still contain `x = 1`, even though the `knit(..)` call never runs `x = 1` or loads it into memory, since it was cached.
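For what it's worth, a quick interactive check suggests the state simply lives in the embedded Python session for the lifetime of the R process, rather than in anything knitr manages; this is only an observation from the console, not a reading of the engine internals:

```r
library(reticulate)

py_run_string("x = 1")            # roughly what chunk1 does
main <- import_main(convert = FALSE)
py_has_attr(main, "x")            # TRUE: x lives in Python's __main__ module

# x survives for as long as this R session (and its embedded Python) does,
# which is why a second knit() in the same session still sees it even though
# the cached chunk1 is never re-run.
py_run_string("print(x + 9)")     # prints 10
```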
## What to do?
I know everyone's busy, so I'm happy to help by making a PR, but it is not clear to me how to fix this.
Ideas:
1. Factor out the chunk hashing. If `autodep=FALSE`, then if any cache fails you have to `block_exec` every chunk.
2. Rethink the engine API, so the language can provide loading/saving for its chunks. In fact, reticulate already supports pickle, a Python package that can save and load Python objects. As far as I can tell, this would still require refactoring some knitr code, since engines don't touch caching at all at the moment. (A rough sketch of this idea follows the list.)
3. Suppose `.RData` is able to save the python object (I don't know if that's possible). Similar to number 2, you could slightly change the engine API so you pass that object to the engine to process.
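To make idea 2 a bit more concrete, here is a rough sketch of what engine-level saving and loading could look like for Python chunks, using `pickle` through reticulate. The helper names `save_python_chunk_state()` and `load_python_chunk_state()` are made up for illustration, and this only handles picklable objects (modules, open files, etc. would need special treatment):

```r
library(reticulate)

# Hypothetical helpers: serialize/restore the picklable, non-module globals of
# Python's __main__, which is roughly what an engine-level cache hook would do.
save_python_chunk_state <- function(path) {
  py_run_string(sprintf("
import pickle, types
_state = {k: v for k, v in globals().items()
          if not k.startswith('_') and not isinstance(v, types.ModuleType)}
with open(r'%s', 'wb') as _f:
    pickle.dump(_state, _f)
", path))
}

load_python_chunk_state <- function(path) {
  py_run_string(sprintf("
import pickle
with open(r'%s', 'rb') as _f:
    globals().update(pickle.load(_f))
", path))
}

# Simulate chunk1 being cached and later restored without re-running it:
py_run_string("x = 1")
save_python_chunk_state("chunk1.pkl")
py_run_string("del x")
load_python_chunk_state("chunk1.pkl")
py_run_string("print(x + 9)")  # x is back, prints 10
```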
## Session data
```
> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: OS X El Capitan 10.11.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached):
[1] compiler_3.4.1 backports_1.1.2 magrittr_1.5.0 rsconnect_0.8.5
[5] rprojroot_1.3-2 htmltools_0.3.6 tools_3.4.1 yaml_2.1.16
[9] Rcpp_0.12.15 stringi_1.1.6 rmarkdown_1.8.10 knitr_1.19.2
[13] stringr_1.2.0 digest_0.6.15 evaluate_0.10.1
```
I'm afraid I'll have to defer this issue to @kevinushey (author of the python engine in reticulate).
For what it's worth, I never tried wiring cache support into the reticulate engine, as I wasn't exactly sure what that would entail, but it sounds like we'd need:

- Serialization of the `main` module (probably with `pickle`?)
- A list of Python modules to be imported (what about state within a particular module?)
Thanks @kevinushey. Something else to look into is `dill`, which extends `pickle`. From the readme:

> Hence, it would be feasible to save an interpreter session, close the interpreter, ship the pickled file to another computer, open a new interpreter, unpickle the session and thus continue from the 'saved' state of the original interpreter session.
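For example, via reticulate (assuming the `dill` Python package is installed in the active Python environment; this is just a sketch of dill's session API, not something the engine does today):

```r
library(reticulate)
dill <- import("dill")

py_run_string("x = 1")
dill$dump_session("session.pkl")   # serialize all of Python's __main__
py_run_string("del x")
dill$load_session("session.pkl")   # restore it
py_run_string("print(x + 9)")      # prints 10 again
```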
And the reason I opened the issue here is that I think `knitr` will also require some refactoring to allow specific engines to handle caching. Right now all caching is handled by the `cache` methods in `R/block.R`: `call_block` for loading and `block_cache` for saving.
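To make that refactoring a bit more concrete, here is one possible shape for a per-engine cache hook. The `cache_engines` registry and the `save`/`load` signatures below are purely hypothetical, not an existing knitr API; the bodies just delegate to `dill` as in the sketch above:

```r
# Hypothetical registry that knitr's call_block()/block_cache() could consult,
# so non-R engines serialize their own state instead of relying on .RData.
cache_engines <- new.env()

cache_engines$python <- list(
  save = function(cache_path) {
    reticulate::py_run_string(
      sprintf("import dill; dill.dump_session(r'%s.pkl')", cache_path))
  },
  load = function(cache_path) {
    reticulate::py_run_string(
      sprintf("import dill; dill.load_session(r'%s.pkl')", cache_path))
  }
)

# knitr would then call cache_engines$python$save(<chunk hash path>) after
# running a Python chunk, and cache_engines$python$load(<chunk hash path>)
# instead of trying to load R objects when a cached Python chunk is hit.
```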
If this is something you'd like for reticulate, I'd be interested in helping out with PRs. I'd like to be able to write Python with rmarkdown using all the knitr features.
I'd definitely be open to reviewing a PR, but it seems like this will be tough to get right and I unfortunately won't have that much time to help with the actual implementation in the coming months.
Hi! I'm working on it (by sheer necessity). There are some serious problems with the `dill` package at the moment, but I'm also updating some currently broken logic in the code proposed by @tmastny, and it is mostly working now, with some edge cases requiring monkey patches. As soon as the bugs in `dill` are fixed, I'll send a new pull request based on his code.
Beyond basic usage, I find that a Python cache engine for `knitr` is essential. We need this, folks! :rocket: