Usage vs. Jupyter save hook

Open michaelaye opened this issue 8 years ago • 33 comments

What is the advantage of doing this vs. just letting Jupyter do it, as described in the docs here:

http://jupyter-notebook.readthedocs.org/en/latest/extending/savehooks.html

Maybe the fact that one can switch it on per repo and not generally for all Jupyter notebook activities?

If that is the reason, how can one reliably implement the concept of also version-controlling the python script version of the notebook using nbstripout? Because the trouble I'm seeing is that if I only let Jupyter do the python script generation and have nbstripout working as a pre-commit hook, then the script conversion will suffer from creating git diff noise because the pre-commit hook is run later, not when I save the notebook. So one would have to include the script generation somehow into the nbstripout tool I guess to have this possibility working well?

michaelaye avatar Jan 29 '16 23:01 michaelaye

I have to admit I wasn't aware of the Jupyter save hooks until now.

The main difference I think is that the save hook would lead to the output never being saved to disk in the first place i.e. if you were to stop the server and reopen the notebook you'd have no output.

With the nbstripout Git filter you do have the "full" notebook saved on disk, but Git ignores the output when diffing and committing.

I'm not sure I fully understand your use case of maintaining a separate (?) python script. Could you give an example?

kynan avatar Jan 30 '16 20:01 kynan
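
For reference, the save-hook approach being compared here can be sketched roughly as follows. This is a minimal sketch modelled on the pre-save hook pattern from the Jupyter docs linked above, not code from nbstripout; the function name `scrub_output_pre_save` and the assumption that the hook receives a `model` dict with `type` and `content` keys are taken from that pattern.

```python
def scrub_output_pre_save(model, **kwargs):
    """Clear outputs and execution counts before Jupyter writes the notebook."""
    # Only act on notebooks in nbformat v4; other files pass through untouched.
    if model.get("type") != "notebook":
        return
    if model["content"].get("nbformat") != 4:
        return
    for cell in model["content"]["cells"]:
        if cell.get("cell_type") != "code":
            continue
        cell["outputs"] = []
        cell["execution_count"] = None


# In jupyter_notebook_config.py the hook would then be registered with
# something like (assumption: the classic notebook's FileContentsManager):
# c.FileContentsManager.pre_save_hook = scrub_output_pre_save
```

Because the hook mutates the model before it is written, the output never reaches the disk, which is exactly the difference to the Git filter described above.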

OK, having read the post_save_hook example in the docs you linked to makes it a bit clearer: is the problem that your script would contain the output (since you don't run a pre_save_hook to strip it) whereas your notebook (in Git's view) wouldn't?

kynan avatar Jan 30 '16 20:01 kynan

Yes (and the execution counts), which would mean that the python script would look different many times and hence create git diff noise.

michaelaye avatar Jan 31 '16 00:01 michaelaye

I haven't used the script output much. Have done some simple tests and it seems by default the execution counts are included as comments but the output is not?

Could you use nbstripout programmatically, e.g. call strip_output in your post-save hook?

kynan avatar Jan 31 '16 09:01 kynan

I solved it by adapting the end of the post_save_hook to:

    with io.open(script_fname, 'w', encoding='utf-8') as f:
        for line in script.splitlines():
            if line.startswith("# In["):
                f.write("# newcell\n")
                continue
            f.write(line+'\n')

because the only thing left that could change in there were the execution counts. Yeah!

michaelaye avatar Jan 31 '16 22:01 michaelaye
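
The fragment above depends on `script` and `script_fname` from the surrounding post-save hook. A self-contained sketch of the same idea, with the marker replacement pulled out into a hypothetical helper named `normalize_cell_markers` (not part of Jupyter or nbstripout), might look like this:

```python
import io


def normalize_cell_markers(script):
    """Replace every '# In[...]' marker line with a constant '# newcell' line."""
    lines = []
    for line in script.splitlines():
        if line.startswith("# In["):
            # The execution count is the only part that changes between runs,
            # so dropping it keeps the exported script stable under git diff.
            lines.append("# newcell")
        else:
            lines.append(line)
    return "\n".join(lines) + "\n"


def write_script(script, script_fname):
    """Write the normalized script to disk as UTF-8."""
    with io.open(script_fname, "w", encoding="utf-8") as f:
        f.write(normalize_cell_markers(script))
```

Keeping the rewrite separate from the file I/O makes it easy to check on a plain string before wiring it into a hook.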

FWIW: A problem with using the script option is that there is no guarantee that one will be able to convert the script back to a proper notebook. The Jupyter save-hook functionality is nice, but it then seems to need two files on disk - one clean and one with output - and then some functionality needs to be implemented to allow these to be merged etc. (which I would love, but seems a bit complicated to implement reliably).

My current strategy is to use the VCS to store the clean notebooks, and then tack on the notebook with output as an additional leaf node that could be stripped out later. One can then track differences along the clean branch, or along the output branch. As long as the workflow never merges the output branch back into the clean branch, it can always be stripped or pruned later as needed.

I still need to work with this for a while to see how easy and reliable it is though.

mforbes avatar Feb 01 '16 19:02 mforbes

Why would you need to be able to go from the Python script back to the notebook if you always store the notebook as well? I see the Python script only as a human readable version of the notebook that is just always being created with every save of the notebook. It has no functional use for me.

michaelaye avatar Feb 03 '16 20:02 michaelaye

@michaelaye I think I do not understand your use case - what are you using the script for? My use case was to use the clean script as the definitive version of the notebook in my VCS: if there had been a safe way to get back to a notebook, then this might have been ideal - merges etc. could be performed on the script, and then the notebook reconstructed from this. Alas, this is not supported, so I store the clean notebook and (optionally) the notebook with output in the hopes that I don't need to do any sophisticated merging (or that the nbdiff project gets back up and running by the time I need to!)

mforbes avatar Feb 03 '16 20:02 mforbes

Notebooks are my development environment. But git diffs of the json code of notebooks are almost unreadable. I store the python version of it in GIT as well for easily readable changes of the code in the notebook.

michaelaye avatar Feb 03 '16 20:02 michaelaye

So I store both the notebook AND the python script, but the latter only because it offers an easier way to keep track of changes than the json code of the notebook.

michaelaye avatar Feb 03 '16 20:02 michaelaye

What do you do about the output in your notebooks? Do you store that too and just not worry about the size of your repos?

mforbes avatar Feb 03 '16 20:02 mforbes

No, I strip it either with the pre_save_hook in my jupyter config or with the pre-commit hook as offered by nbstripout. The latter has the advantage that the content exists on disk but does not ‘exist’ for git.

michaelaye avatar Feb 03 '16 20:02 michaelaye

I understand. That is sort of what I have been doing (except not with the additional script), but now I would like to also include the output as a separate strippable branch in my VCS so that I can easily share results along with the code with collaborators.

mforbes avatar Feb 03 '16 20:02 mforbes

So, why do you want to share output created on your machine with your collaborators? Is it hard for them to recreate? Is it not paramount that they will be able to recreate it, otherwise you have code that runs only on one machine? And if it’s for information only, wouldn’t it be enough to share the PDF version of the notebook that contains the output? git is already complicated enough for me, your ‘strippable branches’ sound like a disaster to me. ;)

michaelaye avatar Feb 03 '16 20:02 michaelaye

Sometimes the results are the end of several hours of simulations, so they are hard to create. Other times collaborators are not familiar with the code, so they just want to see the graphs. Finally, I like to have a way of backing up my output temporarily or the work of collaborators who do not know how to use VCS (especially when working on a shared machine like Sage Mathcloud).

I agree that git is already complicated enough. I only do this with mercurial now which to me is much more understandable, and it so far seems pretty straightforward.

mforbes avatar Feb 03 '16 20:02 mforbes

@michaelaye, @mforbes: any conclusions on this?

kynan avatar Feb 15 '16 21:02 kynan

My conclusion is that the nbstripout use with .gitattributes is currently the only way to have both a cleaned notebook in git while keeping outputs on local disk. Plus, it offers a simple way to activate and deactivate it for a respective git repo, while using the pre-save-hook would require some Jupyter config hackery to make it work here but not there, as Jupyter does not support profiles currently.

michaelaye avatar Feb 15 '16 21:02 michaelaye
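
To make the per-repo switch concrete: a Git filter is enabled by one attributes line plus `git config` entries that, without `--global`, are stored in that repository only. The sketch below just assembles those pieces; `filter_setup` is a hypothetical helper for illustration, and the assumption is that an `nbstripout` executable on the PATH reads a notebook on stdin and writes the stripped one to stdout.

```python
def filter_setup(name="nbstripout", command="nbstripout"):
    """Return the attributes line and git config commands for a clean filter."""
    # Goes into .gitattributes (shared) or .git/info/attributes (local only).
    attributes_line = "*.ipynb filter={}".format(name)
    config_commands = [
        # 'clean' runs when content is staged, so Git only ever sees stripped notebooks.
        ["git", "config", "filter.{}.clean".format(name), command],
        # 'smudge' runs on checkout; 'cat' leaves the file as stored.
        ["git", "config", "filter.{}.smudge".format(name), "cat"],
    ]
    return attributes_line, config_commands


# Usage sketch: run inside the repository that should strip outputs, e.g.
# import subprocess
# line, commands = filter_setup()
# for cmd in commands:
#     subprocess.check_call(cmd)
```

Removing the attributes line (or the config entries) turns the filter off again for that repository, without touching any Jupyter configuration.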

I still like the idea of using nbstripout controlled by the VCS through hooks etc. Ultimately I think this should be rolled into Jupyter hooks, but not until it is clear how to manage these (i.e. dealing with profiles as Michael mentions).

mforbes avatar Feb 15 '16 22:02 mforbes

Glad to hear nbstripout is a valuable complement to the Jupyter hooks. Maybe it's worth involving some of the Jupyter folks in this discussion? @minrk @takluyver @carreau

kynan avatar Feb 15 '16 22:02 kynan

After a quick read, I guess nbstripout is mostly equivalent to a convenience wrapper around:

$ jupyter nbconvert --to notebook --ClearOutputPreprocessor.enabled=True --inplace <notebook>

You can also use the --stdin and --stdout flags (which might only be on master). But then it becomes hard to remember all the options.

It does, though, provide convenient methods to enable/disable the git clean/smudge filter, which is nice. We usually try not to get into the habit of adding support for a specific VCS like git to our codebase.

So i'm not sure there is anything to move into the Jupyter/Nbconvert codebase, though if you like to link to nbstripout from the documentation or nbconvert itself, I don't see any problem with that.

Also, rolling that into Jupyter makes the release cycle of such tools slower, which might not be a good thing.

Carreau avatar Feb 15 '16 22:02 Carreau

Not really, b/c IIUC the jupyter nbconvert above will change the content on disk which we wanted to avoid? Or do you mean to put that command line somehow as a git filter itself?

michaelaye avatar Feb 15 '16 22:02 michaelaye

> Not really, b/c IIUC the jupyter nbconvert above will change the content on disk which we wanted to avoid? Or do you mean to put that command line somehow as a git filter itself?

You do not need to change inplace, you can add --stdin --stdout:

$ jupyter nbconvert --to notebook --ClearOutputPreprocessor.enabled=True --stdout --stdin < Untitled11.ipynb
[NbConvertApp] Converting notebook into notebook
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],

Unless I misunderstand what you are trying to achieve. Also I might be wrong but if I remember my bash correctly you can transform a stream into a filehandle with <(...) [note the omission of stdin]:

$ jupyter nbconvert --to notebook --ClearOutputPreprocessor.enabled=True --stdout <(cat Untitled11.ipynb)
[NbConvertApp] Converting notebook into notebook
{
 "cells": [
  {
   "cell_type": "co

But I don't expect anyone to be a Bash wizard.

Of course, what I do from the CLI can be done with the API, and yes, it could be used as a git filter. You can try to do a similar thing in pre/post-save hooks, and if you want something more complex, you might want to just subclass the contents manager, like ipymd or pgcontents.

Unless I'm misunderstanding the question.

Carreau avatar Feb 16 '16 01:02 Carreau

Thanks! I think you got our aim correctly, basically we want only clear output for git, while leaving things untouched on the local stored version, so that the output is still there when continuing to work on the notebook at a later time. That's kinda the best world for me, having clear git diffs and much reduced git traffic, while not having to reproduce output immediately to remember where I left off with my work.

michaelaye avatar Feb 16 '16 01:02 michaelaye

Sure, we could potentially refactor nbstripout into a call to the nbconvert CLI. However I'm not convinced that buys us anything. We're already using the nbconvert API and I don't think that can be much simplified.

But if one wouldn't want to install nbstripout (and doesn't care about the convenient (un)install facility ;)) one could formulate a git filter purely in terms of a call to the nbconvert CLI as sketched by @Carreau.

kynan avatar Feb 16 '16 20:02 kynan
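
What such a clean filter has to do can be sketched in a few lines: read the notebook from stdin, clear the outputs, and write the result to stdout, leaving the file on disk untouched. This is a simplified stand-in using only the `json` module, not nbstripout's or nbconvert's actual implementation; the names `strip_notebook` and `run_filter` are made up for illustration, and only nbformat v4 style notebooks with a top-level `cells` list are assumed.

```python
import json
import sys


def strip_notebook(nb):
    """Clear outputs and execution counts of all code cells in a notebook dict."""
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return nb


def run_filter(stdin=sys.stdin, stdout=sys.stdout):
    """Act as a stdin-to-stdout filter, the shape Git expects of a clean filter."""
    nb = json.load(stdin)
    json.dump(strip_notebook(nb), stdout, indent=1, ensure_ascii=False)
    stdout.write("\n")


# Saved as a script that calls run_filter(), this could be named as the
# 'clean' command of a Git filter; Git pipes the working-tree file in and
# stores whatever comes out, so the notebook on disk keeps its output.
```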

Nice reading this thread, thank you all for the discussions.

> My conclusion is that the nbstripout use with .gitattributes is currently the only way to have both a cleaned notebook in git while keeping outputs on local disk. Plus, it offers a simple way to activate and deactivate it for a respective git repo, while using the pre-save-hook would require some Jupyter config hackery to make it work here but not there, as Jupyter does not support profiles currently.

@michaelaye: exactly what I am looking for. Any way we could get a summary workflow (with examples) for how you're using nbstripout like this? Would be great to establish a common pattern for this for the Jupyter notebook community.

PS. I see that there are notes on nbstripout usage, however, feel that a little more prose coupled to a workflow would help in conceptual mapping. :)

nehalecky avatar Feb 22 '16 18:02 nehalecky

Sure, just let me finish this damn NASA proposal first (deadline Thursday). Kynan could have a go at simply adding what's missing to the README and I chime in with what I think is useful to understand or what I found in my searches of a workflow.

michaelaye avatar Feb 22 '16 18:02 michaelaye

Of course, this sounds great. @kynan, thanks again for this library, so crucial for collaborative analysis and development with Jupyter. Just awesome, really.

Please let me know how I can help.

nehalecky avatar Feb 22 '16 19:02 nehalecky

Thanks for the endorsement @nehalecky! I have to point out that this is based on work from @minrk and @mforbes has contributed a lot!

Would be great to have some more tests, so if you'd like to contribute some that'd be greatly appreciated!

kynan avatar Feb 23 '16 22:02 kynan

Hi @kynan, thanks for the reply. Sorry I haven't had time to contribute, but hope to do so soonish when we start using this library over the next couple of weeks. I'm still interested to see some sample workflows, but I understand everyone's time is limited. Hope to chime back in soon! Cheers.

nehalecky avatar Mar 02 '16 15:03 nehalecky

@nehalecky no worries! Documentation is still a bit lacking you're right. I demoed nbstripout at the PyData London meetup I co-organise on Tuesday. If I find some time I might turn this demo into a screencast, but no promises at this point.

kynan avatar Mar 03 '16 23:03 kynan