nbstripout icon indicating copy to clipboard operation
nbstripout copied to clipboard

Performance issue

Open michaelaye opened this issue 7 years ago • 13 comments

Everytime i go into a repo with lots of notebooks, it takes several seconds before i get my prompt back, which can be annoying... I'm wondering if the official PreProcessor of the notebook tools is faster than your manual filtering?

michaelaye avatar Aug 10 '16 17:08 michaelaye

I'm not sure how the prompt (of your shell?) can be affected by nbstripoutput, but if you want fast, try jq: http://janschulz.github.io/windows-dev-environment.html -> "proper diffs and commits for notebooks" (sorry, my homepage seems to not put in anchors in headlines :-( )

jankatins avatar Aug 10 '16 19:08 jankatins

oh, that's because my prompt (zsh->garrett) includes git status information, and the status is defined by git getting active, using the nbstripout filter to define the status for each notebook.

michaelaye avatar Aug 10 '16 19:08 michaelaye

Huih, I've currently 38 ipynb in a folder with a jq based stripoutput filter (*.ipynb filter=stripoutput) and it's fast in cmder (which also tells me the current git status). in my case git status is also fast (instant?). cmder uses git status --porcelain 2>nul to check if the dir is dirty...

jankatins avatar Aug 10 '16 19:08 jankatins

are u aware that git apparently is able to cache the status somehow? It's only slow for me for the first time entering the folder after I-dont-know-what-interval-or-action. After that it's seemless as well. 66 notebooks here though. ;)

michaelaye avatar Aug 10 '16 19:08 michaelaye

Thats sounds more like a filesystem cache (the git status manpage doesn't say anything about a cache for status information)? Do you have a ssd or hdd?

jankatins avatar Aug 10 '16 19:08 jankatins

ssd.

michaelaye avatar Aug 10 '16 19:08 michaelaye

@michaelaye Do you have core.preloadindex enabled? That might be it:

Enable parallel index preload for operations like git diff

This can speed up operations like git diff and git status especially on filesystems like NFS that have weak caching semantics and thus relatively high IO latencies. With this set to true, Git will do the index comparison to the filesystem data in parallel, allowing overlapping IO’s.

@janschulz Thanks for the pointer to jq. Not sure I'd want to introduce such a heavy dependency, but will have a play with it.

kynan avatar Aug 18 '16 19:08 kynan

FYI, I uploaded two versions of much faster nbstripout-like tools. They focus on being fast git strip out filters, without much else.

  1. nbstripout-fast - a pure python json version (it also supports rapidjson, but I found it to be slower)
  2. nbstripout-jq - jq version with a thin /bin/sh wrapper

As of this writing nbstripout-fast and nbstripout-jq are respectively 13 and 9 times faster than nbstripout with the notebooks I tested them with.

They aren't exactly the same as nbstripout, so you may need to apply a few minor tweaks if you need the exact behavior.

These were written for fastai - we ended up using the pure python version as it was the fastest.

Enjoy.

stas00 avatar Aug 17 '18 18:08 stas00

Thanks, @stas00! Feel free to send a PR for the rapidjson variant. I'd also be open to add the nbstripout-jq as a separate script.

kynan avatar Aug 19 '18 06:08 kynan

@kynan, the python rewrite is very different internally - so PR won't be possible. It doesn't use any of the nbconvert/nbformat stuff. it's faster than rapidjson or jq. On the other hand it only does the stripping out (no installation tools, validation, etc.).

I first tried to get nbstripout to use rapidjson or ujson, but neither has an identical API to json. So that didn't work.

Given that most likely the strip out has to be as fast as possible for using git with many notebooks, i'd say the original nbstripout needs to be split into 2 pieces - one optimized for speed of the critical path of doing the strip out and another for installation/configuration/validation/etc. If you do that then you can just use nbstripout-fast as is - with a few tweaks to support keep_outputs and and a few other flags. And leave the rest of the features in the other script.

Also, nbstripout-fast is already written in the whitelist format as you suggested here, except the fields are hardcoded and will probably need to be configurable to make it generic enough for anybody to use without changes to the source code.

And yes, feel free to incorporate nbstripout-jq in your repo. it too will need some tweaks to be identical to nbstripout. We dropped the jq version since pure python json rewrite nbstripout-fast is faster.

p.s. if you decide to integrate/adopt the code the -d option keeps outputs and also preserves a few other metadata fields that we use for documentation notebooks, which differs from code notebooks. Obviously this won't be useful for a general user.

stas00 avatar Aug 19 '18 17:08 stas00

Hi,

This is a really excellent project, but like above I found it had too much overhead on larger repos. I wrote a pure rust version (with python bindings so it can be pip installed) located at https://github.com/deshaw/nbstripout-fast (happed to chose the same name as @stas00). My testing shows this is ~200x faster. Really intersted to hear your thoughts @kynan. This is not a true a 1:1 match, but all key features should be included and a few more added.

Is there a way in which this project could let users choose to use the rust version? Your install/setups scripts are great and clearly this project is very very popular. I think some sort of linkage there would be a net postive for the community.

mlucool avatar Dec 08 '22 16:12 mlucool

This is a true a 1:1 match, but all key features should be included and a few more added.

Do you mean this is not a true 1:1 match?

Is there a way in which this project could let users choose to use the rust version?

In theory yes, however doesn't it feel a little odd to have one tool install another?

Can you give more detail on what exactly you're thinking of? How do you see nbstripouts installation work when choosing nbstripout-fast ? I also think this discussion should be moved to a separate issue.

kynan avatar Dec 10 '22 07:12 kynan

Do you mean this is not a true 1:1 match?

Yes - sorry!

I also think this discussion should be moved to a separate issue.

Opened https://github.com/kynan/nbstripout/issues/179

mlucool avatar Dec 10 '22 12:12 mlucool