nbdime
Improve the performance of nbdime
I have a notebook which is about 4.3 MB. There's only a small change to the notebook, but running nbdiff on it took 8 minutes on a very powerful machine.
I had a similar problem today.
I tried to diff a 16 MB notebook without success: RAM usage went through the roof, and I had to hit CTRL+C before it hung my system.
This produced a stack trace (maybe the signal handling could be improved?) which, paradoxically, pointed to a likely candidate for reducing RAM usage (and which I suspect is related to your problem):
This is the function nbdiff hung on:
https://github.com/jupyter/nbdime/blob/409898921d5dfb4e248fd791dfe4ca47776a0f6d/nbdime/diffing/seq_bruteforce.py#L15-L17
The function returns a list of lists, i.e., it stores len(A) * len(B) elements in memory. That list of lists is then iterated over by other functions in the call stack.
Returning a generator may help, but the logic of the calling functions would have to change as well: they currently assume random access to every element of the grid and that its dimensions are known. They would need to be rewritten in a streaming fashion (itertools may be helpful here); see the sketch below.
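To make the memory trade-off concrete, here is a minimal sketch of the pattern (not nbdime's actual code, just an illustration with plain sequences): an eager grid that materializes every comparison versus a lazy, row-at-a-time variant.

```python
import operator

def compare_grid_eager(A, B, compare=operator.eq):
    # Materializes all len(A) * len(B) comparison results at once;
    # this is the list-of-lists pattern that blows up memory on large inputs.
    return [[compare(a, b) for b in B] for a in A]

def compare_grid_lazy(A, B, compare=operator.eq):
    # Yields one row at a time, so only len(B) results are alive at any moment.
    # Callers lose random access and len(), so they would have to consume the
    # rows in a streaming fashion instead of indexing into a full grid.
    for a in A:
        yield [compare(a, b) for b in B]
```

The lazy variant cuts peak memory from O(len(A) * len(B)) to O(len(B)), but only if the callers never need to revisit earlier rows.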
I am not an avid Jupyter user. This problem was just tangential for me today, but it was glaring enough to be worth a look in nbdime's issue tracker, and here I am :smile:
Here's a quick profile of the long-running diff to get the analysis started. Maybe we'll discover that the problem is somewhere else :smile::
$ python -m cProfile -s cumtime venv/bin/nbdiff notebook1.ipynb notebook2.ipynb.old
[CTRL+C]
166281829 function calls (166075588 primitive calls) in 11.491 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
360/1 0.002 0.000 24.878 24.878 {built-in method builtins.exec}
1 0.000 0.000 24.878 24.878 nbdiff:3(<module>)
1 0.000 0.000 24.712 24.712 nbdiffapp.py:129(main)
1 0.000 0.000 24.708 24.708 nbdiffapp.py:30(main_diff)
1 0.000 0.000 24.708 24.708 nbdiffapp.py:50(_handle_diff)
1 0.000 0.000 24.176 24.176 notebooks.py:605(diff_notebooks)
48/1 0.000 0.000 24.176 24.176 generic.py:81(diff)
24/1 0.000 0.000 24.176 24.176 generic.py:193(diff_dicts)
33/1 0.000 0.000 24.176 24.176 generic.py:107(diff_sequence_multilevel)
33 0.000 0.000 24.163 0.732 snakes.py:35(compute_snakes_multilevel)
33 0.001 0.000 24.163 0.732 snakes.py:18(compute_snakes)
33 0.000 0.000 23.118 0.701 seq_bruteforce.py:63(bruteforce_compute_snakes)
33 0.000 0.000 23.116 0.700 seq_bruteforce.py:15(bruteforce_compare_grid)
33 0.330 0.010 23.116 0.700 seq_bruteforce.py:17(<listcomp>)
32/1 0.000 0.000 23.107 23.107 snakes.py:76(compute_diff_from_snakes)
3 0.000 0.000 22.076 7.359 notebooks.py:382(diff_single_outputs)
24 0.000 0.000 22.075 0.920 sequences.py:53(diff_strings_linewise)
2 0.000 0.000 22.074 11.037 notebooks.py:463(diff_mime_bundle)
3 0.001 0.000 22.074 7.358 notebooks.py:414(add_mime_diff)
24 0.000 0.000 22.065 0.919 generic.py:123(diff_lists)
165023450 8.345 0.000 8.345 0.000 {built-in method _operator.eq}
16 0.000 0.000 2.088 0.130 notebooks.py:269(compare_output_strict)
16 0.000 0.000 2.088 0.130 notebooks.py:181(compare_mimebundle_strict)
20 0.000 0.000 2.088 0.104 notebooks.py:154(compare_mimedata_strict)
20 0.019 0.001 2.088 0.104 notebooks.py:124(_compare_mimedata)
8 0.000 0.000 2.068 0.259 notebooks.py:115(_compare_mimedata_strings)
8 0.000 0.000 2.062 0.258 notebooks.py:54(_is_base64)
98 2.015 0.021 2.015 0.021 {method 'match' of 're.Pattern' objects}
47 0.013 0.000 1.089 0.023 {built-in method builtins.all}
182 0.000 0.000 1.069 0.006 notebooks.py:352(compare_cell_strict)
175 0.000 0.000 1.044 0.006 snakes.py:30(<genexpr>)
2 0.000 0.000 0.532 0.266 utils.py:25(read_notebook)
2 0.001 0.000 0.532 0.266 __init__.py:113(read)
- Python 3.8.2
- nbdime 2.0.0
- Linux
The key to getting this (and any other issue) fixed would be to share a reproducible example. Where are those megabytes located within the notebook? Are there any specific characteristics? E.g. are they all large HTML outputs with embedded base64 strings?
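As a quick way to answer that question for a notebook you cannot share, something like the following hypothetical helper (size_report is my own name, not part of nbdime) tallies serialized size per mime type using nbformat:

```python
import json
import sys

import nbformat

def size_report(path):
    # Roughly attribute a notebook's bytes to cell sources and output mime types.
    nb = nbformat.read(path, as_version=4)
    totals = {}  # key ("source" or mime type) -> total serialized size in bytes
    for cell in nb.cells:
        totals["source"] = totals.get("source", 0) + len(cell.get("source", ""))
        for out in cell.get("outputs", []):
            for mime, content in out.get("data", {}).items():
                totals[mime] = totals.get(mime, 0) + len(json.dumps(content))
    for key, size in sorted(totals.items(), key=lambda kv: -kv[1]):
        print(f"{size / 1e6:8.2f} MB  {key}")

if __name__ == "__main__":
    size_report(sys.argv[1])
```

Running it against the slow notebook should show whether the bulk of the bytes sits in, e.g., `image/png` or `text/html` outputs rather than in the sources.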
@muxator The non-optimal brute force is a known issue (#3), but it is typically not a major issue for cells/source inputs, as these tend to be small. But if someone has a symptomatic case which we can use as a baseline, then optimizing performance becomes much easier 👍
> if someone has a symptomatic case which we can use as a baseline, then optimizing performance becomes much easier
@vidartf, I agree.
I cannot share the specific notebook, because it was not written by me. I'll see if I can produce a redacted version with the same pathological characteristics, performance-wise.
This specific notebook was created by a scientist with deep knowledge of his domain, who happens to use a notebook to quickly prototype something, but who has no specific computer science background (and is not required to have one, after all).
I am not invested at all in the Jupyter ecosystem, so I may be wrong here, but this kind of user probably fits squarely in Jupyter's target demographic. With such a user base, I think worst-case performance problems are guaranteed to be hit :smile: Also, factor in that a user with a beefy development machine can sometimes be shielded from problematic code/data altogether.
Back to the point: I'll try to summarize the notebook contents asap. I hope I can find a way to redact it (or synthesize something similar).
There may be an alternative: @dclong, would it be easy for you to share an (anonymized) version of your problematic notebook? You are probably more proficient with Jupyter than I am.
In the meantime, a quick summary of my problematic notebook:
- Total size: 16 MB;
- of these, 13 MB is a single line of base64-encoded PNG data;
- the remaining 3 MB are distributed across 98k lines of JSON, yet there are only ~15 cells.
This probably ticks all the boxes for a pathological notebook :smile:, yet to the user's eyes it looks completely legitimate. A sketch for synthesizing something similar follows below.
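For the record, a notebook with roughly the characteristics summarized above could be synthesized with something like this sketch (the sizes, cell counts, and random payload are assumptions chosen to mimic my notebook, not taken from it):

```python
import base64
import os

import nbformat
from nbformat.v4 import new_code_cell, new_notebook, new_output

nb = new_notebook()

# One cell with a huge display_data output: ~10 MB of random bytes become
# ~13 MB of base64 text on a single logical line. Random bytes stand in for
# a real PNG, which should be enough to stress the diffing code paths.
big_png = base64.b64encode(os.urandom(10_000_000)).decode("ascii")
cell = new_code_cell(source="plot_something()  # hypothetical plotting call")
cell.outputs.append(new_output("display_data", data={"image/png": big_png}))
nb.cells.append(cell)

# Another ~14 cells whose sources add up to a few MB spread over ~98k lines.
for _ in range(14):
    lines = "\n".join(f"x_{j} = {j}  # filler" for j in range(7000))
    nb.cells.append(new_code_cell(source=lines))

nbformat.write(nb, "pathological.ipynb")
```

Diffing `pathological.ipynb` against a copy of itself with one small edit should then reproduce the slow path without exposing any real data.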