
Improving Network Performance

Open · mlucool opened this issue 3 years ago • 14 comments

How can we optimally transfer assets to Jupyter clients (web browsers)?

Hypothesis: HTTP2 (i.e. no head-of-line blocking) and compression would meaningfully improve page load and large notebook load performance.

Experiment: Create an nginx config that adds SSL/HTTP2/compression and use it as a simple reverse proxy in front of a jupyterlab 2.x server. Then use Chrome dev tools to understand changes to performance. In this setup my server and browser are not in the same physical location, but are connected by a high-speed network. I had exactly one location block, so static assets still came via Tornado.

Conclusion: Surprisingly, these technologies, when naively put on top of a jupyterlab 2.x server, did not make a meaningful difference. The reverse proxy decreased the size of small assets but increased page load time by ~10%. For large assets it clearly shrank their size by a large amount, 10x-23,000x (the latter being a generated, very compressible test notebook), but the time to compress them on the fly meant there were minimal gains to be had. The ~10MB vendors bundle I had was compressed to 2.5MB but took longer to get to the browser. A 33MB notebook shrank to 1.6kb but still took about 30s either way. I'll note that most of my notebooks are small (<5MB).

I ran a second experiment where I put the notebook directly behind the same nginx server. In this case I was able to download the 33MB notebook in ~100ms!

In my view this experiment points to some large gains that can be had by letting assets skip the python server or thinning out the code path between the two. A few suggestions:

  1. Create a document for how to configure nginx/apache in front of Jupyter Server. This document should include tips on the right settings to skip the Jupyter server for certain assets (e.g. anything in /static).
  2. If an asset is in /static, Jupyter should treat it as static and set the right headers (today I see no-cache set, for example). Doing 1 should help people do this automatically, but doing it in jupyter_server may be useful for the average case (see the sketch after this list).
  3. Dig into where the 30s is going when sending a large notebook. My guess is we'll need to skip over some steps in python or preload them in memory, but that's just a hunch and we'll need more data.
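
For suggestion 2, here is a minimal sketch of how a Tornado static handler could mark assets as long-lived. set_extra_headers is a real Tornado hook, but the handler name and the exact Cache-Control policy are illustrative assumptions, not what jupyter_server does today.

from tornado.web import StaticFileHandler


class CachedStaticFileHandler(StaticFileHandler):
    # Illustrative only: serve /static assets with long-lived cache headers.

    def set_extra_headers(self, path):
        # Hashed bundles never change under the same name, so they can be
        # cached aggressively instead of the no-cache seen today.
        self.set_header("Cache-Control", "public, max-age=31536000, immutable")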

Pictures are worth 1000 words:

No optimizations page load: image

Nginx page load: image

No optimizations large notebook: image

Nginx large notebook: image

Directly sending the notebook (renamed to foo.json): image

cc @goanpeca

mlucool avatar Sep 22 '20 15:09 mlucool

Hi, @mlucool thanks for the information. I started digging on this and found the following bottlenecks:

Trying out the notebook you shared, the following lines are taking this amount of time on my computer:

  • Read: 23.13 secs, and this is directly related to nbformat.read()
  • Trust: 0.9 secs
  • Validate: 12.43 secs, and this is directly related to nbformat.validate

So nbformat.read and nbformat.validate seem to be the biggest problems here.

Now that this is isolated I can start looking directly into nbformat to find what can be improved.

goanpeca avatar Sep 23 '20 02:09 goanpeca

A simple script to test this.

import nbformat
import time

TEST_FILE = '50000-errors.ipynb'


def test():
    as_version = 4
    start_time = time.time()
    print("Start:\t0.00")

    # Parse the notebook; note that nbformat.read also validates internally.
    with open(TEST_FILE, 'r', encoding='utf-8') as f:
        model = nbformat.read(f, as_version=as_version)

    # Times printed below are cumulative seconds since start.
    print("Open:\t" + str(round(time.time() - start_time, 2)))

    # Explicit validation, as jupyter_server does after reading.
    nbformat.validate(model)
    print("Valid:\t" + str(round(time.time() - start_time, 2)))


if __name__ == "__main__":
    test()

Yields in seconds:

Start:   0.00
Open:   10.78
Valid:  21.0

goanpeca avatar Sep 23 '20 02:09 goanpeca

Some more progress: it seems the read method is already calling validate internally, so in jupyter_server we are validating twice:

  1. On the read (nbformat does this)
  2. On jupyter_server we are calling validate on an already validated notebook (it was validated on read)

So we could probably remove the extra validation in jupyter_server, or have an optional parameter on nbformat.read to not perform validation. Any thoughts on why validation is being performed twice @kevin-bates, @Zsailer? Maybe there is a historical reason?

This would already cut the time in half. Even so, validation alone is still taking ~10 seconds, which seems like far too much time.
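
To make that concrete, a rough sketch of the difference, assuming the file is already a v4 notebook so no version conversion is needed (nbformat.from_dict performs no validation):

import json
import time

import nbformat

TEST_FILE = '50000-errors.ipynb'

# Current flow: nbformat.read() validates internally, then we validate again.
start = time.time()
with open(TEST_FILE, 'r', encoding='utf-8') as f:
    nb = nbformat.read(f, as_version=4)
nbformat.validate(nb)
print("read + second validate:", round(time.time() - start, 2))

# Single-validation flow (sketch): parse the JSON ourselves, then validate once.
start = time.time()
with open(TEST_FILE, 'r', encoding='utf-8') as f:
    nb = nbformat.from_dict(json.load(f))
nbformat.validate(nb)
print("parse + validate once:", round(time.time() - start, 2))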

Looking into validation now.

Some reference docs:

  • https://www.peterbe.com/plog/jsonschema-validate-10x-faster-in-python
  • https://horejsek.github.io/python-fastjsonschema/
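
For reference, a minimal illustration of the difference between the two APIs, using a toy schema rather than the real nbformat schema:

import fastjsonschema
import jsonschema

schema = {
    "type": "object",
    "properties": {"cells": {"type": "array"}},
    "required": ["cells"],
}
notebook_like = {"cells": []}

# jsonschema interprets the schema dict on every call.
jsonschema.validate(instance=notebook_like, schema=schema)

# fastjsonschema compiles the schema to generated Python code once;
# validation is then a plain function call, which is where the speedup comes from.
validate = fastjsonschema.compile(schema)
validate(notebook_like)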

goanpeca avatar Sep 23 '20 02:09 goanpeca

Thx @goanpeca for these findings. Not sure if there is an historical reason for this. I guess if we don't find a reason, we should ensure no double validation is applied.

If we don't want to depend on an nbformat release, we could just remove the double validation in jupyter_server. However, to bring faster JSON parsing/validation, we would need to change nbformat and cut a release anyway. This could be done in 2 steps: remove, and then optimize.

PS: Any change in jupyter_server will be useful for jupyterlab3. If we want to have this for jupyterlab 2.x, we would need to apply those changes and cut a notebook (https://github.com/jupyter/notebook) release.

echarles avatar Sep 23 '20 06:09 echarles

Hi @echarles, thanks for the feedback!

PS: Any change in jupyter_server will be useful for jupyterlab3. If we want to have this for jupyterlab 2.x, we would need to apply those changes and cut a notebook (https://github.com/jupyter/notebook) release.

Can do :-)

Are there any thoughts on actually replacing jsonschema with fastjsonschema?

I will run some benchmarks to compare, but it seems like an easy win (if all tests pass, of course 🙃).

goanpeca avatar Sep 23 '20 12:09 goanpeca

Bumping this issue. It would be good to be able to serve static assets via nginx/apache as we could pre-compress them and use HTTP2 (which tornado does not support).

One idea is jupyter-server could have an opt-in feature that creates a directory of symlinks to static assets. E.g. if I set app-dir to /foo/bar, then the generated static_dir_for_proxy could look like:

static->/foo/bar/static

The same would also be done for extensions (/lab/extensions/*/static/*).

After this feature, we can document a minimal nginx config that adds HTTP2/compression and intercepts static resources before they reach Tornado.
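
A rough sketch of what the opt-in generation could look like; the function name, static_dir_for_proxy, and the extension layout are all hypothetical, not part of jupyter_server today:

from pathlib import Path


def build_static_dir_for_proxy(app_dir, target="static_dir_for_proxy"):
    # Hypothetical helper: build a flat directory of symlinks a reverse proxy can serve.
    out = Path(target)
    out.mkdir(parents=True, exist_ok=True)

    link = out / "static"
    if not link.exists():
        link.symlink_to(Path(app_dir) / "static", target_is_directory=True)

    # Extension assets (lab/extensions/*/static) would get similar links.
    for ext_static in Path(app_dir).glob("extensions/*/static"):
        ext_link = out / "extensions" / ext_static.parent.name / "static"
        ext_link.parent.mkdir(parents=True, exist_ok=True)
        if not ext_link.exists():
            ext_link.symlink_to(ext_static, target_is_directory=True)
    return out

# e.g. build_static_dir_for_proxy("/foo/bar") yields static_dir_for_proxy/static -> /foo/bar/static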

Thoughts?

mlucool avatar Feb 24 '22 16:02 mlucool

I can't really comment on static asset performance, but did want to comment on something @goanpeca posted nearly a year and a half ago (and I apologize for not seeing this, or responding, sooner!):

Some more progress: it seems the read method is already calling validate internally, so in jupyter_server we are validating twice:

  1. On the read (nbformat does this)
  2. On jupyter_server we are calling validate on an already validated notebook (it was validated on read)

So we could probably remove the extra validation in jupyter_server, or have an optional parameter on nbformat.read to not perform validation. Any thoughts on why validation is being performed twice @kevin-bates, @Zsailer? Maybe there is a historical reason?

This would already cut the time in half. Even so, validation alone is still taking ~10 seconds, which seems like far too much time.

Looking into validation now.

A potential 50% savings on notebook fetches is significant and something, I believe, we should take a closer look at! The methods in question (read and validate) were written 7 to 8 years ago, so I can't comment on their history. I suspect the reason for the second validate call is that read doesn't raise an exception on validation failures, and the (then notebook) server wanted to surface any potential validation issue to the frontend.

Unfortunately, 7 to 8 years is lots of water under the bridge and, as you point out, it seems like the only backward compatible change would be to add an optional parameter that indicates validation be skipped, knowing the server would circle back to perform validation itself. Had this been 8 years ago, it may have been better to raise ValidationError, but then the callers of read would not (unconditionally) receive the content, which, today, we also include in the invalid notebook model.

Both read() and reads() nbformat methods would need to be plumbed with an optional skip_validation parameter, but, assuming that can be done in a backward-compatible manner, it seems worthwhile for a 50% savings.
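
To make the proposal concrete, a sketch of how that plumbing might look, loosely modeled on the structure of today's reads(); reader.reads, convert, NO_CONVERT, and validate are real nbformat names, while the skip_validation flag is the hypothetical addition:

from nbformat import NO_CONVERT, reader, validate
from nbformat.converter import convert


def reads(s, as_version, skip_validation=False, **kwargs):
    # Hypothetical signature: skip_validation does not exist in nbformat today.
    nb = reader.reads(s, **kwargs)    # parse the JSON into a NotebookNode
    if as_version is not NO_CONVERT:
        nb = convert(nb, as_version)  # upgrade to the requested format version
    if not skip_validation:
        validate(nb)                  # the implicit pass the server could then skip
    return nb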

Another thing worth trying would be to set the env var NBFORMAT_VALIDATOR to fastjsonschema and see what kind of difference that makes (I see you added this right about the same time this thread was being discussed - thank you!). If significant, we could add a configurable that makes that a bit easier to deal with (and defaults to fastjsonschema).
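
A quick way to try that locally (a sketch; fastjsonschema must be installed, and it is safest to set NBFORMAT_VALIDATOR before nbformat is imported since the choice may be cached):

import os

# Pick the validator before importing nbformat.
os.environ["NBFORMAT_VALIDATOR"] = "fastjsonschema"

import time
import nbformat

TEST_FILE = '50000-errors.ipynb'

start = time.time()
with open(TEST_FILE, 'r', encoding='utf-8') as f:
    nb = nbformat.read(f, as_version=4)
nbformat.validate(nb)
print("read + validate with fastjsonschema:", round(time.time() - start, 2))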

@goanpeca - would you be able to attach a copy of your '50000-errors.ipynb' notebook so we could have similar comparisons? Since you added fastjsonschema support, have you gained any other insights about this that might dissuade us from moving forward with this particular improvement? How did it improve the timings listed above?

kevin-bates avatar Feb 25 '22 21:02 kevin-bates

@goanpeca was working on this for some of our use cases.

If significant, we could add a configurable that makes that a bit easier to deal with (and defaults to fastjsonschema).

This has been great. We have been using NBFORMAT_VALIDATOR for ~1.5 years and have never had an issue with it. I'd recommend it as the default as you suggested, or at least changing the default to use it if it's installed: os.environ.get("NBFORMAT_VALIDATOR", "fastjsonschema" if fastjsonschema else "jsonschema")
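
In other words, something along these lines (a sketch of the suggested default selection, not current nbformat code):

import importlib.util
import os

# Prefer fastjsonschema when it is importable, but keep the env var as an override.
_has_fast = importlib.util.find_spec("fastjsonschema") is not None
_default = "fastjsonschema" if _has_fast else "jsonschema"
VALIDATOR_NAME = os.environ.get("NBFORMAT_VALIDATOR", _default)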

Both read() and reads() nbformat methods would need to be plumbed with an optional skip_validation parameter, but, assuming that can be done in a backward-compatible manner, it seems worthwhile for a 50% savings.

I think this is worthwhile too.

would you be able to attach a copy of your '50000-errors.ipynb' notebook so we could have similar comparisons?

We have made this public in jupyterlab/benchmarks: 50000-errors. FWIW, locally this takes just under 3s before it starts sending the file, even with fastjsonschema. If you use a newer version of lab, lab won't blow up on you due to the work done to limit the number of mime renders per cell.

I can't really comment on static asset performance

I can bring this idea up for discussion at an upcoming jupyter-server meeting if that's a better way to move this forward.

mlucool avatar Feb 25 '22 21:02 mlucool

This is great info Marc - thank you!

This has been great. We have been using NBFORMAT_VALIDATOR for ~1.5 years and have never had an issue with it. I'd recommend it as the default as you suggested, or at least changing the default to use it if it's installed: os.environ.get("NBFORMAT_VALIDATOR", "fastjsonschema" if fastjsonschema else "jsonschema")

Given the stability you've seen, this seems like an obvious thing to do. We could discuss whether this should be an opt-in (or opt-out) feature - perhaps flipping the default on a major release, for example.

Both read() and reads() nbformat methods would need to be plumbed with an optional skip_validation parameter, but, assuming that can be done in a backward-compatible manner, it seems worthwhile for a 50% savings.

I think this is worthwhile too.

Hmm, I see double validation is also performed when writing a notebook, so we'd probably want to do something similar for those methods as well. I'm curious whether @MSeal has an opinion on how to address the read/write methods - whether an optional "skip_validation" parameter would be reasonable, or some other approach.

I can't really comment on static asset performance

I can bring this idea up for discussion at an upcoming jupyter-server meeting if that's a better way to move this forward.

That's probably best - thank you.

kevin-bates avatar Feb 25 '22 23:02 kevin-bates

I played around making fastjsonschema the default in a branch. Unfortunately it isn't a drop-in replacement for jsonschema, in that it does not recover as well from errors, since the errors it gives are opaque like "data must be valid exactly by one of oneOf definition". There is a feature request to add more information to the errors: https://github.com/horejsek/python-fastjsonschema/issues/72

blink1073 avatar Apr 02 '22 02:04 blink1073

From what I have seen, it's very rare to have an error. If you wanted to push this forward without waiting for https://github.com/horejsek/python-fastjsonschema/issues/72, you could use fastjsonschema and if that fails, fall back to jsonschema.
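
Roughly like this, as a sketch of the suggested fallback; the helper name and passing the schema explicitly are illustrative, not real nbformat internals:

import fastjsonschema
import jsonschema


def validate_fast_with_fallback(nb_dict, schema):
    # Fast path for the common (valid) case; in real code, compile once and reuse.
    fast_validate = fastjsonschema.compile(schema)
    try:
        fast_validate(nb_dict)
    except fastjsonschema.JsonSchemaException:
        # Invalid notebooks are rare; re-run jsonschema to get a precise error
        # message and path instead of the opaque fastjsonschema one.
        jsonschema.validate(instance=nb_dict, schema=schema)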

If not, maybe we can at least advertise this option more and mention the downsides.

mlucool avatar Apr 02 '22 11:04 mlucool

Yeah that makes sense, I'll try that logic.

blink1073 avatar Apr 02 '22 11:04 blink1073

cf https://github.com/jupyter/nbformat/pull/262

blink1073 avatar Apr 02 '22 12:04 blink1073

Bringing up a couple of ideas discussed on this request, in addition to the nbformat/fastjsonschema fix:

One idea is jupyter-server could have an opt-in feature that creates a directory of symlinks to static assets.

This has been brought up again during a discussion of compression of static assets (https://github.com/jupyterlab/jupyterlab/issues/13189). I'll try this setup and report back with numbers here.

suganya-sk avatar Oct 08 '22 03:10 suganya-sk