
Cache the imported AsyncToSync and SyncToAsync classes on the Local class

Open kezabelle opened this issue 2 years ago • 9 comments

See issue #269.

On first access, after they're imported the first time, hold a reference to them on the class itself. This nets a small increase in performance by avoiding going through the import machinery on every attribute access of a Local instance.

Note that these are bound directly to the Local class rather than stored as instance (self) attributes, since doing so in the simple way would raise a RecursionError: __getattr__ depends on _get_context_id, so checking for the attribute via getattr(self, ...) would re-enter __getattr__.
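As a minimal, self-contained sketch of the pattern (hypothetical names; json's classes stand in for the real asgiref.sync imports so this runs anywhere):

```python
import importlib

class Local:
    # Class-level cache: populated on first use and shared by every
    # instance, so instance __getattr__ never has to look it up via
    # getattr(self, ...) and cannot recurse into itself.
    _sync_classes = None

    @classmethod
    def _get_sync_classes(cls):
        if cls._sync_classes is None:
            # Lazy import to break a circular dependency; json stands in
            # for asgiref.sync in this sketch.
            mod = importlib.import_module("json")
            cls._sync_classes = (mod.JSONEncoder, mod.JSONDecoder)
        return cls._sync_classes
```

After the first call, every subsequent call returns the cached tuple without touching the import machinery.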

Using the same example as I provided in the issue:

In [1]: from asgiref.local import Local
   ...: x = Local()
   ...: def timeonly():
   ...:     setattr(x, 'test', None)
   ...:     for _ in range(100_000):
   ...:         getattr(x, 'test')
In [2]: %timeit timeonly()
329 ms ± 5.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [3]: %prun timeonly()
...
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   100001    0.139    0.000    0.313    0.000 local.py:47(_get_context_id)
   100000    0.080    0.000    0.502    0.000 local.py:107(__getattr__)
200001/100001    0.076    0.000    0.568    0.000 {built-in method builtins.getattr}
   100001    0.073    0.000    0.407    0.000 local.py:88(_get_storage)
   100001    0.067    0.000    0.123    0.000 sync.py:493(get_current_task)
   100001    0.029    0.000    0.044    0.000 tasks.py:34(current_task)
   100001    0.026    0.000    0.038    0.000 threading.py:1318(current_thread)
   200002    0.023    0.000    0.023    0.000 {built-in method builtins.hasattr}
...

vs prior to the patch:

In [2]: %timeit timeonly()
448 ms ± 8.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [3]: %prun timeonly()
...
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   100001    0.255    0.000    0.490    0.000 local.py:46(_get_context_id)
   100000    0.085    0.000    0.685    0.000 local.py:101(__getattr__)
200001/100001    0.076    0.000    0.750    0.000 {built-in method builtins.getattr}
   100001    0.072    0.000    0.584    0.000 local.py:82(_get_storage)
   100001    0.070    0.000    0.133    0.000 sync.py:493(get_current_task)
   100001    0.036    0.000    0.050    0.000 <frozen importlib._bootstrap>:398(parent)
   100001    0.030    0.000    0.046    0.000 tasks.py:34(current_task)
   200002    0.029    0.000    0.029    0.000 {built-in method builtins.hasattr}
   100001    0.027    0.000    0.040    0.000 threading.py:1318(current_thread)
...
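For anyone outside IPython, the same measurement can be sketched with the stdlib timeit module; threading.local stands in here so the snippet runs without asgiref installed (swap in asgiref.local.Local to reproduce the figures above):

```python
import threading
import timeit

# threading.local stands in for asgiref.local.Local in this sketch.
x = threading.local()

def timeonly():
    setattr(x, 'test', None)
    for _ in range(100_000):
        getattr(x, 'test')

# Best of a few runs, roughly analogous to %timeit's per-loop figure.
per_loop = min(timeit.repeat(timeonly, number=1, repeat=5))
print(f"{per_loop * 1000:.2f} ms per loop")
```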

This may still not be what Andrew was specifically hoping for, but given that touching self.x in the simple/naive way is problematic (see above), I'm opting for this as the simplest possible solution, from which discussion of improvements can follow.

kezabelle avatar Aug 16 '21 11:08 kezabelle

@smithdc1 gets pinged. As you unintentionally reminded me about this, it might be of interest to you. It might not. It's on your radar now, though, either way :)

kezabelle avatar Aug 16 '21 11:08 kezabelle

Ooh, I get the point of this, but it feels a little weird to look at. That said, you're entirely right, imports are expensive to do at runtime.

I think ideally I'd want a cached_import function that just does this logic internally, but that might be a little heavyweight for this use case. Were there any alternate ideas you considered?

andrewgodwin avatar Aug 16 '21 15:08 andrewgodwin

Were there any alternate ideas you considered?

Well, yes and no. The worst kind of answer :) There was the similar initial idea of just pulling from sys.modules in the original ticket, and I've tried doing it on self (stalling due to the additional complexity of having to modify __getattr__), at the end of __init__ (failing because the instance is only partially initialised), via __init_subclass__ (same partial-initialisation problem), etc.

I think ideally I'd want a cached_import function that just does this logic internally, but that might be a little heavyweight for this use case.

Depending on what you mean, I can see that working. I could put the implementation roughly as-is behind a staticmethod for example (about the lowest cost callable on a class IIRC), and thus keep its locality to the class and roughly equivalent improvement-wise. Throwing it into a wholly separate function would require either an LRU cache (of size 1 I suppose) or a memoization param I guess, or it would continue to mutate the class but be apart from it, and I don't know that that is better (IMHO).

kezabelle avatar Aug 16 '21 17:08 kezabelle

Honestly, having reconsidered it, maybe it is worth having a "lazy import" wrapper to keep this easier to read, and because asgiref is in the business of small optimisations at this point.

I feel like the sys.modules option is very doable?

def cached_import(module_name, item_name):
    if module_name not in sys.modules:
        importlib.import_module(module_name)
    return getattr(sys.modules[module_name], item_name)

Then I believe we could just do AsyncToSync = cached_import("asgiref.sync", "AsyncToSync")
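A self-contained version of that sketch, with the imports it needs and exercised against a stdlib module (collections stands in for asgiref.sync so it runs anywhere):

```python
import sys
import importlib

def cached_import(module_name, item_name):
    # Only go through the import machinery if the module isn't loaded yet;
    # subsequent calls are a sys.modules dict lookup plus a getattr.
    if module_name not in sys.modules:
        importlib.import_module(module_name)
    return getattr(sys.modules[module_name], item_name)

# collections stands in for "asgiref.sync" in this sketch.
OrderedDict = cached_import("collections", "OrderedDict")
```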

andrewgodwin avatar Aug 16 '21 18:08 andrewgodwin

Well, YMMV (between Python version differences and machine differences I can't guarantee this), but as far as I can tell, this is faster:

# Assumes a falsy class attribute, e.g. launch_map_classes = None,
# is defined on Local so the first access doesn't raise AttributeError.
@staticmethod
def get_sync_classes():
    if not Local.launch_map_classes:
        from .sync import AsyncToSync, SyncToAsync
        Local.launch_map_classes = (AsyncToSync, SyncToAsync)
    return Local.launch_map_classes

def _get_context_id(self):
    AsyncToSync, SyncToAsync = self.get_sync_classes()

than doing this:

def _get_context_id(self):
    AsyncToSync = cached_import("asgiref.sync", "AsyncToSync")
    SyncToAsync = cached_import("asgiref.sync", "SyncToAsync")

Mostly, I suspect, because it avoids that second invocation. Across the original IPython example upthread, the difference is consistent; first the cached_import method:

In [2]: %timeit -n10 -r5 timeonly()
357 ms ± 3.88 ms per loop (mean ± std. dev. of 5 runs, 10 loops each)
In [3]: %timeit -n10 -r5 timeonly()
374 ms ± 4.04 ms per loop (mean ± std. dev. of 5 runs, 10 loops each)
In [4]: %timeit -n10 -r5 timeonly()
377 ms ± 4.35 ms per loop (mean ± std. dev. of 5 runs, 10 loops each)

and then the staticmethod version:

In [2]: %timeit -n10 -r5 timeonly()
332 ms ± 4.2 ms per loop (mean ± std. dev. of 5 runs, 10 loops each)
In [3]: %timeit -n10 -r5 timeonly()
342 ms ± 3.3 ms per loop (mean ± std. dev. of 5 runs, 10 loops each)
In [4]: %timeit -n10 -r5 timeonly()
339 ms ± 3.08 ms per loop (mean ± std. dev. of 5 runs, 10 loops each)

I don't have a particular aversion to doing the cached import way if you prefer, I'm just here looking to reduce the distance between a threading.local and a Local for the same span of work (where the latter notably has to do a bunch more work which I don't fully understand ;)).

For reference, because I've not previously provided it (tut!), the baseline for a threadlocal on my same machine is 9.37 ms ± 1.39 ms per loop (mean ± std. dev. of 5 runs, 10 loops each) which is using the C implementation, so is inherently unfair but is what is being 'competed' against. The pure python (_threading_local.local) version is 279 ms ± 4.08 ms per loop (mean ± std. dev. of 5 runs, 10 loops each) for comparison.

kezabelle avatar Aug 16 '21 19:08 kezabelle

How about a cached_import variant that lets you have a list of items to import rather than a single item?

Honestly, I'm not in this to make this hyper-competitive with even the normal concept of threadlocal - having to keep track of the various inheritance paths across threads is always going to make this slower. It's just here as a stopgap to allow projects like Django to convert without having to make everything a contextvar up front.

andrewgodwin avatar Aug 18 '21 04:08 andrewgodwin

How about a cached_import variant that lets you have a list of items to import rather than a single item?

Seems to be a bit slower than doing 2 separate cached_import calls, unfortunately. Presumably the book-keeping of creating a listcomp or genexp (return [getattr(module, item_name) for item_name in item_names]) is a net loss, but I didn't dissect it much. It rolled out at an average of 370-380ms give or take across multiple distinct runs. So better than baseline, but worse than other variations, and maybe drawing equal with 2 separate cached_import calls. Hard to say given the timescales and a laptop which will start to throttle if I run too many tests in quick succession ;)
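The list-of-items variant under discussion would look roughly like this (cached_import_many is a hypothetical name; collections stands in for asgiref.sync so the sketch runs standalone):

```python
import sys
import importlib

def cached_import_many(module_name, item_names):
    # One sys.modules check, then a comprehension over the requested
    # names; the comprehension is the book-keeping cost mentioned above.
    if module_name not in sys.modules:
        importlib.import_module(module_name)
    module = sys.modules[module_name]
    return [getattr(module, name) for name in item_names]

# collections stands in for "asgiref.sync" in this sketch.
deque, OrderedDict = cached_import_many("collections", ["deque", "OrderedDict"])
```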

Honestly, I'm not in this to make this hyper-competitive with even the normal concept of threadlocal - having to keep track of the various inheritance paths across threads is always going to make this slower.

Absolutely. The stark difference between the C threadlocal implementation and the Python one suggests that the only step which would make it competitive is (*shudder*) rewriting it in C. And I promise you I'm not asking anyone to do that :) The best that could feasibly be done is "competitive with the pure-Python threadlocal implementation" (having looked at the hoops it jumps through, I'm not even going to suggest going toe-to-toe with it; it's far more arcane than asgiref's Local), so once you're edging close to the 300ms region (on my machine), you're probably at the best you'll get.

It's just here as a stopgap to allow projects like Django to convert without having to make everything a contextvar up front.

Sure, I understand that (and it's an admirable accomplishment to have got there, I can't state that enough), but I personally don't know what the timeframe of that migration might be for any single project (eg: Django's next minimum version is 3.8 I think ...), let alone multiple dependents. I'm just wanting to minimise the overall cost of the transition period, however long that might be (well, Django 3.2 LTS is good until what, 2024 I think?). And I don't think it's possible to phrase this without sounding rude, but it's not meant to be: the downstream users (developers, customers) of projects like Django didn't ask for a replacement for threadlocal (except where they did...) -- though they may ultimately appreciate getting one, sooner or later -- in the same way no-one asks for the Python version of threadlocal when they import it; it's just what's transparently given if the faster one isn't available.

I don't think we're talking at cross purposes, but I think we've maybe got different ideas on goals, which wouldn't be surprising given only one of us has to maintain it -- cards on the table (and with utter respect, lest text not convey it) I don't see the point of the proposed cached_import mechanism; ignoring it being slower (for my single machine benchmarks, YMMV naturally), it's a utility function whose only usage would be to solve this "problem" until such time as another circular import is necessary, which might never happen. In which case it's perhaps neater, but not any more useful.

As such, at this juncture, I'd recommend we stop pitching ideas at each other (and missing) and just implement whichever one you feel comfortable supporting that gives the best improvement, which is cached_import as far as I can tell, and I think the two-separate-calls version, though I'll have to check more rigorously. Saving 70ms or so over 100,000 getattrs isn't bad, and saving another 30-50ms in a way you appear to dislike probably isn't worth our combined time and effort.

If only mypyc or cython were helpful here and we could ship prebuilt ... (they're not, I checked. cython nets you about another 20ms naively compiling the extant file, and mypyc just doesn't work because most dunder methods don't get assigned correctly)


Edit to add: those criticisms of cached_import only apply to this specific discussion. It's elegant enough, and enough of an improvement, that used en masse across multiple things it'd probably be worthwhile, and I could see it benefiting a project which does more circular-import shenanigans, like, say, Django proper; perhaps regardless of the outcome here, it's worth considering there. Off the top of my head, get_normalized_value gets called a bunch and has to go through the import machinery every time, for example.

kezabelle avatar Aug 18 '21 11:08 kezabelle

Fair enough - let's just go with double cached_import for now then.

andrewgodwin avatar Aug 28 '21 20:08 andrewgodwin

FYI, the current version of cached_import() which was added to Django recently looks like this:

import sys
from importlib import import_module

def cached_import(module_path, class_name):
    # Check whether module is loaded and fully initialized.
    if not (
        (module := sys.modules.get(module_path)) and
        (spec := getattr(module, '__spec__', None)) and
        getattr(spec, '_initializing', False) is False
    ):
        module = import_module(module_path)
    return getattr(module, class_name)

I guess as this package supports Python < 3.8 we'd need to convert to something like the following:

import sys
from importlib import import_module

def cached_import(module_path, class_name):
    # Check whether module is loaded and fully initialized.
    ready = False
    module = sys.modules.get(module_path)
    if module:
        spec = getattr(module, '__spec__', None)
        if spec and getattr(spec, '_initializing', False) is False:
            ready = True
    if not ready:
        module = import_module(module_path)
    return getattr(module, class_name)
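A quick sanity check of that pre-walrus variant, copied verbatim into a standalone sketch and exercised against a stdlib module (collections stands in for the asgiref.sync use case):

```python
import sys
from importlib import import_module

def cached_import(module_path, class_name):
    # Pre-walrus version from the comment above, reproduced so this
    # check is self-contained. Check whether the module is loaded and
    # fully initialized before trusting the sys.modules entry.
    ready = False
    module = sys.modules.get(module_path)
    if module:
        spec = getattr(module, '__spec__', None)
        if spec and getattr(spec, '_initializing', False) is False:
            ready = True
    if not ready:
        module = import_module(module_path)
    return getattr(module, class_name)

# Repeated calls resolve to the same object without re-importing.
assert cached_import("collections", "OrderedDict") is cached_import("collections", "OrderedDict")
```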

ngnpope avatar Oct 25 '21 18:10 ngnpope