
Fixed #32172 -- Adapted signals to allow async handlers.

Open kozzztik opened this issue 3 years ago • 18 comments

Proposed implementation for ticket https://code.djangoproject.com/ticket/32172

I tried different options:

  1. Adapting inside the "send" method itself. This made the implementation very complex: we would need to pass information about what kind of call (sync or async) is expected, and the only channel for that is kwargs. However, kwargs are reserved for the signal's own payload, and adding a new one could clash with names used in Django-based projects. So I decided on a separate implementation.

  2. Parallel execution. The asyncio-idiomatic approach would be to run all receivers in parallel to get the full benefit of asyncio. However, since sync_to_async creates a thread for every sync receiver, this could spawn so many threads that it would completely kill performance.

So finally, this is a separate method with almost the same logic, adapted for asyncio.
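Roughly, the shape is this (an illustrative simplification, not the PR's actual code; the plain `_receivers` list stands in for Django's weakref-based receiver registry):

```python
import asyncio

class Signal:
    """Simplified sketch of a signal with a separate async send path."""

    def __init__(self):
        self._receivers = []

    def connect(self, receiver):
        self._receivers.append(receiver)

    def send(self, sender, **kwargs):
        # Classic synchronous dispatch: call every receiver in turn.
        return [(r, r(sender=sender, **kwargs)) for r in self._receivers]

    async def send_async(self, sender, **kwargs):
        # Same logic adapted for asyncio: receivers are awaited one by
        # one rather than gathered in parallel, to avoid sync_to_async
        # spawning a thread per sync receiver.
        responses = []
        for receiver in self._receivers:
            responses.append((receiver, await receiver(sender=sender, **kwargs)))
        return responses
```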

kozzztik avatar Nov 07 '20 10:11 kozzztik

FYI @carltongibson

kozzztik avatar Nov 09 '20 14:11 kozzztik

Hi @kozzztik, yes I've seen it thanks— good job! — I was taking the weekend off though, and haven't got round to this just yet. 😀

@andrewgodwin Can I ask for your initial thoughts here?

  • I guess we do want the send() call to be awaitable, so we can hand off control
  • Do we want to allow (existing sync) send() to be able to handle either type of handler too (like middleware)?

carltongibson avatar Nov 09 '20 15:11 carltongibson

Yup, the format of send_async() there looks good - matches the convention we're going for of name_async, and we can't override pure send() for obvious reasons.

I would like to see both send() and send_async() handle both kinds of receivers, though, that's for sure. I'd also like to make sure we do some performance testing on this, as signals have been a performance bottleneck in the past.
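A hedged sketch of what "both kinds of receivers on both paths" could look like; the function names are illustrative, and asyncio.run / run_in_executor stand in for asgiref's async_to_sync / sync_to_async to keep the sketch dependency-free:

```python
import asyncio
import functools

def dispatch_sync(receivers, sender, **kwargs):
    # Sync path that tolerates async receivers. Django would use
    # asgiref's async_to_sync here; asyncio.run is a stand-in (and,
    # like async_to_sync, it cannot be called while an event loop is
    # already running in this thread).
    responses = []
    for receiver in receivers:
        if asyncio.iscoroutinefunction(receiver):
            responses.append((receiver, asyncio.run(receiver(sender=sender, **kwargs))))
        else:
            responses.append((receiver, receiver(sender=sender, **kwargs)))
    return responses

async def dispatch_async(receivers, sender, **kwargs):
    # Async path that tolerates sync receivers by pushing each one to
    # a worker thread (the role sync_to_async plays in the PR).
    loop = asyncio.get_running_loop()
    responses = []
    for receiver in receivers:
        if asyncio.iscoroutinefunction(receiver):
            responses.append((receiver, await receiver(sender=sender, **kwargs)))
        else:
            call = functools.partial(receiver, sender=sender, **kwargs)
            responses.append((receiver, await loop.run_in_executor(None, call)))
    return responses
```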

andrewgodwin avatar Nov 09 '20 16:11 andrewgodwin

@carltongibson @andrewgodwin I added support for async receivers to the sync methods too (both send and send_robust)

kozzztik avatar Nov 10 '20 10:11 kozzztik

@carltongibson @andrewgodwin do we need anything else?

kozzztik avatar Nov 12 '20 08:11 kozzztik

Hi @kozzztik — I haven't had a chance to look at this yet. Will get to you as soon as I can. Thanks for your patience.

carltongibson avatar Nov 12 '20 08:11 carltongibson

@andrewgodwin added documentation. However, I'm not a native speaker, so feel free to correct these changes.

kozzztik avatar Nov 15 '20 12:11 kozzztik

Hi @kozzztik — thanks for this, it looks good. I'm looking at tweaks to the documentation now.

Whilst I have you and @andrewgodwin here though, I wanted to quickly point to the response.close() call in send_response() that dispatches request_finished:

https://github.com/django/django/blob/9f72b0970db4be76d14bd7be5a21582d2cc484bc/django/core/handlers/asgi.py#L262

This would be a separate ticket but, @andrewgodwin do you have thoughts here? (Do we need a close_async() as well? 😬) — I ask since I'm keen to see if we can get to fully async request-response pathway, and that looks like the last blocker. (Yes, you have to remove all sync signals, and middleware, but it would be available at least.)
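For illustration only, a hypothetical close_async() could look roughly like this; the Signal class below is a stub standing in for django.dispatch.Signal with this PR's send_async(), and none of the names here are settled API:

```python
import asyncio

class Signal:
    # Stub for django.dispatch.Signal with this PR's send_async();
    # just enough for the sketch.
    def __init__(self):
        self._receivers = []
    def connect(self, receiver):
        self._receivers.append(receiver)
    async def send_async(self, sender, **kwargs):
        return [(r, await r(sender=sender, **kwargs)) for r in self._receivers]

request_finished = Signal()  # stand-in for django.core.signals.request_finished

class Response:
    # Hypothetical close_async() mirroring what HttpResponse.close()
    # does today: run resource closers, then fire request_finished.
    def __init__(self):
        self._resource_closers = []
        self.closed = False

    async def close_async(self):
        for closer in self._resource_closers:
            closer()
        self.closed = True
        await request_finished.send_async(sender=self.__class__)
```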

Then there's this:

I'd also like to make sure we do some performance testing on this, as signals have been a performance bottleneck in the past.

At the least we need to verify we're not slowing down the WSGI pathway… 🧐

carltongibson avatar Nov 19 '20 11:11 carltongibson

I guess we need close_async, as it hides a signals.request_finished.send call inside. It would be a dream to have an AsyncHttpResponse, but that is too much ) Who can do the performance check (or share how it's done, though I'm not sure that would be faster)? I don't know how it is usually done.

kozzztik avatar Nov 19 '20 11:11 kozzztik

@kozzztik I was just pondering that.

I think just a simple app that creates several hundred objects in a view, just logging in handlers..., then deleting them.

Run with timeit or Apache bench. Make sure it's not slower vs master.

I think that would be enough.

Happy to help on a simple app if you want?
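Something of this shape would probably do; this is a dependency-free stand-in where the real test would time django.dispatch.Signal.send on master vs. this branch:

```python
import timeit

# Sketch of the pure-signal benchmark: one signal, 100 no-op
# receivers, time 10,000 sends. The plain dispatch loop below stands
# in for django.dispatch.Signal.send to keep the sketch runnable
# without Django installed.
receivers = [lambda sender, **kw: None for _ in range(100)]

def send(sender):
    return [(r, r(sender=sender)) for r in receivers]

elapsed = timeit.timeit(lambda: send(None), number=10_000)
print(f"10,000 sends x 100 receivers: {elapsed:.3f}s")
```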

carltongibson avatar Nov 19 '20 12:11 carltongibson

@carltongibson, hah, I had just assumed there was a known way to do such tests ) OK, don't worry, I will build a small app for that.

kozzztik avatar Nov 19 '20 13:11 kozzztik

Super. Thanks for your input @kozzztik 🏅

carltongibson avatar Nov 19 '20 13:11 carltongibson

I've done the performance tests and the results are crazy. Tests are here and results are here

There are 3 types of tests:

  1. Pure signal test. Create a signal with 100 receivers and measure the time needed for send.
  2. Full-cycle WSGI. Create a Django WSGI instance and execute a simple empty view, with 100 receivers added for the request start and close signals. The time per request is measured.
  3. Same as 2, but through the ASGI application.

All tests were run on master and on this branch for comparison. Each test configuration (test + branch + options) was run in 5 series of 1000-10000 requests.

In summary:

  1. Pure send became about twice (2.05-2.27 times) as slow. That is expected; it is always a trade-off, and this is the place that should get slower in exchange for everything else working faster. I tried replacing the receivers loop with one big generator expression in send; it got slightly better (2.18), but not by much.
  2. WSGI with sync handlers became 1.47 times slower with 100 receivers. As it is not realistic to expect that much signal traffic, I reduced the number of receivers to 10 (20 calls in total across request start and close). That improved it to 1.10, but it is still noticeable.
  3. ASGI. Here everything got worse. First, it simply hangs in 50% of cases; I can't even stop it in a debugger to analyse the deadlock. For the cases where it does work, I decreased the number of requests from 100k to 10k, and it then gives timings comparable with WSGI, so it is ultimately about 10 times slower even on the master branch. The timings also vary a lot, as seen in the deltas between measurements and the average value. And that's not all: when I switched to async receivers, everything became another 10 times slower. I inspected what was going on and found this line. blocking=false plus time.sleep gives performance no chance ) I see that was fixed while I was running the tests, so I will retry with the new version.

kozzztik avatar Nov 20 '20 13:11 kozzztik

Finally, even though it got better, I'm not sure that wrapping sync/async inside Django's internals is a good idea. It eats performance badly, and the benefit is not obvious. For example, a receiver can be sync but have no blocking operations inside; I'd guess that happens very often, on my project almost every time. Yet the wrapper will still create a thread for it. The current implementation works faster because it wraps with a thread only once, on send; this new implementation does it for every receiver, which is overkill. I think it is better not to wrap anything and just run sync code as-is, even in the async implementation. Developers can apply the wrappers themselves where they are actually needed.

kozzztik avatar Nov 20 '20 14:11 kozzztik

Unfortunately we have to think about safety - just running sync code "as is" will break any previous assumptions developers may have made about connections and the blocking nature of things.

One thing we could try - do a single sync_to_async call that wraps all sync code, and vice-versa for async handlers? It's not going to be a massive improvement, but it'll help.
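A sketch of that batching idea (illustrative names; run_in_executor stands in for sync_to_async). One visible trade-off: it reorders dispatch, running all sync receivers before the async ones:

```python
import asyncio

async def dispatch_async(receivers, sender, **kwargs):
    # Batch all sync receivers into a single thread hop instead of one
    # hop per receiver; run_in_executor stands in for sync_to_async.
    sync_recv = [r for r in receivers if not asyncio.iscoroutinefunction(r)]
    async_recv = [r for r in receivers if asyncio.iscoroutinefunction(r)]

    def call_all_sync():
        # Runs in a worker thread; one hop covers every sync receiver.
        return [(r, r(sender=sender, **kwargs)) for r in sync_recv]

    loop = asyncio.get_running_loop()
    responses = await loop.run_in_executor(None, call_all_sync)
    for receiver in async_recv:
        responses.append((receiver, await receiver(sender=sender, **kwargs)))
    return responses
```

Note how the connect order is not preserved in the result: the sync receiver fires first even when it was connected second.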

The main thing we need to be sure of performance-wise is that the synchronous case is not really much slower (i.e. existing projects); I'm curious where the 1.10 multiplier you mentioned in this case comes from. Do you think it's literally the extra call to if asyncio.iscoroutinefunction(receiver): doing it, or something else?

andrewgodwin avatar Nov 21 '20 17:11 andrewgodwin

@andrewgodwin yep, it comes from that call. Since send is originally just something like a map operation, it is very cheap, and adding anything to its logic significantly increases its cost; asyncio.iscoroutinefunction looks like a not-so-cheap operation that is performed on every receiver call. If you have an empty application that does essentially nothing, these calls become noticeable. Also, I ultimately don't understand what sense it makes to write an async receiver for a signal that is sent synchronously. It won't actually run asynchronously, even after async_to_sync; it will just be made to work when it really shouldn't.

What assumptions could be broken? When I write a blocking application, I assume nothing is thread safe; if I have a shared resource, I have to take care of safety myself. But conversely, when I write async code, everything runs in one thread (unless you spawn a separate thread yourself), and between two await calls the code is thread safe "just because". That's why writing on asyncio is so pleasant: you can forget about all thread-safety problems. No locks, no threads, and that gives performance. But with the current implementation the code may behave non-obviously: if you call the sync send with a sync receiver connected, async_to_sync will run things in a parallel thread, making it not thread safe.

In addition, suppose you have async code and simply forgot to send the signal the async way: async view, async receiver, but a sync send. In this case your async receiver will run in a separate loop and thread, which is very unexpected behaviour, as there is really no reason for it. And it will break the receiver's work: it becomes impossible inside the receiver to await any future from the original loop (for example, to reuse an existing connection); that will raise an exception. And it is hidden; you won't find it until something breaks.

It is already hard enough when developing an async application to take care all the time not to do blocking things, as they kill performance. These sync/async transitions add a few new ways to shoot yourself in the foot.

kozzztik avatar Nov 21 '20 18:11 kozzztik

Since send is originally just something like a map operation, it is very cheap, and adding anything to its logic significantly increases its cost; asyncio.iscoroutinefunction looks like a not-so-cheap operation that is performed on every receiver call. If you have an empty application that does essentially nothing, these calls become noticeable.

@kozzztik Why can't we run the asyncio.iscoroutinefunction check at the time of signal registration here and store this information for later use by send? https://github.com/kozzztik/django/blob/3f73f1b71a263ec6d3f7c9b295456a6cf39dc50b/django/dispatch/dispatcher.py#L56
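A sketch of that suggestion, with the registry simplified to a plain list of (receiver, flag) pairs:

```python
import asyncio

class Signal:
    """Runs iscoroutinefunction once at connect() time and caches the
    flag, so send() never pays for the check on every dispatch."""

    def __init__(self):
        self._receivers = []  # list of (receiver, is_async) pairs

    def connect(self, receiver):
        self._receivers.append((receiver, asyncio.iscoroutinefunction(receiver)))

    def send(self, sender, **kwargs):
        responses = []
        for receiver, is_async in self._receivers:
            if is_async:
                # Django would use async_to_sync here; asyncio.run
                # keeps the sketch dependency-free.
                responses.append((receiver, asyncio.run(receiver(sender=sender, **kwargs))))
            else:
                responses.append((receiver, receiver(sender=sender, **kwargs)))
        return responses
```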

Also, I ultimately don't understand what sense it makes to write an async receiver for a signal that is sent synchronously. It won't actually run asynchronously, even after async_to_sync; it will just be made to work when it really shouldn't.

Agreed, there's no performance gain at all and I believe the end developer may have done that by mistake. But it may very well be that some registered receivers for a given signal are sync and others async. That case is a bit tricky, IMO, since we can't expect developers to update their usage everywhere. So I guess that check needs to stay, even if it's redundant in terms of performance.

shivanshs9 avatar Dec 03 '20 08:12 shivanshs9

Hey ) I found performance a very interesting subject, so I decided to approach it more carefully, and got interesting results. I wrote a new test that is much more accurate and checks performance in additional situations. The test runs the Django app in a separate process to be sure the testing code does not affect the measured performance. It also uses a network-protocol abstraction that makes it possible to test different WSGI/ASGI implementations correctly in the same environment, so they can be fairly compared. Both sides, test client and server, run on asyncio to make efficient parallel requests possible.

I ran tests for the following implementations:

  1. WSGI (multithread) - emulates a WSGI server for the Django WSGI app. To support parallel requests, every request goes through loop.run_in_executor, so it has a thread pool as a normal WSGI server would.
  2. WSGI (single thread) - a "control" implementation that does not use loop.run_in_executor and just handles each request as-is, blocking the server loop for the duration of the response. The idea is to check how much performance supporting a thread pool costs.
  3. ASGI - the Django ASGI implementation, run in the server loop through await, as an ASGI server usually does.
  4. FAST_ASGI - as I had some concerns about the general ASGI handler, I wrote an alternative handler implementation. It has no middleware adaptation (it just raises an exception if the middleware does not support async) and does not use sync_to_async for sending signals. So it has only one adaptation: loop.run_in_executor for running a sync view. The idea is the same as with the single-threaded WSGI: to measure the performance effect of all this machinery.

For these implementations I ran four types of test. Each test makes 10,000 requests to a view URL with 200 signal receivers added. Each receiver has its own call counter to check that everything works correctly. Each test is repeated 10 times to check that it gives similar timings every time; tests that run too long (~10 minutes) are run only 3 times. In the tests I used an asgiref package with the freeze problem fixed. The test types:

  1. Calling an async view in one thread.
  2. Calling a sync view in one thread.
  3. Calling an async view from 100 parallel threads. The view does a 0.05s async sleep to emulate a blocking operation.
  4. Calling a sync view from 100 parallel threads. The view does a 0.05s blocking sleep to emulate a blocking operation.

All tests were run for master and for this branch. That gives 4 implementations * 4 test types * 2 branches = 32 test results. I also ran all the tests for this branch again, but with async signal receivers, which gives 48 test situations in total. For the parallel tests, as they have sleeps inside, I calculated "pure timings" to get the pure implementation overhead, comparable with the tests without sleeps. It is calculated as: timing - 0.05s * 10,000 requests / 100 threads.
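Spelled out, that "pure timing" normalisation works like this (constants taken from the setup described above):

```python
# Subtract the wall time contributed by the emulated sleeps, which
# overlap across the 100 parallel threads, leaving only the
# implementation overhead.
SLEEP = 0.05        # seconds of emulated blocking per request
REQUESTS = 10_000
THREADS = 100

def pure_timing(measured_seconds):
    # 0.05s * 10,000 requests / 100 threads = 5 seconds of wall time
    # spent sleeping.
    return measured_seconds - SLEEP * REQUESTS / THREADS

print(pure_timing(12.0))  # e.g. a measured 12s leaves ~7s of overhead
```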

Test results are here, on the "parallel" page.

Results summary:

  1. For the master branch:
  • After the asgiref fix, the ASGI implementation became stable. However, something is still wrong with the async_to_sync transition; it still freezes sometimes. Not sure, but it looks like a platform-specific problem of asyncio itself. That is just a guess, though.
  • Even when async_to_sync does not freeze, it does not work under the server loop. Everything is fine in the "normal" WSGI, as it uses separate threads for requests, but the single-threaded implementation does not work with any async stuff. That is not a problem in itself, as it is just a test implementation, but it creates a strange assumption about WSGI server implementations: that they must not have an active loop in the request thread. It gets more interesting when async signal receivers are added. Only the ASGI implementation works correctly with them, and only for one signal, request_started; in all other situations an exception is raised. As I understand it, the original idea of async_to_sync was to run async receivers in any sync environment, but it looks like that does not hold up in practice.
  • The WSGI implementation is slow at processing async requests (12 seconds, versus 7.6s for ASGI processing sync requests). I guess that if the freeze problem gets fixed, this will improve too.
  • The adaptations cost a lot of performance. The implementation without adaptations is about 1.5-2 times faster on both sync and async requests.
  • Under parallel execution the ASGI implementation is extremely slow on sync views; in timings it is slower than the single-threaded blocking WSGI implementation. I finally tracked the problem down to this line: https://github.com/django/asgiref/blob/ee967d2e3be77bb4c5082542a538e1c77a37a031/asgiref/sync.py#L278 . It creates only one thread for processing sync views. As I understand it, the idea here is thread safety, but in the end it makes ASGI a single-threaded blocking thing, completely killing both the threading and the async performance benefits.
  2. On this branch:
  • In the single-threaded tests, performance is much worse in all implementations, about +30% on the timings.
  • On the ASGI implementation this branch is 15x slower with sync receivers. The problem is the bottleneck described above, plus a thread task being created for every signal receiver call. It also introduces an async "break point" that returns flow control to the loop on every call.
  • What is really strange is that under parallel execution performance is a little bit faster (<1%), while single-threaded is much slower. I don't know why parallel execution has this effect; maybe some Python-internal optimisation kicks in, or some optimisation in modern processors (I ran it on one of the latest Ryzen CPUs), but something "eats" all the overhead. It happens even on the "fast ASGI" implementation, which is pure asyncio and fully single-threaded, meaning there are no truly "parallel" calls. I can only guess that in the single-threaded test the server sleeps between calls, waiting for the client to receive the answer and send the next request; that triggers context switching, and maybe the reason lies somewhere there. But that is only a guess. It is a very strange result; I checked three times to make sure the code from the correct branch was loaded, but it is real.

@shivanshs9 - good idea! I will go further and try to adapt for both sync and async at signal connection time. That should have minimal overhead.

However, it looks like mixing sync and async together is not a good idea; they are opposite approaches that were never meant to work together. What do you think, guys, about an alternative idea: on a sync signal send, call only the sync receivers, and on an async send, only the async ones. That way, if you are not sure how a signal will be sent, you can write a correct reaction for both sync and async and connect both; only the one that is needed will be called.
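Sketched with illustrative names (this is a sketch of the proposal only, not of anything merged):

```python
import asyncio

class Signal:
    """Sync send() dispatches only to sync receivers, send_async()
    only to async ones, so an author unsure how a signal will be sent
    can connect one of each and exactly the matching receiver runs."""

    def __init__(self):
        self._receivers = []  # list of (receiver, is_async) pairs

    def connect(self, receiver):
        self._receivers.append((receiver, asyncio.iscoroutinefunction(receiver)))

    def send(self, sender, **kwargs):
        # Skip async receivers entirely: no wrapping, no threads.
        return [(r, r(sender=sender, **kwargs))
                for r, is_async in self._receivers if not is_async]

    async def send_async(self, sender, **kwargs):
        # Skip sync receivers entirely: no sync_to_async hops.
        return [(r, await r(sender=sender, **kwargs))
                for r, is_async in self._receivers if is_async]
```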

kozzztik avatar Dec 12 '20 13:12 kozzztik

I've opened a new PR here to pick up where this left off: https://github.com/django/django/pull/16547

bigfootjon avatar Feb 12 '23 19:02 bigfootjon