
Support for Async Django

Open jaw9c opened this issue 2 years ago • 46 comments

Feedback requested.

This PR adds support for Django async views. The general strategy follows the same principle as the base graphene library's async support. It implements the following:

  • Django-based resolvers now check whether they are being executed in an async context and, if so, wrap any ORM queries in sync_to_async (a minimal sketch follows after this list)
  • Resolvers now follow the same principles as the base graphene library to handle async and sync resolution agnostically.
  • A new async view - credit to this PR
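
A minimal sketch of that first point, assuming a plain function resolver and an example Reporter model (neither is the PR's actual code): detect whether an event loop is running and, if so, hand the ORM work to a worker thread via sync_to_async so the resolver returns an awaitable.

import asyncio

from asgiref.sync import sync_to_async

from myapp.models import Reporter  # assumed example model


def resolve_reporters(root, info):
    def fetch():
        # ORM evaluation has to stay in a sync context.
        return list(Reporter.objects.all())

    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No event loop: plain sync execution (e.g. WSGI).
        return fetch()
    # Event loop running (e.g. ASGI): return an awaitable instead.
    return sync_to_async(fetch)()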

The current progress is as follows:

  • [x] Get the library working for all custom fields in graphene-django when running under async
  • [x] Duplicate the existing tests and edit them to have an async version

Other thoughts:

  • It's annoying to wrap all resolvers in sync_to_async if they perform ORM operations. It would be nice to detect this error and raise a warning for existing codebases (a sketch of what that could look like follows below).
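
One hedged sketch of what such a warning could look like (not implemented in this PR; the middleware name is made up): a graphene class-based middleware that catches SynchronousOnlyOperation and names the resolver that needs wrapping.

import warnings

from django.core.exceptions import SynchronousOnlyOperation


class SyncResolverWarningMiddleware:
    # Standard graphene middleware hook: resolve(next, root, info, **kwargs).
    def resolve(self, next, root, info, **kwargs):
        try:
            return next(root, info, **kwargs)
        except SynchronousOnlyOperation:
            warnings.warn(
                f"Resolver for '{info.field_name}' ran sync ORM code in an "
                "async context; wrap it in sync_to_async."
            )
            raise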

jaw9c avatar Apr 02 '23 02:04 jaw9c

Progress update: All the field resolvers now support both being executed in an async context and wrapping Django sync code in a sync context. I've made the assumption that any resolving of DjangoXType/Field fields will need to be executed in a sync context, which I think is pretty safe. I've started running through all the tests and adding an extra assertion that running in an async executor produces the same result as the sync one.

Currently, people switching to the async view will need to wrap all their top-level resolvers (those directly on the Query class passed into the schema) in @sync_to_async if they run sync code, as sketched below. If someone has an idea of how to prevent this that would be great, but I can't seem to wrap my head around it!
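
For illustration, a hedged sketch of one way to do that wrapping on a top-level resolver; here sync_to_async is applied inside an async resolver rather than as a decorator, and the Reporter model and field name are assumed examples.

import graphene
from asgiref.sync import sync_to_async

from myapp.models import Reporter  # assumed example model


class Query(graphene.ObjectType):
    reporter_names = graphene.List(graphene.String)

    async def resolve_reporter_names(root, info):
        # Building the queryset is lazy; evaluating it must happen in a
        # sync context, so list() is pushed to a worker thread.
        return await sync_to_async(list)(
            Reporter.objects.values_list("first_name", flat=True)
        )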

Next steps are to run through all the test files and add the helper.

It would be great if people could test this on their production projects (specifically when running under ASGI), as I'm interested in how swapping in and out of the sync context affects execution performance.

jaw9c avatar May 05 '23 10:05 jaw9c

This is great news! I'll discuss with my team trying this in production once it's ready!

firaskafri avatar May 05 '23 11:05 firaskafri

@firaskafri Ready to go for reviews on this one! I managed to solve the problem of wrapping everything in sync_to_async by utilising a middleware. If an implementing project would like to use async, they only need to do the following:

  • Swap out to using the Async View (see the wiring sketch after this list)
  • Add the Async middleware
  • Check their schema for any types not extending DjangoObjectType that have a resolver using the sync parts of Django (including the top-level Query type).
    • If these exist, they will raise a SynchronousOnlyOperation exception when called.
    • To fix this, the user has to wrap the resolver in sync_to_async.
  • 🚀 Done, enjoy the speedups!!
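
A hedged wiring sketch of those steps, assuming the view is exposed as AsyncGraphQLView (the name used later in this thread); the exact import location and the middleware dotted path are placeholders, so check the branch for the real ones.

# urls.py
from django.urls import path
from django.views.decorators.csrf import csrf_exempt

from graphene_django.views import AsyncGraphQLView  # assumed import location

urlpatterns = [
    path("graphql/", csrf_exempt(AsyncGraphQLView.as_view(graphiql=True))),
]

# settings.py
GRAPHENE = {
    "SCHEMA": "myproject.schema.schema",
    # Placeholder dotted path for the async middleware added by this PR.
    "MIDDLEWARE": ["myproject.middleware.SyncToAsyncMiddleware"],
}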

Looking forward to feedback!

jaw9c avatar May 16 '23 20:05 jaw9c

GraphiQL wasn't working for me on this branch, so I pushed a small change here https://github.com/UpliftAgency/graphene-django/commits/support-async (specific commit https://github.com/UpliftAgency/graphene-django/commit/8368ae2d86604995ef6ccd051d8b724b69a5bc35)

So far tested it with data loaders for a sample app and everything seems to be working. Thanks for your work on this!

pcraciunoiu avatar May 26 '23 14:05 pcraciunoiu

I spoke too soon; there is one issue I need to narrow down: sometimes related fields try to use the sync context and break. E.g. I have a blog that has articles, and each article has an author. Sometimes article.author seems to get resolved synchronously using attr_resolver, and that doesn't work in an async context.

I haven't tried to isolate the issue yet because it doesn't always happen, but it could be fine to worry about it in a follow-up and get this merged!
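
A hedged workaround sketch for that symptom (app, model, and field names are assumed): prefetch the relation in the type's get_queryset so article.author is already loaded and the attribute access never triggers a lazy ORM query from the async path.

from graphene_django import DjangoObjectType

from blog.models import Article  # assumed example app/model


class ArticleType(DjangoObjectType):
    class Meta:
        model = Article
        fields = ("id", "title", "author")

    @classmethod
    def get_queryset(cls, queryset, info):
        # select_related loads the author in the same query, so resolving
        # article.author is plain attribute access rather than a lazy DB hit.
        return queryset.select_related("author")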

pcraciunoiu avatar May 26 '23 14:05 pcraciunoiu

I think I will start testing on a large project next week; I'll get back with feedback soon!

firaskafri avatar May 26 '23 20:05 firaskafri

Hi team! Thank you all for your great input and initiative! Could I ask what the plans are for releasing this feature? Is it feasible to release it in one to two months?

dima-kov avatar Jun 11 '23 08:06 dima-kov

@firaskafri

I think I will start testing on a large project next week; I'll get back with feedback soon!

Any news on testing?

dolgidmi avatar Jul 10 '23 10:07 dolgidmi

Thanks for the input and testing guys! Super appreciated. Apologies for the lack of progress on this from me! Have been in firefighting mode recently and this slipped down the priorities.

The state of play from before is that I was testing the performance of async vs sync, both running under ASGI and WSGI. During this I found a bug which made the app hang when running async on ASGI locally; I was struggling to debug this and it's the first thing to focus on next.

If you'd like to help, the best way is to test on your own projects and gather some relative performance stats under load. I'll likely build a tool for my own projects that I can put in a gist, but that will take time compared to testing manually, and I'd like to squash any complexity/bugs ASAP.

TL;DR: sorry for pausing work on this; if you want to help progress it, please test it on your own projects and report any issues here.

jaw9c avatar Jul 19 '23 08:07 jaw9c

Good news: I fixed the hang issue. Bad news: I wasted a lot of time figuring out the cause was just an async-incompatible middleware in our work environment. I've run some preliminary load testing on our staging environment and got 2x throughput on resolving a queryset. Next up is some analysis of run times; I'll post these in a digestible format, hopefully by next week. Keen to get this merged soon.

jaw9c avatar Jul 24 '23 19:07 jaw9c

I faced exactly the same issue: having just one sync middleware makes Django hang on async handlers.

dima-kov avatar Jul 24 '23 19:07 dima-kov

@jaw9c Thanks a lot for opening this up! It would be really, really nice if this could get reviewed and merged soon, as this is blocking us from being able to upgrade to the latest version of graphene-django.

cpd67 avatar Aug 12 '23 23:08 cpd67

@jaw9c any updates?

firaskafri avatar Oct 01 '23 16:10 firaskafri

Really need this!

pfcodes avatar Dec 21 '23 08:12 pfcodes

Hey mates, ready to donate some bucks via buymeacoffee if you make this PR go live!

dima-kov avatar Jan 07 '24 09:01 dima-kov

Hey mates, ready to donate some bucks via buymeacoffee if you make this PR go live!

I'm down to throw some $ too. Maintainers, please consider enabling sponsorship for this project if it means that things can move more quickly. A lot of people are dependent on this.

pfcodes avatar Jan 08 '24 10:01 pfcodes

Migrating to v3 without making sure dataloaders and everything else work means people need to adopt new libraries (check @superlevure's comments above).

alexandrubese avatar Jan 09 '24 19:01 alexandrubese

@alexandrubese so what's the way forward?

dima-kov avatar Jan 09 '24 19:01 dima-kov

@dima-kov Use an older version of graphene-django where things work?

Which might not be ideal, because you're basically just adding "legacy code" that you will need to update once they make v3 work (although this has been discussed for almost a year; I don't know when it will work and when they will fix dataloaders).

I don't understand the intricacies of the v3 update; maybe they had to do it and that's why it was forced?

Or use a different library

PS: I don’t know why the Django team didn’t work to add proper GraphQL support.

Spring, .NET MVC and many other "web frameworks" did it.

alexandrubese avatar Jan 09 '24 19:01 alexandrubese

Regarding the comment from @superlevure:

I agree with releasing AsyncGraphQLView.

We are almost there; the only remaining tasks are performance benchmarks and documentation updates. This implies that most of the work is already done and only a small percentage is left.

@jaw9c, it seems you might be short on time for this. I'm willing to contribute effort at this point.

@superlevure, to be transparent, I haven't conducted performance testing before, but I'm attempting it now. Any suggestions on how to approach this appropriately would be highly appreciated.

dima-kov avatar Feb 01 '24 18:02 dima-kov

Benchmarks: 16.547 seconds vs 83.194 seconds for I/O-bound queries. ~5x faster; async wins.

Details are here: https://github.com/dima-kov/django-graphene-benchmarks?tab=readme-ov-file#tldr

Async:

Concurrency Level:      10
Time taken for tests:   16.547 seconds
Complete requests:      1000

vs

Sync

Concurrency Level:      10
Time taken for tests:   83.194 seconds
Complete requests:      1000

@superlevure are we fine now? What should the next steps be to make this public?

dima-kov avatar Feb 02 '24 22:02 dima-kov

@dima-kov shouldn't you run the sync benchmark with 10 workers instead of 4? I know that measuring sync vs async is hard, but with 10 workers we would have a consistent number of requests handled at the same time by the server, so we should get an actual comparison. I guess we would get similar results, which is fine, as the biggest benefit of using async is fewer resources used.

kamilglod avatar Feb 03 '24 02:02 kamilglod

Hmm, that would eat lots of memory. From the docs:

Gunicorn should only need 4-12 worker processes to handle hundreds or thousands of requests per second. Gunicorn relies on the operating system to provide all of the load balancing when handling requests. Generally we recommend (2 x $num_cores) + 1 as the number of workers to start off with.
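
For reference, that rule of thumb translates to a gunicorn config along these lines (a sketch; the uvicorn worker class line is only an assumption about how the async/ASGI runs would be served).

# gunicorn.conf.py
import multiprocessing

workers = multiprocessing.cpu_count() * 2 + 1  # (2 x $num_cores) + 1
# For the ASGI (async) runs, an ASGI worker class would be used instead, e.g.:
# worker_class = "uvicorn.workers.UvicornWorker"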

I'm running on a 4-core machine, so trying gunicorn with 9 workers I got only a slightly better result:

Concurrency Level:      10
Time taken for tests:   80.785 seconds
Complete requests:      1000
Failed requests:        0

and memory usage was 9 processes at ~80mb each = ~720mb. The bottleneck here might be the sqlite3 database.


vs async:

Concurrency Level:      10
Time taken for tests:   17.586 seconds
Complete requests:      1000
Failed requests:        0

and memory usage was 1 process at ~60mb.

dima-kov avatar Feb 03 '24 08:02 dima-kov

The SQLite docs say that concurrent writes are locked, but reads should be fine, so I think it's not a problem with SQLite itself.

Maybe it's because of the differences between the sync and async resolvers? In the async one you're fetching the related octopus_type, in the sync one you're not. https://github.com/dima-kov/django-graphene-benchmarks/blob/main/project/api/schema/queries.py#L43-L51 https://github.com/dima-kov/django-graphene-benchmarks/blob/main/project/api/schema_async/queries.py#L48-L64

kamilglod avatar Feb 07 '24 07:02 kamilglod

Oh, shame on me. Here is the fixed comparison:

Comparison

                    Sync      Async     Sync      Async
Requests            1000      1000      1000      1000
Concurrency         100       100       100       100
Processes           9         1         4         1
Threads per proc    1         100       1         100
Mem                 ~720mb    ~80mb     ~320mb    ~80mb
Time                23.384s   13.411s   24.465s   13.670s

dima-kov avatar Feb 07 '24 17:02 dima-kov

Now it can be asserted that releasing the async version will result in roughly a twofold speedup for I/O-bound endpoints.

Moreover, take a look at this much fairer (same resources) comparison:

                      Sync      Async
Requests              1000      1000
Concurrency           100       100
Processes             9         9
Threads per process   1         1-30
Mem                   ~720mb    ~720mb
Time                  23.384s   4.719s

That's a ~5x speedup with identical resource utilization, threads aside!

dima-kov avatar Feb 07 '24 17:02 dima-kov

Guys, we really need this (async resolvers). Please tell us how we can help to make it public; I do not want to start using it unless I'm sure it will be merged into main.

Or at least let us know that you are going to release this, but that we are still missing: 1, 2, 3...

dima-kov avatar Feb 08 '24 20:02 dima-kov

Thanks for the benchmarks, I'll review the PR again tomorrow. Note that I don't have merge rights on this repo; we'll need a review from one of @firaskafri, @sjdemartini, @kiendang (or others).

superlevure avatar Feb 08 '24 20:02 superlevure

@superlevure I think it is good to go as soon as we clarify the docs. What do you think, @kiendang @jaw9c @sjdemartini?

firaskafri avatar Feb 09 '24 08:02 firaskafri

@dima-kov I had a look at your benchmarks and I have a few remarks.

First, it looks like you're comparing the sync and async resolvers on the same branch of graphene-django (this PR's branch). I believe it would actually be fairer to compare the async/sync versions of this branch against the sync version of graphene-django's main branch, since this PR also affects sync resolvers and the point of the benchmarks is also to make sure no performance penalty is introduced to already existing code.

Second, I noticed the sync resolver is returning 500 objects while the async version is only returning 10 objects.

I took the liberty of pushing a PR to your repo that covers those points, as well as setting up a Docker-based environment and Postgres as the DB to be closer to a real-life use case.

I obtain the following results:

# Sync version [main]
Concurrency Level:      100
Time taken for tests:   45.517 seconds
Complete requests:      1000
Failed requests:        0
Non-2xx responses:      1000
Total transferred:      60232000 bytes
Total body sent:        238000
HTML transferred:       59963000 bytes
Requests per second:    21.97 [#/sec] (mean)
Time per request:       4551.739 [ms] (mean)
Time per request:       45.517 [ms] (mean, across all concurrent requests)

# Sync version [this branch]
Concurrency Level:      100
Time taken for tests:   203.300 seconds
Complete requests:      1000
Failed requests:        0
Total transferred:      45805000 bytes
Total body sent:        244000
HTML transferred:       45382000 bytes
Requests per second:    4.92 [#/sec] (mean)
Time per request:       20330.019 [ms] (mean)
Time per request:       203.300 [ms] (mean, across all concurrent requests)

# Async version [this branch]
# didn't run till the end, see below

Data tends to show a huge perf hit for the sync version of this PR. Concerning the async version, I ran into the following DB error, which makes all requests fail until I restart the DB:

2024-02-11 16:31:39.310 UTC [4579] FATAL:  sorry, too many clients already

I suspect the async version is leaving DB connections open somewhere (Postgres' max_connections setting is set to 100, which is the concurrency level used in the benchmark).
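
If that suspicion is right, a hedged diagnostic sketch (not part of this PR, and where to hook it in is an assumption) would be to close stale connections explicitly after each operation and see whether the open-connection count stops growing.

from asgiref.sync import sync_to_async
from django.db import close_old_connections


async def cleanup_db_connections():
    # close_old_connections() touches the DB layer, so it must run in a
    # sync context; sync_to_async pushes it onto a worker thread.
    await sync_to_async(close_old_connections)()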

I'll continue to play with the PR a bit tonight to make sure there's nothing wrong with my setup, but I'm curious if others can reproduce similar results.

superlevure avatar Feb 11 '24 16:02 superlevure