Use `asyncio` synchronization instead of `anyio`

Open MarkusSintonen opened this issue 1 year ago • 37 comments

Summary

First PR in a series of optimizations to improve the performance of httpcore and even reach the performance level of aiohttp. Related discussion: https://github.com/encode/httpx/issues/3215#issuecomment-2157885121

Previously: (benchmark screenshot)

With PR: (benchmark screenshot)

Request latency is more stable and the overall duration of the benchmark improves by 3.5x.

The difference diminishes once server latency rises above roughly 100 ms.

Checklist

  • [x] I understand that this PR may be closed in case there was no previous discussion. (This doesn't apply to typos!)
  • [x] I've added a test for each change that was introduced, and I tried as much as possible to make a single atomic change.
  • [x] I've updated the documentation accordingly.

MarkusSintonen avatar Jun 10 '24 13:06 MarkusSintonen

@MarkusSintonen Thanks so much for your work on this. Could we isolate the benchmarking into a separate PR?

lovelydinosaur avatar Jun 10 '24 13:06 lovelydinosaur

@MarkusSintonen Thanks so much for your work on this. Could we isolate the benchmarking into a separate PR?

Added here https://github.com/encode/httpcore/pull/923

MarkusSintonen avatar Jun 10 '24 13:06 MarkusSintonen

Great. Also, does this mean we no longer need the anyio dependency for the httpx[asyncio] installation on Python >= 3.11?

Yes, that's right, once we also bring back the native asyncio-based network backend in _backends.

Actually, I've now fixed the code so that anyio is not required on Python < 3.11 either.

MarkusSintonen avatar Jun 11 '24 07:06 MarkusSintonen

Yes, that's right, once we also bring back the native asyncio-based network backend in _backends.

I'll wait for you to create a PR from your commit so it can be tracked and reviewed separately.

T-256 avatar Jun 11 '24 12:06 T-256

This PR highlights to me that the approach to shielding here is wrong. We're using cancellation shielding solely to protect connection closes. Rather than use shielding, what we actually ought to be doing is...

  • Unconditionally marking the connection state as closed. (Doesn't require I/O, can't be cancelled.)

And then...

  • Performing the network stream close. (If a cancellation occurs here, it's reasonable that SSL connections might not be "gracefully" closed.)

This is a bit fiddly to get right, needs some careful attention to detail.
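
In rough sketch form, the ordering would be something like the following (illustrative names only, not httpcore's actual classes):

    import enum

    class ConnectionState(enum.Enum):
        OPEN = enum.auto()
        CLOSED = enum.auto()

    class Connection:
        # Illustrative sketch, not httpcore's actual implementation.
        def __init__(self, network_stream) -> None:
            self._network_stream = network_stream
            self._state = ConnectionState.OPEN

        async def aclose(self) -> None:
            # 1. Synchronous state change: no await here, so a cancellation
            #    can never leave the connection looking open.
            self._state = ConnectionState.CLOSED
            # 2. Cancellable I/O: if this gets cancelled, an SSL stream may
            #    not shut down gracefully, but the bookkeeping above has
            #    already happened.
            await self._network_stream.aclose()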

lovelydinosaur avatar Jun 13 '24 10:06 lovelydinosaur

@tomchristie do you mean the previous context-manager-based shielding also had issues, requiring additional reordering of the closing handlers? If it didn't have issues, then we could still use the anyio CancelScope just for shielding? But that means the anyio dependency can't be fully removed.

MarkusSintonen avatar Jun 13 '24 15:06 MarkusSintonen

Or, yeah, did you mean that the non-I/O state management should be extracted out of the async close that is shielded with asyncio.shield? That's probably right.

MarkusSintonen avatar Jun 13 '24 15:06 MarkusSintonen

@MarkusSintonen I've attempted to answer my intent here through implementation. See #927. (Currently draft, and approach would need to be applied also to the HTTP/2 implementation, and the closing behaviour in the connection pool)

lovelydinosaur avatar Jun 13 '24 16:06 lovelydinosaur

@MarkusSintonen I've attempted to answer my intent here through implementation. See #927. (Currently draft, and approach would need to be applied also to the HTTP/2 implementation, and the closing behaviour in the connection pool)

Big thanks! Now I understand: it was indeed about "separating" the sync state management from the async I/O part. I.e. the stream's aclose is the cancellable part, while the state management, being separate from the async flow, cannot be cancelled.

MarkusSintonen avatar Jun 13 '24 17:06 MarkusSintonen

Would you mind testing my latest PR branch (https://github.com/agronholm/anyio/pull/761)? In my own testing, it about halved the performance overhead of anyio.Lock on asyncio, and was actually faster than asyncio's own native Lock class.

agronholm avatar Jul 18 '24 21:07 agronholm

Would you mind testing my latest PR branch (agronholm/anyio#761)? In my own testing, it about halved the performance overhead of anyio.Lock on asyncio, and was actually faster than asyncio's own native Lock class.

Nice, thank you! I can try sometime next week.

MarkusSintonen avatar Jul 19 '24 09:07 MarkusSintonen

Would you mind testing my latest PR branch (https://github.com/agronholm/anyio/pull/761)? In my own testing, it about halved the performance overhead of anyio.Lock on asyncio, and was actually faster than asyncio's own native Lock class.

@agronholm I ran the tests and it is indeed better with the fix, but it seems there is still some overhead in the implementation compared to asyncio:

Baseline (AnyIO 4.4.0): (benchmark screenshot)

AnyIO PR: (benchmark screenshot)

Asyncio: (benchmark screenshot)

(The above tests were run with the httpcore optimizations from the other pending PRs applied, so that httpcore's own issues don't skew the results.)

MarkusSintonen avatar Jul 22 '24 18:07 MarkusSintonen

Would you mind testing my latest PR branch (agronholm/anyio#761)? In my own testing, it about halved the performance overhead of anyio.Lock on asyncio, and was actually faster than asyncio's own native Lock class.

@agronholm I ran the tests and it is indeed better with the fix, but it seems there is still some overhead in the implementation compared to asyncio:

Baseline (AnyIO 4.4.0): (benchmark screenshot)

AnyIO PR: (benchmark screenshot)

Asyncio: (benchmark screenshot)

(The above tests were run with the httpcore optimizations from the other pending PRs applied, so that httpcore's own issues don't skew the results.)

Have you identified where this overhead is coming from? Can we eliminate synchronization as the cause after that PR?

agronholm avatar Jul 23 '24 09:07 agronholm

If you run the benchmarks after replacing the locks and semaphores in the asyncio implementation with the corresponding AnyIO primitives, that should shed some light on the issue.

agronholm avatar Jul 23 '24 09:07 agronholm

If you run the benchmarks after replacing the locks and semaphores in the asyncio implementation with the corresponding AnyIO primitives, that should shed some light on the issue.

It seems to be anyio.Lock. When I replace just that with asyncio.Lock, the issue disappears (with everything else still using the anyio variants). (HTTP/2 also uses the Semaphore, but I haven't tested that here.)

MarkusSintonen avatar Jul 23 '24 14:07 MarkusSintonen

In the AnyIO variant, the acquisition of an unowned lock would be slower because this operation contains a yield point unlike in the asyncio variant. Would you mind testing against AnyIO lock after commenting out the checkpoint calls around lines 1690-1697 in backends/_asyncio.py?

agronholm avatar Jul 23 '24 14:07 agronholm

In the AnyIO variant, the acquisition of an unowned lock would be slower because this operation contains a yield point unlike in the asyncio variant. Would you mind testing against AnyIO lock after commenting out the checkpoint calls around lines 1690-1697 in backends/_asyncio.py?

Indeed, it is happening because of the await AsyncIOBackend.cancel_shielded_checkpoint() (at line 1694)

MarkusSintonen avatar Jul 23 '24 16:07 MarkusSintonen

Yeah, so the problem here is that this yield point is needed to maintain the same semantics as with Trio. This is Trio's Lock.acquire():

    @enable_ki_protection
    async def acquire(self) -> None:
        """Acquire the lock, blocking if necessary."""
        await trio.lowlevel.checkpoint_if_cancelled()
        try:
            self.acquire_nowait()
        except trio.WouldBlock:
            # NOTE: it's important that the contended acquire path is just
            # "_lot.park()", because that's how Condition.wait() acquires the
            # lock as well.
            await self._lot.park()
        else:
            await trio.lowlevel.cancel_shielded_checkpoint()

If there wasn't a yield point, then this could become a busy-wait loop:

    async def loop(lock):
        while True:
            async with lock:
                print("I'm holding the lock")
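
A minimal standalone sketch of that starvation effect (assuming anyio >= 4 is installed; the names here are illustrative and not from httpcore):

    import asyncio
    import anyio

    async def demo(name: str, lock) -> None:
        progress = 0

        async def other() -> None:
            # Record how far the tight loop had gotten before we were scheduled.
            print(f"{name}: other() ran after {progress} acquisitions")

        async def hog() -> None:
            nonlocal progress
            for _ in range(10_000):
                async with lock:  # uncontended acquire/release
                    progress += 1

        hog_task = asyncio.create_task(hog())
        asyncio.create_task(other())
        await hog_task

    async def main() -> None:
        # asyncio.Lock: no yield point on an uncontended acquire, so other()
        # is starved until hog() has finished every iteration.
        await demo("asyncio.Lock", asyncio.Lock())
        # anyio.Lock: yields on each acquire, so other() runs almost at once.
        await demo("anyio.Lock", anyio.Lock())

    asyncio.run(main())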

agronholm avatar Jul 23 '24 17:07 agronholm

If there wasn't a yield point, then this could become a busy-wait loop:

So is this needed for the CancelScope logic in AnyIO? What if nothing in the caller stack is using a CancelScope? Is the penalty avoidable if the user is not actually using CancelScopes?

MarkusSintonen avatar Jul 23 '24 17:07 MarkusSintonen

No, the checkpoint_if_cancelled() call handles the CancelScope support. The cancel_shielded_checkpoint() is what prevents the busy-wait loop, which will happen if you run the above loop with asyncio primitives, as they weren't designed with the right principles.

agronholm avatar Jul 23 '24 17:07 agronholm

To summarize, asyncio's incorrect design makes it faster here.

agronholm avatar Jul 23 '24 17:07 agronholm

OK, I see, thanks for the explanation! So in essence anyio.Lock can be used to work around the issue (do the Python core devs know about it? :D). httpcore doesn't use _synchronization.AsyncLock in such ways, so it seems safe to replace anyio.Lock with asyncio.Lock and not pay the penalty.

MarkusSintonen avatar Jul 23 '24 17:07 MarkusSintonen

OK, I see, thanks for the explanation! So in essence anyio.Lock can be used to work around the issue (do the Python core devs know about it? :D). httpcore doesn't use _synchronization.AsyncLock in such ways, so it seems safe to replace anyio.Lock with asyncio.Lock and not pay the penalty.

Only in the asyncio backend. In the AnyIO backend, you would need to use AnyIO primitives, lest you render the whole backend pointless.
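
A hedged sketch of what that could look like in a _synchronization-style wrapper (illustrative names, not httpcore's actual API): pick the primitive lazily based on the running async library, so the asyncio path avoids the extra checkpoint while other backends keep the AnyIO primitive.

    import asyncio
    import anyio
    import sniffio

    class AsyncLock:
        # Illustrative sketch only: choose the lock implementation lazily,
        # based on whichever async library is actually running.
        def __init__(self) -> None:
            self._lock = None

        def _setup(self) -> None:
            if sniffio.current_async_library() == "asyncio":
                # Plain asyncio.Lock: no checkpoint on an uncontended acquire.
                self._lock = asyncio.Lock()
            else:
                # Under trio (or anything else), keep the AnyIO primitive.
                self._lock = anyio.Lock()

        async def __aenter__(self) -> "AsyncLock":
            if self._lock is None:
                self._setup()
            await self._lock.acquire()
            return self

        async def __aexit__(self, exc_type, exc_value, traceback) -> None:
            self._lock.release()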

agronholm avatar Jul 23 '24 17:07 agronholm

So in essence anyio.Lock can be used to work around the issue (do the Python core devs know about it? :D).

Asyncio has so many design issues (the ill-fated uncancellation mechanism being the latest example) that at this point it's beyond saving. Even GvR (the original author of asyncio) told me to my face that he'd prefer people start using something else.

agronholm avatar Jul 23 '24 18:07 agronholm

What is the status of this PR? Should it be reviewed+merged, closed, or something else?

rattrayalex avatar Aug 09 '24 06:08 rattrayalex

What is the status of this PR? Should it be reviewed+merged, closed, or something else?

Perhaps we should see how https://github.com/agronholm/anyio/pull/761 affects the real-world performance before jumping the gun?

agronholm avatar Sep 01 '24 10:09 agronholm

What is the status of this PR? Should it be reviewed+merged, closed, or something else?

Perhaps we should see how agronholm/anyio#761 affects the real-world performance before jumping the gun?

Also posting here: I reran the benchmarks with the above AnyIO fix, using fast_acquire=True for the Lock (and Semaphore) in httpcore. The results look pretty good with the AnyIO fix and fast_acquire=True.

First, results with all the optimization PRs applied (so other httpcore issues don't factor in).

With fast_acquire=False: (benchmark screenshot)

With fast_acquire=True: (benchmark screenshot)

Lastly, results without the optimization PRs applied (how it would look against current httpcore master, with its other issues).

With fast_acquire=False: (benchmark screenshot)

With fast_acquire=True: (benchmark screenshot)

MarkusSintonen avatar Sep 02 '24 17:09 MarkusSintonen

Should we keep AnyIO but instead use the fast_acquire=True mode in _synchronization.py? Cc @tomchristie
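
In practice that would roughly just mean passing the new flag where the lock is created, something like this sketch (illustrative names, not the actual _synchronization.py code; requires AnyIO >= 4.5):

    import anyio

    class AsyncLock:
        # Sketch of an httpcore-style wrapper; the AnyIO lock is created
        # lazily so the wrapper can be constructed outside an event loop.
        def __init__(self) -> None:
            self._lock = None

        async def __aenter__(self) -> "AsyncLock":
            if self._lock is None:
                # fast_acquire=True skips the scheduler yield on an
                # uncontended acquire, which is where the measured
                # overhead came from.
                self._lock = anyio.Lock(fast_acquire=True)
            await self._lock.acquire()
            return self

        async def __aexit__(self, exc_type, exc_value, traceback) -> None:
            self._lock.release()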

MarkusSintonen avatar Sep 02 '24 17:09 MarkusSintonen

FYI, AnyIO 4.5.0 is out now with the fast_acquire option.

agronholm avatar Sep 19 '24 09:09 agronholm

Should we keep AnyIO but instead use the fast_acquire=True mode in _synchronization.py?

Exercise caution. What does the pull request for that look like? That'll be a useful highlighter.

lovelydinosaur avatar Sep 19 '24 10:09 lovelydinosaur