
Remove _next_expiration field from CookieJar

Open rgeronimi opened this issue 2 years ago • 17 comments

Is your feature request related to a problem?

The _next_expiration field seems unused in CookieJar. It is only written to and never read in any part of aiohttp:

https://github.com/aio-libs/aiohttp/blob/a57dc3146839e2e1978b96db48e1de3af1a2bb50/aiohttp/cookiejar.py#L84

Describe the solution you'd like

Remove the field if it is unused.

Describe alternatives you've considered

It may have been intended for functionality that was never fully implemented (e.g., fast removal of expired cookies).

Related component

Client

Additional context

No response

Code of Conduct

  • [X] I agree to follow the aio-libs Code of Conduct

rgeronimi avatar Nov 05 '23 09:11 rgeronimi

Indeed, this _next_expiration field was used to perform fast removal of expired cookies, but a later commit deleted the line of code that used it: https://github.com/aio-libs/aiohttp/blob/d65f5cb4751651c7c43ac3744c92cbd4b23eec04/aiohttp/cookiejar.py#L87

Since that line of code has been deleted, _next_expiration is now useless (and cookie cleanup is slower, as nothing uses the field to cut the clearing process short).

The commit that deleted this line of code is https://github.com/aio-libs/aiohttp/pull/5249 by @derlih

As an alternative to removing _next_expiration entirely, that line of code could be reintroduced. However, this is more difficult now: the same commit removed the _do_expiration code and delegated its work to the clear method, which now serves additional needs, such as custom selection of cookies to clear through a callback function. To revert this, the old _do_expiration code would need to be reinstated instead of delegating the work to the clear method.

More broadly, frequent execution paths (_do_expiration is called very often across the entire API) should have simple code that benefits from shortcuts (like the _next_expiration field, which can determine whether the expiration process can be skipped entirely), while rare paths (like a clear operation with a custom selection function) could be isolated into dedicated code.
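To make the shortcut concrete, here is a minimal, hypothetical sketch of the fast path being described. The names (_next_expiration, _do_expiration) mirror aiohttp's CookieJar, but the code is illustrative only and not the actual implementation:

```python
import time


class SketchJar:
    """Illustrative jar showing the _next_expiration fast-path idea."""

    def __init__(self):
        self._cookies = {}        # name -> value (simplified)
        self._expirations = {}    # name -> absolute expiry timestamp
        self._next_expiration = float("inf")

    def set_cookie(self, name, value, max_age):
        self._cookies[name] = value
        when = time.monotonic() + max_age
        self._expirations[name] = when
        # Track the earliest expiry so _do_expiration can bail out early.
        self._next_expiration = min(self._next_expiration, when)

    def _do_expiration(self):
        now = time.monotonic()
        if now < self._next_expiration:
            return  # fast path: no cookie can be expired yet
        next_expiration = float("inf")
        for name, when in list(self._expirations.items()):
            if when <= now:
                del self._cookies[name]
                del self._expirations[name]
            else:
                next_expiration = min(next_expiration, when)
        self._next_expiration = next_expiration
```

On the frequent path (no cookie due to expire yet), _do_expiration returns after a single comparison instead of iterating over the whole jar.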

rgeronimi avatar Nov 05 '23 09:11 rgeronimi

#7784 is also looking at removing duplicate calls to _do_expiration(). @bdraco Any opinion on whether this is worth reimplementing, or if it becomes a bit redundant after your changes?

Dreamsorcerer avatar Nov 06 '23 13:11 Dreamsorcerer

Ideally the _next_expiration functionality would be restored.

In Home Assistant we see a lot of time spent in filter_cookies because sessions are shared between integrations which is aggravated by _do_expiration being called multiple times currently. Home Assistant tends to have a lot of cookies built up in the cookie jar which would make the restoration of the _next_expiration functionality quite helpful for performance since most calls to _do_expiration would return almost immediately.

bdraco avatar Nov 06 '23 13:11 bdraco

Same here - I use aiohttp for a large-scale crawler. It works wonders, except for the time it spends processing the cookie jar. The code seems complex and potentially optimized not for the frequent case (expiring cookies before selecting the set of active cookies to send to a website) but for rare cases (clearing cookies based on custom logic).

rgeronimi avatar Nov 06 '23 14:11 rgeronimi

py-spy of a request. Purple area is the time spent filtering cookies

[Screenshot: py-spy flame graph of a request]

Zoomed in: [Screenshot]

bdraco avatar Nov 06 '23 14:11 bdraco

How many cookies were in the jar? I have several thousand (across a couple hundred domains).

rgeronimi avatar Nov 06 '23 14:11 rgeronimi

I don't have a good way to inspect that as it's from a live running instance, but I can clearly say nowhere near that many. I'd ballpark it at 50-100 at most. I expect that for your use case you probably spend the bulk of the request time in the cookie jar.

bdraco avatar Nov 06 '23 14:11 bdraco

Yep, that's what I see. The cookie jar is not scalable, specifically due to the expiration logic. I didn't see this through py-spy, but noticed it when switching the crawler from a DummyCookieJar to the standard CookieJar.

rgeronimi avatar Nov 06 '23 14:11 rgeronimi

Sounds like reintroducing it is the correct approach then.

Dreamsorcerer avatar Nov 06 '23 14:11 Dreamsorcerer

In fact, the entire logic could be switched to a time-ordered expiry queue. That way, there is never a need to loop over every cookie: only the ones at the front of the queue (those due to expire soonest) need to be checked, and the loop exits as soon as the next cookie from the queue is not yet expired. The heapq standard library module provides the priority-queue algorithm for free.
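A minimal sketch of that idea, assuming a simplified jar (not aiohttp's actual CookieJar API):

```python
import heapq
import time


class HeapExpiryJar:
    """Illustrative jar using a heapq-based, time-ordered expiry queue."""

    def __init__(self):
        self._cookies = {}
        self._heap = []  # (expiry_timestamp, name) pairs, earliest first

    def set_cookie(self, name, value, max_age):
        self._cookies[name] = value
        heapq.heappush(self._heap, (time.monotonic() + max_age, name))

    def _do_expiration(self):
        now = time.monotonic()
        # Only the front of the queue needs checking: stop at the first
        # entry that is not yet expired.
        while self._heap and self._heap[0][0] <= now:
            _, name = heapq.heappop(self._heap)
            # A cookie may have been overwritten with a later expiry; a
            # real implementation would detect and skip stale heap entries.
            self._cookies.pop(name, None)
```

This makes _do_expiration cost proportional to the number of cookies actually expiring, not to the total size of the jar. Handling overwritten cookies (stale heap entries) is the main extra complexity a real implementation would need.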

rgeronimi avatar Nov 06 '23 14:11 rgeronimi

Well, feel free to come up with a proposal; if it's clearly better without too much complexity, then we can merge it. Just bear in mind a third PR currently in progress that affects cookie performance: #7777.

Dreamsorcerer avatar Nov 06 '23 14:11 Dreamsorcerer

Sounds great, I will wait for #7777 to be merged and then, depending on if it resolved this issue or not and the new capabilities it offers, make a proposal.

rgeronimi avatar Nov 06 '23 14:11 rgeronimi

It doesn't change the expirations, but will reduce the number of cookies being looked at in filter_cookies(). Mainly just want to make sure you don't end up with a load of merge conflicts.

Dreamsorcerer avatar Nov 06 '23 14:11 Dreamsorcerer

I understand, I will find the best path.

rgeronimi avatar Nov 06 '23 14:11 rgeronimi

In case it helps to test against, below is a basic benchmark (it can likely be improved):

import timeit
from http import cookies

from yarl import URL

from aiohttp import CookieJar


def filter_large_cookie_jar():
    """Benchmark filtering a large cookie jar."""
    jar = CookieJar()
    c = cookies.SimpleCookie()
    domain_url = URL("http://maxagetest.com/")
    other_url = URL("http://otherurl.com/")

    for i in range(5000):
        cookie_name = f"max-age-cookie{i}"
        c[cookie_name] = "any"
        c[cookie_name]["max-age"] = 60
        c[cookie_name]["domain"] = "maxagetest.com"
    jar.update_cookies(c, domain_url)
    assert len(jar) == 5000
    assert len(jar.filter_cookies(domain_url)) == 5000
    assert len(jar.filter_cookies(other_url)) == 0

    filter_domain = timeit.timeit(lambda: jar.filter_cookies(domain_url), number=1000)
    filter_other_domain = timeit.timeit(
        lambda: jar.filter_cookies(other_url), number=1000
    )
    print(f"filter_domain: {filter_domain}")
    print(f"filter_other_domain: {filter_other_domain}")


filter_large_cookie_jar()

bdraco avatar Nov 12 '23 22:11 bdraco

This is a great benchmark - I will use it and publish sample measurements for future reference.

rgeronimi avatar Nov 16 '23 12:11 rgeronimi

> Sounds great, I will wait for #7777 to be merged and then, depending on if it resolved this issue or not and the new capabilities it offers, make a proposal.

PR is merged now, so feel free to propose further changes.

Dreamsorcerer avatar Jan 21 '24 00:01 Dreamsorcerer