websockets question: intended behavior with failed auth and async for auto-reconnect

I'm writing a websocket client and having trouble reconciling resiliency and authentication error handling. I'm using HTTP basic auth in my backend, sending the auth details as an extra header when connecting. When the auth data is invalid, my backend responds to the connection request with an http 401

I started with this form, as per the FAQ https://websockets.readthedocs.io/en/stable/faq/client.html#how-do-i-reconnect-when-the-connection-drops

async for websocket in websockets.connect( endpoint, extra_headers = headers ):
    try:
        await do_stuff( websocket )
    except asyncio.exceptions.CancelledError:
        break
    except websockets.ConnectionClosed:
        continue

I found that when the auth was invalid, it would loop indefinitely retrying. Obviously, being a 4xx class error, it will never succeed on a subsequent retry.

So, I switched to this https://github.com/python-websockets/websockets/issues/1231

try:
    async with websockets.connect( endpoint, extra_headers = headers ) as websocket:
        try:
            await do_stuff( websocket )
        except asyncio.exceptions.CancelledError:
            pass
        except websockets.ConnectionClosed:
            pass
except websockets.exceptions.InvalidStatusCode as err:
    print( err )
    sys.exit( 1 )

This behaves as I expect in both invalid and valid auth use cases.

How am I supposed to be using the library such that it correctly handles invalid auth without retries and also has resiliency against lost connections?

Jul 14 '24 20:07 francisferrell

In short, you're expecting websockets to tell retriable errors apart from non-retriable errors here

If memory serves, when I implemented the automatic reconnection feature, I decided to walk away from trying to make that distinction automatically because I couldn't draw confidently a line between retriable and non-retriable that would work for all users.

If you had to draw that line, what criteria would you use? Apart from the specific example here (HTTP 403 isn't retriable)?

Jul 16 '24 12:07 aaugustin

In my particular case, it's a 401 used when authentication credentials are missing or invalid. I would deem this to be a non-retriable situation. That's pretty clear cut to me personally, the user's password or API token is unlikely to suddenly become valid. My use case doesn't include 403 so I hesitate to comment.

As an application developer using this library, my desire is that such an auto reconnect feature makes my app substantially more resilient again things I'm not in control of. That primarily means connection timed out, connection refused, and many maybe even all 5xx class response codes. Meanwhile, 4xx is largely within my control and if I get one then it means I need to change something about how I'm connecting before the next attempt.

As a library developer, if I were in your position, I'm not sure how I'd draw the line for all possible http response codes. No answer is likely to work for all users. That being said, I am drawn to the fact that different 3xx codes have explicit semantics around retrying, e.g. temporary vs permanent redirect. 4xx feels like a range of codes that are not retriable. I'm not sure if the spec actually backs up my feeling; but then the spec will only take one so far, as real browser and server implementations deviate from the spec.

Jul 17 '24 20:07 francisferrell

Ultimately I decided to implement auto-reconnect myself with a service layer presenting a run_forever() entrypoint with exponential backoff in the reconnect logic.

I still think that, in theory, async for websockets.connect() could benefit from understanding what types of http responses to the Upgrade request are retryable and which are not. I also understand that without concrete user interest in that feature it's reasonable to close a ticket like this. Thanks for your work providing this excellent library!

Aug 09 '24 00:08 francisferrell

I also understand that without concrete user interest in that feature it's reasonable to close a ticket like this.

I also have great interest in this feature, for the exact same use case @francisferrell described. I consider this issue solved for me when my application can give the user the option to re-enter credentials on an Unauthorized 401 while still be resilient to connection timeouts and such.

It also might be relatively good timing for this issue since __aiter__ isn't implemented in the new class connect yet. @aaugustin Would PRs be welcome?

Aug 21 '24 11:08 jbdyn

At this point I'm more interested in designing a good API than in a PR.

I see two main options:

we make the decision of what is a retriable or non-retriable error (tough luck for users who have a different definition)
we let the user pass a callable that decides whether an exception is a retriable error (with a sane default — I'm still leaning towards "everything is retriable" because even 401 / 403 errors could be fixed by tuning authentication / authorization configuration on the server, or they could be returned by a proxy because you're on a wifi with a captive portal where you haven't logged in yet, etc.)

Aug 21 '24 21:08 aaugustin

What API would you suggest and why?

Aug 21 '24 21:08 aaugustin

I see two main options:

I would go for option 2 for now.

What API would you suggest and why?

(The following is all untested.)

How about that:

The user can give the keyword arguments cancel_exceptions: Iterable[WebsocketException] = None and cancel_callable: Callable | Awaitable = None to websockets.connect(...)
The main loop in websockets.connect would then be written like:

# src/websockets/connect.py

# __aiter__

# UNTESTED: `exceptions` cannot be empty, and we also
# cannot implement an `except: ...` branch conditionally, so
# we need a `NoOpException` here to fill the gap
exceptions = tuple(cancel_exceptions) or (NoOpException,) 

try:
    async with self as protocol:
        yield protocol
except exceptions as exc:
    if cancel_callable is not None:
        result = cancel_callable(exc) # we can check for a coroutine here
        if result is None:
            calculate_backoff_time()
            continue # cancel_callable fixed things or just informed
        else:
            break
    else:
        raise exc
except Exception:
    calculate_backoff_time()
    continue
else:
    reset_backoff_time()

Or, still with the API above, the user just gives the keyword argument cancel_exceptions: Iterable[WebsocketException] = None to websockets.connect(...) and then needs to write an own try: ... except: ... block around async for websockets.connect(...) like so:

# path/to/user/code.py

cancel_exceptions = (InvalidStatus, InvalidURL)

try:
    async for websocket in websockets.connect(cancel_exceptions=cancel_exceptions):
        do_stuff(websocket)
except cancel_exceptions as exc:
    result = cancel_callable(exc)
    if result is None:
        continue
    else:
        break

The cancel_callable could then in both ways be written with functools.singledispatch like so:

# path/to/user/code.py

from functools import singledispatch

@singledispatch
def cancel_callable(exc):
    print("generic cancel callable")

@cancel_callable.register(InvalidStatus):
def _(exc):
    print(f"invalid status with {exc.status_code}")

@cancel_callable.register(InvalidURL):
def _(exc):
    url = input(f"please fix {url}: ")

Otherwise, one could also address the exceptions in individual except: ... branches without the need for functools.singledispatch. However, this only applies to the user-defined code, and not to websockets.connect.

In the end, we just have a small but still useful addition to the API while the user can gain full control over exceptions from async for websocket in websockets.connect(...). Also, with the usage of functools.singledispatch or functools.singledispatchmethod, we have a nice recipe for a flexible dispatching pattern.

Aug 22 '24 08:08 jbdyn

Thank you for the suggestion. That's useful.

I'd like to keep the API and implementation as simple as possible. Ideally, I'd like only one new argument, when your first proposal includes two of them. The second proposal has only one; I prefer that.

The original use case was about not retrying for some HTTP error codes that are deemed unrecoverable. All of them result in InvalidStatus. Unfortunately, this makes it impossible to discriminate between them with cancel_exceptions as proposed above.

Here's a third proposal:

    async def __aiter__(self) -> AsyncIterator[WebSocketClientProtocol]:
        ...
        while True:
            try:
                async with self as protocol:
                    yield protocol
            except Exception as exc:

                # START NEW CODE
                if self.is_exception_fatal(exc):
                    break
                # END NEW CODE

                ...  # apply exponential backoff here...
                continue

It's similar to your second proposal in the sense that the cancel_callable logic that you have in mind would run outside of websockets.

However, it's more customizable, in the sense that you can implement arbitrary logic in is_exception_fatal — and you can get fancy with functools.singledispatch if you'd like :-)

@singledispatch
def is_exception_fatal(exc):
    return False

@is_exception_fatal.register(InvalidStatus):
def _(exc):
    # Client errors need to be fixed by the client. They aren't retried automatically.
    return 400 <= exc.response.status_code < 500

Still TBD:

whether the semantics are is_exception_fatal or is_exception_retriable — no argument to favor one over the other comes to mind immediately
what's the best name — maybe there are better names that those in 1.

Aug 22 '24 13:08 aaugustin

Do you think that this proposal would support your use case?

Aug 22 '24 13:08 aaugustin

I'm also growing sympathetic to considering HTTP 4xx as fatal by default, as they're defined as "client errors", implying that the client needs to be changed before retrying.

Aug 22 '24 13:08 aaugustin

Looking at all HTTP error codes, actually, very few are worth retrying.

Redirects (301, 302, 303, 307, 308) should be followed.
Server crashed / down / unreachable (500, 502, 503, 504) should be retried.

For all others HTTP codes, repeating the request seems unlikely to result in a different response.

Aug 22 '24 13:08 aaugustin

So my proposal is to retry:

all I/O errors, timeouts, invalid responses, etc.
if the HTTP request gets a valid HTTP response, only HTTP 500, 502, 503, 504
all unknown exceptions

Aug 22 '24 13:08 aaugustin

Do you think that this proposal would support your use case?

Yes, the logic definitely would. I like it. :+1: With that callable, I have the options to either build (some) functionality right into the connection loop or I just raise exceptions in there and deal with them outside.

I wonder if it should be

# START NEW CODE
if self.is_exception_fatal(exc):
    raise exc # instead of `break`
# END NEW CODE

or just

# START NEW CODE
self.is_exception_fatal(exc)
# END NEW CODE

without even bothering what might be returned. What's your opinion on that?

Still TBD:

whether the semantics are is_exception_fatal or is_exception_retriable — no argument to favor one over the other comes to mind immediately

what's the best name — maybe there are better names that those in 1.

What do you think about process_exception or process_exceptions?
It would be consistent with process_request and process_response of websockets.asyncio.server.serve and suggests a callable right away. Additionally, it does not decide whether a callable should be fatal or retriable and leaves that to the user.

Looking at all HTTP error codes, actually, very few are worth retrying.

Redirects (301, 302, 303, 307, 308) should be followed.

Server crashed / down / unreachable (500, 502, 503, 504) should be retried.

For all others HTTP codes, repeating the request seems unlikely to result in a different response.

This makes sense to me.

For the record, here is a source: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status

So my proposal is to retry:

all I/O errors, timeouts, invalid responses, etc.

if the HTTP request gets a valid HTTP response, only HTTP 500, 502, 503, 504

all unknown exceptions

Looks, good. I wonder if unknown exceptions should be raised instead, since they might be the cause of a user implementation or library bug? :thinking:

Aug 23 '24 08:08 jbdyn

self.is_exception_fatal(exc)

Code changes are getting pretty minimal at this point ;-)

What do you think about process_exception?

I like it!

If we're just adding:

self.process_exception(exc)

then the semantics of process_exception must be "if you consider the exception fatal, re-raise it, else return None". Then I'm wondering if there's a risk that users will return the exception (or another exception) instead of raising it.

If we wanted the API to be "return the exception or None", which would maximize similarity with process_request and process_response (thus respecting the principle of least astonishment), then the code would look like:

try:
    connect(...)
except Exception as exc:
    new_exc = self.process_exception(exc)
    if new_exc is exc:
        raise
    elif new_exc is not None:
        raise new_exc

which looks like the best option to me.

I wonder if unknown exceptions should be raised instead

I've been wondering the same yesterday after submitting my comment. I'm now leaning towards retrying:

OSError and TimeoutError
HTTP 500, 502, 503, 504

which are clearly in the "must be retried" category. Then, if we get complaints about errors that should be retried and aren't, we can add them. Since we're just making a TCP connection, there's only so many ways it can fail.

SSLError is a subclass of OSError. Most SSL errors should be retried; arguably SSLCertVerificationError isn't really useful to retry; this is where we start walking down the rabbit hole of "what is retriable and what isn't"... I'm going to draw the line at "let's retry every OSError".

If the connection is closed before we get a full HTTP response, including when we don't get a single byte of response, websockets raises TimeoutError after open_timeout. Retrying is the correct behavior.

The other exceptions that connect() is documented to raise are:

InvalidURI: that one ain't gonna fix itself; not retrying is OK.
InvalidHandshake: we said that we wanted to retry HTTP 500, 502, 503, 504 and abort everything else.

Aug 23 '24 13:08 aaugustin

Code changes are getting pretty minimal at this point ;-)

My favorite :grin:

[...] the semantics of process_exception [...]

I definitely see your point. Personally, I barely wrote code returning an exception, but I raised a ton of them. Hence, the first option feels much more natural to me than the second, but this is just me. One could call it then on_exception instead of process_exception. That being said, I am also completely fine with your favorite. :+1:

While thinking about this, something else crossed my mind: process_exception could return a dictionary with updated keyword arguments for the new websockets.connect initialization. :eyes: Wouldn't that be nice? But no minimalism anymore...

I'm now leaning towards retrying:

OSError and TimeoutError

HTTP 500, 502, 503, 504

I think you should go for that. And IMHO, at this point it is far better to just try it, rely on feedback from the community and reiterate on that instead of going into the rabbit hole.

Aug 23 '24 13:08 jbdyn

While thinking about this, something else crossed my mind: process_exception could return a dictionary with updated keyword arguments for the new websockets.connect initialization. 👀 Wouldn't that be nice? But no minimalism anymore...

The longer I think about this idea, the more I dislike it. One can just use process_exception for that: break out of the loop, update the parameters according to the reason and re-initialize the loop. Done.

Aug 23 '24 17:08 jbdyn

Agreed, that would be too complicated.

Aug 23 '24 19:08 aaugustin

On the semantics of process_exception — you have a valid point — returning en exception is less natural than raising it.

The counter point is — if the API is "return an exception or None" and a user raises the exception instead of returning it, they will still get the right behavior. I don't see that code changing much in the future so this accidental behavior is likely to continue working.

Conversely, if the API was "raise an exception or return None", returning something other than None would be treated the same as returning None (or require a dedicated error handling path).

Aug 24 '24 07:08 aaugustin

To be honest, I am a bit puzzled right now 🤔

If we wanted the API to be "return the exception or None", which would maximize similarity with process_request and process_response (thus respecting the principle of least astonishment) [...]

This is the only real benefit of option 2 I am able to see.

[...] and a user raises the exception instead of returning it, they will still get the right behavior.

Here is the decision between "the user only raises the exception" and "the user raises the exception or returns the exception to websockets and websockets raises it", which neither changes the user's effort ("shall I write return or raise?") nor the outcome (an exception is raised) and with:

Conversely, if the API was "raise an exception or return None", returning something other than None would be treated the same as returning None (or require a dedicated error handling path)

I consider option 1 to be enough without the need for error handling (we simply ignore anything returned). With it comes also no return value checking and thus less code to maintain (to be fair, not by much, but still).

[For option 1], I'm wondering if there's a risk that users will return the exception (or another exception) instead of raising it.

I know what you mean, but don't see that as a big concern. I claim, most people raise exceptions anyway (though reality might prove me wrong here).

I would sum it up this way:

process_exception(exc: Exception) -> None (option 1, my preference): one way with raise, less code
process_exception(exc: Exception) -> Exception | None (option 2, your preference): two ways with return and raise and thereby in line with process_request and process_response, alongside with return value handling.

If I am mistaken or missing something, please tell me.

Nonetheless, if you as the maintainer feel like option 2 suits websockets better, I totally see that. You decide :relaxed:

Aug 24 '24 12:08 jbdyn

The PR is missing tests. I did some basic manual testing and it appears to work.

If I am mistaken or missing something, please tell me.

One thing I thought about, but didn't mention -- I care about what tracebacks look like.

I experimented quickly to check what raise exc from new_exc produces when new_exc is exc; the same exception is shown at two lines; that's how I decided that I needed a special code path for that case.

Option 2 makes it possible to just raise. I'm sure that this gives a clean traceback with the exception at the point where it originated. None of the exception handling appears.

I believe that raise exc in option 1 will make the traceback less clean, with the exception shown in process_exception instead. (Didn't try it.)

Overall there's pros and cons to both options. I'll think about it a bit more. Probably writing the tests will help me get a better feel for what the code looks like and make the final decision.

Aug 24 '24 13:08 aaugustin

The PR is missing tests. I did some basic manual testing and it appears to work.

I am delighted to see your PR! :tada:

One thing I thought about, but didn't mention -- I care about what tracebacks look like.

Ah, yes, that could be an issue. Thanks for pointing out. I am curious what you will find.

Overall there's pros and cons to both options. I'll think about it a bit more. Probably writing the tests will help me get a better feel for what the code looks like and make the final decision.

Let me know if there is anything I can assist you with. :slightly_smiling_face:

Aug 24 '24 14:08 jbdyn

About the API of process_request — I decided to support both forms (return and raise).

Indeed, the natural patterns are:

return exc to keep the original exception and make it fatal — this will be the majority use case
raise NewException(...) to change the exception — honestly I don't know the use case for this but it seemed important for you so I included this feature

The documented form remains return to match the majority use case.

Aug 27 '24 19:08 aaugustin

I added tests to #1491. Still two things to do before merging:

tests hang under Python 3.8 😡
I'm not happy with how the logger is set up (I need it even when I don't have a connection so I cannot use the connection's logger)

Aug 27 '24 20:08 aaugustin

Tests were using HTTP 418, which was added in Python 3.9 🤦‍♂️

Aug 28 '24 19:08 aaugustin

About the API of process_request — I decided to support both forms (return and raise).

Thank you so much for your effort!

Sep 02 '24 07:09 jbdyn

I wanted to update my local websockets package, but I noticed that the changes are not on PyPI yet. Without any pressure: May I ask which plans you have for publishing?

Sep 02 '24 09:09 jbdyn

You can install directly the latest commit of the main branch -- see instructions for various situations in this blog post: https://adamj.eu/tech/2019/03/11/pip-install-from-a-git-repository/

I'd like to bring the new asyncio implementation to feature parity with the old one before making the next release. The only missing feature at this point is "following redirects in connect()".

Hopefully, I can get that done in September, although some factors are outside of my direct control (work, family...) ;-)

Sep 03 '24 07:09 aaugustin

Sure! Thanks for the explanation and the outlook. I'm looking forward to it. :relaxed:

Sep 03 '24 14:09 jbdyn

Now available in version 13.1, which I just released.

Sep 21 '24 17:09 aaugustin