Improve retry robustness

Open blakegong opened this issue 4 years ago • 0 comments

Background

In send_notification_batch, we start with self.connect() to self recover from potential connection related errors: https://github.com/Pr0Ger/PyAPNs2/blob/5e4a938cd2e24249f6345aac27ba783defa5a63f/apns2/client.py#L187-L189

However, it is not enough, because the self recovery is done through: APNsClient.connect() -> APNsClient._connection.connect() -> https://github.com/python-hyper/hyper/blob/b77e758f472f00b098481e3aa8651b0808524d84/hyper/http20/connection.py#L331-L347, which is basically:

def connect():
    with self._lock:
        if self._sock is not None:
            return

        # ... (omitted, the code proceeds to rebuild the connection otherwise)

In the cases where self._sock is in a weird state, this call sequence does not help with resetting the connection states, as self._sock does still hold a value despite no longer being able to actually doing meaningful work. As a result, after calling APNsClient.connect(), we still end up using a broken socket. And that scenario is not recoverable.

What we have observed in production was that the send_notification_batch() call just retries 3 times then raises ConnectionFailed 💀 .

Approach

This PR is an attempt to fix that, by forcing a full connection reset from the second retry onwards. We are currently running with a hack in production, by basically doing:

try:
    results = client.send_notification_batch(notifications, topic=bundle_id)
except ConnectionFailed:
    client._connection.close()  # the hack
    retry(...)

And this did work for us. The unrecoverable ConnectionFailed never happened to us after this was deployed.

Some thoughts

I still left the first try to be a "soft" connection reboot so that some other scenario might recover/reset faster, and also that keeps the first retry the same as the current code. But for sure we can change that to be a full reconnection too if you like 😃

Apr 26 '21 15:04 blakegong