Improve retry robustness
Background
In send_notification_batch, we start with self.connect() to self recover from potential connection related errors: https://github.com/Pr0Ger/PyAPNs2/blob/5e4a938cd2e24249f6345aac27ba783defa5a63f/apns2/client.py#L187-L189
However, it is not enough, because the self recovery is done through:
APNsClient.connect() -> APNsClient._connection.connect() -> https://github.com/python-hyper/hyper/blob/b77e758f472f00b098481e3aa8651b0808524d84/hyper/http20/connection.py#L331-L347, which is basically:
def connect():
with self._lock:
if self._sock is not None:
return
# ... (omitted, the code proceeds to rebuild the connection otherwise)
In the cases where self._sock is in a weird state, this call sequence does not help with resetting the connection states, as self._sock does still hold a value despite no longer being able to actually doing meaningful work. As a result, after calling APNsClient.connect(), we still end up using a broken socket. And that scenario is not recoverable.
What we have observed in production was that the send_notification_batch() call just retries 3 times then raises ConnectionFailed 💀 .
Approach
This PR is an attempt to fix that, by forcing a full connection reset from the second retry onwards. We are currently running with a hack in production, by basically doing:
try:
results = client.send_notification_batch(notifications, topic=bundle_id)
except ConnectionFailed:
client._connection.close() # the hack
retry(...)
And this did work for us. The unrecoverable ConnectionFailed never happened to us after this was deployed.
Some thoughts
I still left the first try to be a "soft" connection reboot so that some other scenario might recover/reset faster, and also that keeps the first retry the same as the current code. But for sure we can change that to be a full reconnection too if you like 😃