Failed to receive valid response after 3 retries
Hi All,
Been using customerio-python for a fair bit in a production environment. From time to time we get the following error: Cannot send email for xxxxx message id xx ** Failed to receive valid response after 3 retries.** Check system status at http://status.customer.io.
The same payload/user would go through at another instance.
Python 3.8
Has anyone encountered this before?
Would a simple remedy be adding a retry wrapper on top?
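Something like this generic wrapper is what I have in mind (just a sketch; the name send_with_retries and its parameters are made up, and note that retrying a send can in principle deliver the same email twice):

```python
import time

def send_with_retries(send_fn, *args, max_attempts=3, base_delay=1.0,
                      retry_on=(Exception,), **kwargs):
    # Call send_fn, retrying with exponential backoff on the given exceptions.
    for attempt in range(max_attempts):
        try:
            return send_fn(*args, **kwargs)
        except retry_on:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)
```

Usage would be something like send_with_retries(client.send_email, request, retry_on=(CustomerIOException,)).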
Hello, friend. I have the exact same problem that you described.
Since December 22, from time to time, I receive this same message. Were you able to fix this problem, or did it just stop?
Thanks for your time.
We're seeing this issue too! I've been asking Customer.io support about it.
I thought it might be a thread-safety issue: we run in a multithreaded environment, the CIO Python client shares a single instance of requests.Session, and I have seen concerns elsewhere about threading issues with this configuration. However, I subclassed the client to store the session in a thread-local variable and that did not help, so I suspect the problem is on the server side.
Also I noticed that only transactional email has this problem - track, identify and add_device never experience connection issues for us (transactional email seems to be a separate API with a different hostname).
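For reference, the thread-local approach I tried looked roughly like this (reconstructed from memory; the _session_factory indirection is just for illustration, with a bare object standing in for a configured requests.Session):

```python
import threading

class ThreadLocalSessionMixin:
    # One session object per thread instead of a single shared instance.
    _local = threading.local()

    def _session_factory(self):
        # In the real subclass this returned a configured requests.Session.
        return object()

    @property
    def http(self):
        if not hasattr(self._local, "session"):
            self._local.session = self._session_factory()
        return self._local.session
```

As noted above, this did not make the connection resets go away for us.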
Same here. The complete error message, at least for me, is the following.
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 699, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 445, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 440, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/local/lib/python3.9/http/client.py", line 1377, in getresponse
    response.begin()
  File "/usr/local/lib/python3.9/http/client.py", line 320, in begin
    version, status, reason = self._read_status()
  File "/usr/local/lib/python3.9/http/client.py", line 281, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/local/lib/python3.9/socket.py", line 704, in readinto
    return self._sock.recv_into(b)
  File "/usr/local/lib/python3.9/ssl.py", line 1242, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/local/lib/python3.9/ssl.py", line 1100, in read
    return self._sslobj.read(len, buffer)
ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/requests/adapters.py", line 439, in send
    resp = conn.urlopen(
  File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 755, in urlopen
    retries = retries.increment(
  File "/usr/local/lib/python3.9/site-packages/urllib3/util/retry.py", line 532, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/local/lib/python3.9/site-packages/urllib3/packages/six.py", line 769, in reraise
    raise value.with_traceback(tb)
  File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 699, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 445, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 440, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/local/lib/python3.9/http/client.py", line 1377, in getresponse
    response.begin()
  File "/usr/local/lib/python3.9/http/client.py", line 320, in begin
    version, status, reason = self._read_status()
  File "/usr/local/lib/python3.9/http/client.py", line 281, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/local/lib/python3.9/socket.py", line 704, in readinto
    return self._sock.recv_into(b)
  File "/usr/local/lib/python3.9/ssl.py", line 1242, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/local/lib/python3.9/ssl.py", line 1100, in read
    return self._sslobj.read(len, buffer)
urllib3.exceptions.ProtocolError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/customerio/client_base.py", line 31, in send_request
    response = self.http.request(
  File "/usr/local/lib/python3.9/site-packages/requests/sessions.py", line 542, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.9/site-packages/requests/sessions.py", line 655, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/requests/adapters.py", line 498, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))

During handling of the above exception, another exception occurred:

...
    resp = self.send_request('POST', self.url + "/v1/send/email", request)
  File "/usr/local/lib/python3.9/site-packages/customerio/client_base.py", line 40, in send_request
    raise CustomerIOException(message)
customerio.client_base.CustomerIOException: Failed to receive valid response after 3 retries.
Check system status at http://status.customer.io.
Last caught exception -- <class 'requests.exceptions.ConnectionError'>: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
We see this very regularly when using the Python SDK, to the point of it becoming unusable. Here is what I believe is going on:
The APIClient defines retry parameters as follows:
def __init__(self, key, url=None, region=Regions.US, retries=3, timeout=10, backoff_factor=0.02, use_connection_pooling=True):
This is misleading since it gives the impression that API calls are retried by default and that these retries are configurable. However, they are not: all calls in the client use the HTTP POST method, and the default Retry() configuration only retries idempotent methods. The result is that all calls to send_email() and send_push() are in fact tried only once (not thrice, as the default parameter value seems to suggest).
What's even more confusing is that the log message does print the (fake) number of retries:
Failed to receive valid response after 3 retries.
You can validate this by setting large retries and backoff_factor values and observing that the calls take no longer than they used to.
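You can also see it directly in urllib3's public Retry class:

```python
from urllib3.util.retry import Retry

# The default allowed_methods frozenset covers only idempotent verbs,
# so POST requests are never retried after a connection error.
default_retry = Retry(total=3)
print("POST" in (default_retry.allowed_methods or frozenset()))  # False

# allowed_methods=None makes every verb, including POST, eligible for retry.
retry_all = Retry(total=3, allowed_methods=None)
print(retry_all.allowed_methods)  # None
```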
Judging from the presence of these parameters in the API client, I assume the authors intended to retry send_email() requests despite the non-idempotent side effects. My proposed fix would therefore be to set Retry(..., allowed_methods=None) so that any verb is retried.
If you'd like to work around this while waiting for an upstream fix, see if something like this works (it replaces the client's original _build_session method at runtime):

import types

from customerio import APIClient, Regions
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def build_session_with_retries(self):
    session = super(APIClient, self)._build_session()
    session.headers["Authorization"] = "Bearer {key}".format(key=self.key)
    # Retry the request a number of times before raising an exception,
    # with backoff_factor delaying each retry. allowed_methods=None retries
    # even non-idempotent methods such as POST.
    session.mount(
        "https://",
        HTTPAdapter(
            max_retries=Retry(
                total=self.retries,
                backoff_factor=self.backoff_factor,
                allowed_methods=None,
            )
        ),
    )
    return session


client = APIClient(
    key=api_key,
    region=Regions.EU,
    retries=5,
    backoff_factor=1.0,
)
client._build_session = types.MethodType(build_session_with_retries, client)
Yes we came to the same conclusion about retrying POST requests. That cleared up the issue for us.
Unfortunately, emails are non-idempotent by nature, so retrying opens up the possibility of duplicate sends.
CIO support has unfortunately not been very helpful here, neither investigating the connection-reset issue nor addressing the fact that retrying an API request that sends an email, without any kind of protection against duplicate sends, is a bad idea.
That said, we've never had a customer complaint about duplicate emails, so it may be a purely theoretical concern.
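If it ever turns out to be more than theoretical, a best-effort in-process guard could look like this (my own sketch, not part of the SDK; real protection across processes or restarts would need a shared store such as Redis):

```python
import hashlib
import json

class DuplicateSendGuard:
    # Remembers a hash of every payload sent from this process and
    # refuses to approve sending the same payload twice.
    def __init__(self):
        self._seen = set()

    def should_send(self, payload):
        key = hashlib.sha256(
            json.dumps(payload, sort_keys=True).encode("utf-8")
        ).hexdigest()
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

You would check guard.should_send(request_payload) before each send attempt.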
I just came across the following notes in the README:
The Customer.io Python SDK depends on the Requests library which includes urllib3 as a transitive dependency. The Requests library leverages connection pooling defined in urllib3. urllib3 only attempts to retry invocations of HTTP methods which are understood to be idempotent (See: Retry.DEFAULT_ALLOWED_METHODS). Since the POST method is not considered to be idempotent, any invocations which require POST are not retried.
It is possible to have the Customer.io Python SDK effectively disable connection pooling by passing a named initialization parameter use_connection_pooling to either the APIClient class or CustomerIO class. Setting this parameter to False (default: True) causes the Session to be initialized and discarded after each request. If you are experiencing integration issues where the cause is reported as Connection Reset by Peer, this may correct the problem. It will, however, impose a slight performance penalty as the TCP connection set-up and tear-down will now occur for each request.
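For anyone who lands here, that makes disabling pooling a one-line change at initialization (the key value below is a placeholder):

```python
from customerio import APIClient, Regions

# With use_connection_pooling=False the SDK builds a fresh Session for each
# request and discards it afterwards, so no stale pooled connection can be
# reset by the server mid-request.
client = APIClient(
    "your-transactional-api-key",  # placeholder
    region=Regions.US,
    use_connection_pooling=False,
)
```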
I conclude that they are aware of the issue, and I have a feeling they will not merge my PR in that case. If the issue exists due to a combination of connection pooling and a lack of retries, and disabling connection pooling is indeed the recommended alternative, then I would propose the Customer.io devs remove the retry parameters from APIClient, since there they do literally nothing except confuse people, and keep them only for the CustomerIO client.
I wouldn't be surprised if this is due to a mismatch between urllib3's default client-side connection-pool timeout and Customer.io's server-side connection timeout, with the server disconnecting first. That would explain why we see it only on low-volume apps that keep running, and never on e.g. batch jobs.
For anyone else experiencing this issue: we hit it for a while, and resolved it by initializing with use_connection_pooling=False, the parameter (released in version 1.6) that @skion references in the comment above.
We are seeing the error pop up specifically when updating urllib3 from 1.26.19 (works consistently, basically all the time) to 2.5.0 (haven't tried 2.6.0 yet, but I don't see anything obvious in the changelog). We did try the use_connection_pooling=False workaround in one place, but it looks like we have to set it on APIClient() as well as CustomerIO(), which was a little unintuitive.
Given how predictably this happens with newer versions of urllib3, it might be worth investigating further and seeing whether connection pooling could be disabled by default, or configured so that it works out of the box for most use cases. Happy to file a support ticket if that would help get priority on this.