
Intermittent NetworkError

Open phillipadsmith opened this issue 4 years ago • 32 comments

Warning: Github issues are not an official support channel for Recurly. If you have an urgent request we suggest you contact us through an official channel: [email protected] or https://recurly.zendesk.com

Describe the bug

I've built a simple integration using this library and I am seeing an intermittent recurly.NetworkError exception with the message "Remote end closed connection without response".

This appears to be happening most often for this API call: client.list_accounts()

To Reproduce

You can see the call in context here: https://github.com/TheTyee/manage-account.thetyee.ca/blob/master/app.py#L56-L79

Expected behavior

For it to work consistently! :)

Your Environment

  • Which version of this library are you using? Recurly Python client 3.0

  • Which version of the language are you using? Python 3.8.2

phillipadsmith avatar Mar 24 '20 19:03 phillipadsmith

Thanks for the issue, @phillipadsmith. We are currently working on making the clients more resilient and pushed out some changes in Ruby and Node this week to address this problem. The client should be able to gracefully handle this by retrying on "open" timeouts. Apologies for the problems this is causing. We can prioritize fixing Python next.

bhelx avatar Mar 24 '20 20:03 bhelx

Are there any other details you're seeing in the response? I want to make 100% sure we're gonna solve the right problem.

bhelx avatar Mar 24 '20 20:03 bhelx

Are there any other details you're seeing in the response? I want to make 100% sure we're gonna solve the right problem.

Hi @bhelx -- thanks for the quick reply. That's all of the information I'm getting back from the exception. If you have a suggestion for how I can get a more verbose error message, I'd be happy to try it out.

phillipadsmith avatar Mar 24 '20 20:03 phillipadsmith

Sure, let me know if you have any more info. We'll keep moving forward with the expected fix in the meantime.

bhelx avatar Mar 24 '20 20:03 bhelx

@phillipadsmith what's the purpose of limit=1 here? https://github.com/TheTyee/manage-account.thetyee.ca/blob/master/app.py#L57

bhelx avatar Mar 24 '20 20:03 bhelx

If you're just trying to get the first / only account that matches, this will help you when it lands: https://github.com/recurly/recurly-client-python/pull/367

bhelx avatar Mar 24 '20 20:03 bhelx

@phillipadsmith what's the purpose of limit=1 here? https://github.com/TheTyee/manage-account.thetyee.ca/blob/master/app.py#L57

I was hoping that requesting only one record would shorten the response time. And I'm only searching for one record (only one should ever be returned).

If you're just trying to get the first / only account that matches, this will help you when it lands: #367

Ah-ha. Yes, that's what I'm trying to do. I'll keep an eye out for that release.

phillipadsmith avatar Mar 24 '20 20:03 phillipadsmith

Great, just checking. That's an appropriate way to use it. There has been a little confusion over what limit means, so I wanted to be sure.

bhelx avatar Mar 24 '20 20:03 bhelx

@bhelx What was the fix implemented in the Ruby and Node modules? Is there a similar tactic I could implement in my version of the module, or in the application itself, e.g., catching the exception and trying the request again some limited number of times?

phillipadsmith avatar Mar 24 '20 23:03 phillipadsmith

@bhelx And what is the root cause of the timeouts? Is it the Recurly API not responding? I've never experienced this timeout when using the API endpoints directly with Postman or curl or similar.

phillipadsmith avatar Mar 24 '20 23:03 phillipadsmith

@phillipadsmith I can't say what the root cause is just yet. At scale though, you're gonna have some random TCP-level blips. Most of the time, it's something that can be recovered from. Some http libraries handle this for you under the hood and postman may be doing that as well. We are trying our best to keep these clients dependency-free for security and maintainability reasons, so we are implementing some of this ourselves. Here is the Ruby change that went out: https://github.com/recurly/recurly-client-ruby/pull/568

There isn't a clean way to do this with what we have in master, but keep in mind that you can safely retry GET requests as many times as you want. So you could catch a network error, sleep a few milliseconds, and retry some number of times.
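Something along these lines should work as a stopgap (a rough sketch only; the helper name, attempt count, and delay are arbitrary, and client is assumed to be an existing recurly.Client):

import time
import recurly

# Sketch of the retry idea above: GET requests are safe to repeat, so a
# recurly.NetworkError can be caught and the call retried a few times with
# a short pause between attempts.
def get_with_retries(make_request, attempts=3, delay=0.1):
    for attempt in range(attempts):
        try:
            return make_request()
        except recurly.NetworkError:
            if attempt == attempts - 1:
                raise
            time.sleep(delay)

# Hypothetical usage:
# account = get_with_retries(lambda: client.get_account("code-some-account"))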

bhelx avatar Mar 24 '20 23:03 bhelx

Again, sorry about this. I'll be working on implementing and testing this over the next couple of days.

bhelx avatar Mar 24 '20 23:03 bhelx

Again, sorry about this. I'll be working on implementing and testing this over the next couple of days.

@bhelx No need to apologize. I appreciate the quick replies and it's helpful just to hear that I'm not crazy and that this is a known issue. :)

phillipadsmith avatar Mar 25 '20 00:03 phillipadsmith

Couple of quick questions, @bhelx:

At scale though, you're gonna have some random TCP-level blips. Most of the time, it's something that can be recovered from.

True enough, but I've only got this on a test server at the moment, so we're not running any traffic against it yet -- it's just me and one other person testing, and the exception happens quite often (almost too often, IMHO, to be a blip -- almost like there's another issue at play).

Some http libraries handle this for you under the hood and postman may be doing that as well.

Yes, interestingly, I've never had this happen when running the app with the Flask development server, and I've never seen it in Postman.

And it never happens immediately after I start the app with Gunicorn, only after some time has passed, perhaps hours. Then it happens very frequently and very consistently until I restart the application.

Is there anything about running this library in a pre-fork WSGI server that might be introducing this network error?

phillipadsmith avatar Mar 25 '20 00:03 phillipadsmith

it's just me and one other person testing, and the exception happens quite often (almost too often, IMHO, to be a blip -- almost like there's another issue at play).

This is something I personally see more on the order of 1 out of ~10,000 requests. If you are able to reproduce, could you try to print out some more information from the underlying socket error? From the message you pasted, it sounds like a RemoteDisconnected error but there may be some more info in the message we can use to diagnose.

Is there anything about running this library in a pre-fork WSGI server that might be introducing this network error?

So your web server is multi-process and not multi-threaded, right? It is possible, I suppose. I'm not sure how Python would handle forking an instance of a Client with an already open connection: https://github.com/TheTyee/manage-account.thetyee.ca/blob/master/app.py#L44

Since forking is copy-on-write, my gut would tell me that you want to create the connection after you fork. You could test this theory by trying to create the client inside of the accounts_search method. This wouldn't be the recommended way for production because you want to re-use the connection where possible, but if the problem goes away you'll have more evidence that forking is causing the issue.
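Roughly something like this (a sketch only -- the route, env var, and account lookup here are made up, not taken from your app):

import os
import recurly
from flask import Flask

app = Flask(__name__)

# Building the client inside the handler means each forked Gunicorn worker
# (and each request) opens its own connection instead of inheriting one
# that was opened before the fork.
@app.route("/accounts/<code>")
def accounts_search(code):
    client = recurly.Client(os.environ["RECURLY_API_KEY"])
    account = client.get_account("code-" + code)
    return account.email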

bhelx avatar Mar 25 '20 19:03 bhelx

@phillipadsmith I'm not sure if Flask has a special way to deal with this, but also consider using a "post-fork" hook of some kind: https://docs.pylonsproject.org/projects/pyramid_cookbook/en/latest/deployment/forked_threaded_servers.html?highlight=sqlalchemy

Perhaps Flask has a way of letting you run some code after a fork happens.

bhelx avatar Mar 25 '20 19:03 bhelx

Thanks for those thoughts, @bhelx, appreciated.

This is something I personally see more on the order of 1 out of ~10,000 requests.

Oh, yes, definitely able to get it 8/10 times right now.

could you try to print out some more information from the underlying socket error?

Is there an easy way to get more verbose output from the exception? My assumption was that it was coming from the Recurly Python client.

Many thanks in advance!

phillipadsmith avatar Mar 25 '20 19:03 phillipadsmith

Oh, yes, definitely able to get it 8/10 times right now.

Oh, yeah, that suggests a different problem entirely. I think the fix I'm working on might help, but we need to address the root problem.

Is there an easy way to get more verbose output from the exception? My assumption was that it was coming from the Recurly Python client.

A stack trace would be helpful. Let it explode. When I provoke one, it looks like this:

Traceback (most recent call last):
  File "./clients/python/recurly/base_client.py", line 42, in _make_request
    http_resp = self.__conn.request(method, path, body, headers=headers).getresponse()
  File "/usr/lib/python3.7/http/client.py", line 1244, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.7/http/client.py", line 1290, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.7/http/client.py", line 1239, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.7/http/client.py", line 1026, in _send_output
    self.send(msg)
  File "/usr/lib/python3.7/http/client.py", line 966, in send
    self.connect()
  File "/usr/lib/python3.7/http/client.py", line 1399, in connect
    super().connect()
  File "/usr/lib/python3.7/http/client.py", line 938, in connect
    (self.host,self.port), self.timeout, self.source_address)
  File "/usr/lib/python3.7/socket.py", line 707, in create_connection
    for res in getaddrinfo(host, port, 0, SOCK_STREAM):
  File "/usr/lib/python3.7/socket.py", line 748, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "scripts/v2019-10-10/python/create_account.py", line 33, in <module>
    account = client.create_account(account_create)
  File "./clients/python/recurly/client.py", line 133, in create_account
    return self._make_request("POST", path, body, None)
  File "./clients/python/recurly/base_client.py", line 71, in _make_request
    raise NetworkError(e)
recurly.NetworkError: [Errno -2] Name or service not known

bhelx avatar Mar 25 '20 19:03 bhelx

πŸ‘Let me see if I can get a stack trace.

phillipadsmith avatar Mar 25 '20 19:03 phillipadsmith

I've imported traceback and I'm trying to get the error again 🤞

phillipadsmith avatar Mar 25 '20 21:03 phillipadsmith

@bhelx Hey, I'm not getting any output from traceback when an exception occurs: https://github.com/TheTyee/manage-account.thetyee.ca/blob/master/app.py#L79-L80

Any idea what I've got wrong here?

phillipadsmith avatar Mar 25 '20 22:03 phillipadsmith

I'm not really familiar with its usage. What happens if you just remove the whole except block for network exceptions? That's what I did in my script; Python dumps the stack trace for you.
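If you'd rather keep the handler, another option (just a sketch; the call inside the try is hypothetical and client is assumed to be your recurly.Client) is to log the full traceback before re-raising:

import traceback
import recurly

try:
    account = client.get_account("code-some-account")  # hypothetical call
except recurly.NetworkError:
    # Print the full chained traceback (including the underlying socket
    # error) instead of just the exception message, then re-raise so the
    # failure still surfaces in the logs.
    traceback.print_exc()
    raise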

bhelx avatar Mar 25 '20 22:03 bhelx

πŸ€¦β€β™‚ Didn't even think of that.

phillipadsmith avatar Mar 25 '20 22:03 phillipadsmith

@phillipadsmith did you look into my earlier comment about creating the client after fork? I'm reading some people recommend you use the before_first_request decorator to make sure that a database or http client connection is created in each fork:

  • https://stackoverflow.com/questions/43554355/why-is-cassandra-hanging-when-i-connect-to-it-via-gunicorn-and-flask
  • https://stackoverflow.com/questions/46683662/flask-post-fork-connection-to-cassandra
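As a rough sketch of what those threads suggest (the hook runs once per worker process, so each forked worker builds its own client; names and the env var here are illustrative):

import os
import recurly
from flask import Flask

app = Flask(__name__)
client = None

# before_first_request runs once in each worker process, after the fork, so
# every Gunicorn worker gets its own recurly.Client instead of sharing one
# created in the master.
@app.before_first_request
def create_recurly_client():
    global client
    client = recurly.Client(os.environ["RECURLY_API_KEY"])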

I'm not sure how to reproduce your problem so I can't test this assumption on your app.

bhelx avatar Mar 25 '20 23:03 bhelx

did you look into my earlier comment about creating the client after fork? I'm reading some people recommend you use the before_first_request decorator to make sure that a database or http client connection is created in each fork:

I will read up on those and give them a try also. I'm still trying to get a useful stack trace!

Thanks @bhelx, appreciated.

phillipadsmith avatar Mar 26 '20 01:03 phillipadsmith

did you look into my earlier comment about creating the client after fork? I'm reading some people recommend you use the before_first_request decorator to make sure that a database or http client connection is created in each fork:

Thanks for your help, @bhelx.

I've made an update to the app so that Gunicorn will receive a post_fork hook that creates a Recurly client.

I haven't tested extensively, but I'm currently not able to trigger the error. I'll keep you updated.
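For reference, the idea is roughly this (a simplified sketch, not the exact code in the app; how the application code then reaches the client is a separate question):

# gunicorn.conf.py -- Gunicorn calls post_fork(server, worker) in each
# worker process right after the master forks it.
import os
import recurly

def post_fork(server, worker):
    # Build a fresh client in the child process so no connection state is
    # inherited from the master.
    worker.recurly_client = recurly.Client(os.environ["RECURLY_API_KEY"])
    server.log.info("worker %s: created Recurly client after fork", worker.pid)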

phillipadsmith avatar Apr 01 '20 07:04 phillipadsmith

Hey @bhelx!

Quick update: the post_fork hook didn't fix the issue; I continued to get the "Remote end closed connection without response" network errors.

I'm now trying the approach of creating the client inside the account_search method.

I'll read up on the before_first_request hook also.

🤞

phillipadsmith avatar Apr 01 '20 20:04 phillipadsmith

Happy Saturday @bhelx,

Just a quick update on this:

  • As crazy as it seemed, I tested your idea of [creating the client inside of each method](https://github.com/TheTyee/manage-account.thetyee.ca/blob/master/app.py).
  • I then ran automated tests overnight, testing the complete flow that triggers each method.
  • After 12 hours of testing (~600 tests), no errors have been reported. So it seems like that resolved the intermittent network error we were experiencing.

As you mention, this is not an ideal way to have something set up for a production application, and yet this is a very low-use application, so I believe we're going to try it out in production.

I was never able to trigger a useful stack trace, in part because Flask doesn't surface its errors up to Gunicorn, and the network errors only happened when running the application under Gunicorn. So it was a bit of a rabbit hole trying to figure out how to work around that while also hoping for the error to surface again.

All that said, I hope it's useful to know that there appears to be some kind of intermittent network error being thrown by the Recurly client in this rather common configuration of Flask + Gunicorn.

Let me know if you have any questions. Now that we've got a workaround that I can deploy for this project, I'm happy to go back and re-test things in my development environment.

Many thanks for your help & have a great weekend, Phillip.

phillipadsmith avatar Apr 04 '20 19:04 phillipadsmith

@phillipadsmith I appreciate you digging so deeply into it. It does sound like there is some kind of issue with Gunicorn and the open connection. I'd like to keep this issue open until we fix it. When my team gets a free moment, we'll try to make a minimal working example and see if we can reproduce. If it does turn out that there is a change that people will need to make in their Gunicorn apps, then we'll document it.

bhelx avatar Apr 04 '20 20:04 bhelx

I encountered this same problem. In my case, the problem was using a single recurly.Client instance for multiple connections. Probably the recurly requests were stomping on each other. Using a recurly.Client instance per connection resolved the "Remote end closed connection without response" problems.

declaresub avatar Jun 11 '21 15:06 declaresub