Avoid frequently TCP reconnecting
I've noticed that when using the pyrax to do bulk uploads, it keeps closing and creating TCP connections to the same host, this is very inefficient, and should be avoid.
The solution for this issue is using a connection pool to manage HTTP(TCP) connections, because the requests library has implemented the connection pool (which is in the requests.Session module), I'm changing the pyrax.http module to use it.
For the concern of thread-safe, I'm using a thread local storage to store the sessions for each individual thread, so there will be no interference between multiple threads sharing one pyrax instance.
Coverage increased (+0.0%) to 96.72% when pulling 6e05325c08c62f7bcf43ca881c24b6cf8d2e5784 on micro-alex:master into 70d5c10feceebd9be3991858b047f6587bef5e5e on rackspace:master.
I'm not sure that this is the correct approach. Since pyrax can work within contexts, I think that a session would need to be created for the individual contexts, rather than at a global level.
@sivel I'll take a look tomorrow, especially focus on does share connections between context do any harms.
I've closely looked the code, seems there are a few opetions:
1, global sessions for each thread in pyrax.http
This is the implemented one in this PR
2, add session to the BaseClient class
In this case, whether user is using the context, the connection pool will working properly as long as the user doesn't create new client object each time they doing something with pyrax. However, if the user has large amount of *Client instance, this will cause the program keep large amount TCP connections. Also, this will increase the cost of creating / destroying any *Client objects.
3, add session to the BaseIdentity class
Users will call ctx.get_client() function to get the actually *Client class, so the session will pass to all *Client objects created from the get_client() function, all of the *Client instance created form the same context will share the same session (connection pool), this maybe NOT thread-safe. See stackoverflow question of this.
Also, obviously, if an user doesn't use context, this will be not helpful at all, the user will still suffering from recreating connections.
@sivel Which solution do you think is better? Or, maybe you have better suggestions?
BTW, I think I've made a little mistake - this PR is against the master branch, not the working branch.
Would you be able to send me a generic example script that shows the operation(s) that would be benefitted by this change - basically, what were you running that caused you to see a need for this change? We're in the process of building out a replacement for pyrax and we're reasonably close to "done" in a number of areas*, and it's built from the ground up around sessions. I'd like to see that we're doing the right thing in that project, and perhaps have you look at it as well.
* It's an OpenStack project right now, but most of the key areas of it will already work verbatim for Rackspace such as Cloud Servers and Cloud Files. There's still a few other services we'll need Rackspace-specific support added to, but they're coming up soon.
@briancurtin here's a simple test script I was using for test, please open a tcpdump / tcptrack or other network monitoring tools when you running this script.
import threading
import random
import pyrax
def randomdata(length):
return ''.join([chr(random.randint(0, 255)) for i in range(length)])
def thread_func_inf_obj_handle(thread_num, cf):
counter = 0
while True:
filename = 'TEST_%d_%d.test' % (thread_num, counter)
counter += 1
content = randomdata(random.randint(1000, 100000))
cf.store_object(filename, content, return_none=True)
def main():
pyrax.settings.set('identity_type', 'rackspace')
pyrax.set_credential_file(MY_CREDENTIAL_FILE)
cloudfiles = pyrax.cloudfiles
cloudfiles.timeout = 120
cf = cloudfiles.get_container('test')
threads = []
for i in range(10):
threads.append(threading.Thread(
target=thread_func_inf_obj_handle,
args=(i, cf)))
threads[-1].start()
for t in threads:
t.join()
if __name__ == '__main__':
main()
In our production environment, the filename and randomdata() is replaced by real data, and the while True is replaced by a for loop to iterate some data.
Just in case you need to know the credential file:
[rackspace_cloud]
username = xxx
api_key = xxx
region = LON
identity_type = rackspace
I have been waiting for a month now, I'm hoping someone can review this PR, then decide to merge it, close it or give me some better ideas.
I'll leave this one to sivel to decide if he wants to merge it - he's on vacation for a while so that may take some time. This looks maybe ok, but I'd want to be sure from him if it has any implications when used with Ansible. Pyrax being lacking in functional tests and having fragile unit tests means a change like this is fairly risky.
As I mentioned before, we're working on the replacement for pyrax and so are not actively developing this project. We expect to have an announcement soon on what the official plans are for deprecation, release planning, etc.
Thanks, I'm looking forward.
@briancurtin Hi, has @sivel come back for the vacation?
Any news about this pull request? Almost two years gone already...
We are no longer supporting Pyrax.