webhdfs
reuse HTTP connections
This change allows an HTTP connection to be reused across consecutive calls to the same host:port combination. Experience with this optimization on Altiscale's infrastructure indicates that it can improve application-level performance by a factor of 2.
The new reuse_connection class variable is set to false (disabling the new functionality) by default, so that existing clients can rely on the original semantics of the class. For example, some clients might depend on the ability to change class variables (e.g. open_timeout) between calls to the request method. Changing class variables between calls to request is not supported with the reuse_connection optimization.
The connection set-up code that was in the request method has been refactored into the private create_connection method, which is called by the private connection method. The reuse optimization is implemented by the private connection and reuse_connection_if_possible methods.
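A minimal sketch of how the methods described above might fit together. This is an illustration, not the code in this change: the class name ReusingClient and the method signatures are assumptions, and only the names reuse_connection, connection, create_connection, and reuse_connection_if_possible come from the description.

```ruby
require 'net/http'

# Hypothetical stand-in for the client class; not the actual WebHDFS client.
class ReusingClient
  # Off by default, so existing callers keep the original semantics.
  attr_accessor :reuse_connection

  def initialize(reuse_connection: false)
    @reuse_connection = reuse_connection
    @connection = nil
  end

  # Returns a Net::HTTP object for host:port, reusing the cached one when allowed.
  # A real client would call `start` on it before issuing requests, so that the
  # socket stays open between calls.
  def connection(host, port)
    reuse_connection_if_possible(host, port) || create_connection(host, port)
  end

  private

  # Returns the cached connection if it targets the same host:port, else nil.
  def reuse_connection_if_possible(host, port)
    return nil unless @reuse_connection && @connection
    return @connection if @connection.address == host && @connection.port == port

    # Different target: close and drop the cached connection.
    @connection.finish if @connection.started?
    @connection = nil
  end

  # Builds a new connection and caches it only when reuse is enabled.
  def create_connection(host, port)
    conn = Net::HTTP.new(host, port)
    @connection = conn if @reuse_connection
    conn
  end
end
```

With reuse enabled, two consecutive calls to connection for the same host and port return the same object; a call for a different host or port replaces the cached connection.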
I have some points:
- What happens when servers (NameNode, DataNode and/or HttpFs server / reverse proxies) close kept-alive connections?
- With WebHDFS (not HttpFs), clients send two requests to write data on HDFS: one to the NameNode and one to the DataNode
  - request to NameNode (finish connection to DataNode if it exists, and cache connection to NameNode)
  - request to DataNode (finish connection to NameNode, and cache connection to DataNode)
  - Is this intentional? It looks better to me to cache connections for each host-port pair.
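The two points above could be addressed along these lines. This is only a sketch under stated assumptions: ConnectionPool, with_connection, discard, and STALE_ERRORS are hypothetical names, not part of this change, and the choice of which errors indicate a closed kept-alive socket is a guess.

```ruby
require 'net/http'

# Hypothetical pool that keeps one connection per [host, port] pair, so a
# request to the DataNode does not evict the NameNode connection.
class ConnectionPool
  # Assumed set of errors raised when the server has closed a kept-alive socket.
  STALE_ERRORS = [EOFError, Errno::ECONNRESET, Errno::EPIPE].freeze

  def initialize
    @connections = {}
  end

  # Returns the cached connection for host:port, creating it on first use.
  def connection(host, port)
    @connections[[host, port]] ||= Net::HTTP.new(host, port)
  end

  def size
    @connections.size
  end

  # Yields a connection; if it turns out to be stale, drops it and retries
  # once with a fresh one. Retrying is only safe for idempotent requests.
  def with_connection(host, port)
    attempts = 0
    begin
      attempts += 1
      yield connection(host, port)
    rescue *STALE_ERRORS
      discard(host, port)
      retry if attempts < 2
      raise
    end
  end

  # Closes and forgets the cached connection for host:port, if any.
  def discard(host, port)
    conn = @connections.delete([host, port])
    conn.finish if conn && conn.started?
  end
end
```

With a pool like this, a write that goes first to the NameNode and then to the DataNode leaves both connections cached for the next call.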
Thanks for the comments, and sorry for the delay in my response! I thought about these issues when writing the code, and will write a detailed response.