Protobuf performance
At least for the large scans I've profiled, a lot of the time spent in the client goes to unmarshaling the pb response from HBase.
There may be room to switch to a faster pb library or to a pb C-binding to utilize other cores.
Some quick research shows that the protobuf compiler supports generating native code for Python. See http://yz.mit.edu/wp/fast-native-c-protocol-buffers-from-python/ for a description of the usage and https://developers.google.com/protocol-buffers/docs/reference/python-generated?hl=en#cpp_impl for Google's docs on it.
Note: it's an experimental feature as of July 10, 2015, and it would break compatibility with Python implementations other than CPython.
I vote we build our PBs with both the pure-Python and the C++ implementation, then switch between the two based on either config or an environment variable (we could possibly detect whether the user is running CPython or a platform that doesn't support C modules).
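A rough sketch of what that switch could look like, to make the proposal concrete. It relies on the `PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION` environment variable described in Google's docs linked above. The `PB_CLIENT_IMPL` override and both function names are made up for illustration and don't exist anywhere yet. It also assumes the C++ extension has actually been built and installed; this only picks a preference.

```python
import os
import platform

# Environment variable the protobuf Python runtime reads at import time.
PB_IMPL_ENV = "PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"
# Hypothetical override for this client; not an existing setting.
CLIENT_IMPL_ENV = "PB_CLIENT_IMPL"


def choose_pb_implementation(environ=None, interpreter=None):
    """Return "cpp" or "python" as the preferred protobuf backend."""
    environ = os.environ if environ is None else environ
    interpreter = interpreter or platform.python_implementation()
    override = environ.get(CLIENT_IMPL_ENV, "").strip().lower()
    if override in ("cpp", "python"):
        return override
    # C extension modules are only a safe bet on CPython.
    return "cpp" if interpreter == "CPython" else "python"


def configure_pb_implementation(environ=None, interpreter=None):
    """Set the protobuf variable; must run before importing google.protobuf."""
    environ = os.environ if environ is None else environ
    choice = choose_pb_implementation(environ, interpreter)
    # Leave the value alone if the user already exported one.
    environ.setdefault(PB_IMPL_ENV, choice)
    return environ[PB_IMPL_ENV]
```

The client would call `configure_pb_implementation()` once at the top of its package, before any generated `_pb2` module is imported, since the protobuf runtime only reads the variable at import time.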