Intelligent client connection load balancing

Open rymarm opened this issue 9 months ago • 3 comments

One Drillbit may end up with significantly more client connections than the others.

Here is an example of such a case: [Image: chart of client connections per Drillbit]

The chart shows that the highest number of connections per node is 21, while the lowest is only 3, a sevenfold difference.

Even though query execution considers the load on all Drillbits and distributes execution accordingly, overutilizing a single Drillbit for client connections can lead to issues such as heap memory exhaustion and excessive network bandwidth usage when returning query results to the client.

The current client implementation retrieves a list of active Drillbits from Zookeeper, randomly shuffles them:

java.util.Collections.shuffle(endpoints);

and attempts to connect to the first N nodes until a connection is successful. This approach was introduced as a simple way to prevent all clients from connecting to the same Drillbit: https://issues.apache.org/jira/browse/DRILL-2512. However, it does not fully solve the problem, as one node may still end up with significantly more connections than others.
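The shuffle-then-try behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the actual Drill client code: the class name, the `maxAttempts` parameter (standing in for "N") and the `tryConnect` callback are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

final class ShuffleThenConnect {
    /**
     * Shuffles the endpoints and tries up to maxAttempts of them in order.
     * tryConnect is a hypothetical callback returning true when the
     * connection attempt succeeds.
     */
    static <E> Optional<E> connectToFirstAvailable(
            List<E> endpoints, int maxAttempts, Predicate<E> tryConnect) {
        // Copy first so the caller's list is left untouched.
        List<E> candidates = new ArrayList<>(endpoints);
        Collections.shuffle(candidates);

        int attempts = Math.min(maxAttempts, candidates.size());
        for (int i = 0; i < attempts; i++) {
            E candidate = candidates.get(i);
            if (tryConnect.test(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }
}
```

Because each client shuffles independently and knows nothing about existing connections, the spread is only even on average, which is consistent with the imbalance shown in the chart.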

Solution

Every Drillbit stores its current number of client connections in Zookeeper. After retrieving the list of active Drillbits, the client calculates the average number of connections and reorders the list based on whether each Drillbit's connection count is above or below the average before attempting a connection.
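One possible reading of this reordering step, as a minimal sketch: Drillbits at or below the average come first, those above it come last. The issue does not say how the counts would be stored in or read from Zookeeper, so the `connectionCounts` map (address to connection count) is a hypothetical input, and shuffling within each group is an added assumption meant to keep the random spread of the existing approach.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Random;

final class AverageBasedOrdering {
    /**
     * Returns Drillbit addresses with those at or below the average
     * connection count first, followed by those above the average.
     * connectionCounts is a hypothetical map of address -> open connections.
     */
    static List<String> reorder(Map<String, Integer> connectionCounts, Random random) {
        double average = connectionCounts.values().stream()
                .mapToInt(Integer::intValue)
                .average()
                .orElse(0.0);

        List<String> atOrBelowAverage = new ArrayList<>();
        List<String> aboveAverage = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : connectionCounts.entrySet()) {
            if (entry.getValue() <= average) {
                atOrBelowAverage.add(entry.getKey());
            } else {
                aboveAverage.add(entry.getKey());
            }
        }

        // Assumption: randomise inside each group so clients do not all
        // pick the same lightly loaded Drillbit.
        Collections.shuffle(atOrBelowAverage, random);
        Collections.shuffle(aboveAverage, random);

        List<String> ordered = new ArrayList<>(atOrBelowAverage);
        ordered.addAll(aboveAverage);
        return ordered;
    }
}
```

With counts of 21, 3, 5 and 19 the average is 12, so the two Drillbits holding 3 and 5 connections would be tried before the other two.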

The key questions I see here are:

  • Is it appropriate to store information about the current number of user connections in Zookeeper?
  • Should client connection load balancing be configurable and enabled by default?
  • Is it beneficial to distribute all connections evenly across the entire Drill cluster?

rymarm avatar Feb 17 '25 10:02 rymarm

@jnturton @cgivre Hi! What do you think about the idea of balancing client connections more evenly across the Drill cluster? Do you think it could improve overall performance, or could it introduce other challenges? I'd love to hear your thoughts and opinions on how this might play out.

rymarm avatar Feb 17 '25 10:02 rymarm

@rymarm I think it is an interesting idea. I have a feeling it would be harder than it may seem. The big question I have is how do you plan to get the metadata needed? Does Zookeeper do any of this? What strategy would you use for distributing queries across nodes?

For instance, let's say you have a lot of short running queries. Would it matter if they are all going to a single Drill node? Also are we thinking about data locality? Meaning is Drill sending the query to the node closest to the data?

This is an area of Drill that I don't know very much about, but these are my stream of consciousness thoughts.

cgivre avatar Feb 28 '25 16:02 cgivre

@rymarm this looks like a great idea to me.

  • Is it appropriate to store information about the current number of user connections in Zookeeper?
  • Should client connection load balancing be configurable and enabled by default?
  • Is it beneficial to distribute all connections evenly across the entire Drill cluster?

My own answers to all three of these are yes.

For instance, let's say you have a lot of short running queries. Would it matter if they are all going to a single Drill node?

It's still possible to randomise over tied winners after sorting the list of Drillbits in order of open connection count (if there are ten Drillbits with zero open connections, choose one of those at random).
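A minimal sketch of that idea, assuming the client already has a map of Drillbit address to open connection count (a hypothetical input, since the thread does not define how the counts are published): shuffle first, then apply a stable sort by count, so Drillbits with equal counts stay in random order relative to each other.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Random;

final class LeastConnectionsOrdering {
    /**
     * Orders Drillbit addresses by ascending open connection count,
     * with ties left in random order.
     * connectionCounts is a hypothetical map of address -> open connections.
     */
    static List<String> order(Map<String, Integer> connectionCounts, Random random) {
        List<String> ordered = new ArrayList<>(connectionCounts.keySet());

        // Randomise first; List.sort is stable, so entries with equal
        // counts keep this random relative order after sorting.
        Collections.shuffle(ordered, random);
        ordered.sort(Comparator.comparingInt((String address) -> connectionCounts.get(address)));
        return ordered;
    }
}
```

The client would then try the addresses in the returned order, so many short-lived connections would be spread across all the least-loaded Drillbits rather than landing on one of them.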

Meaning is Drill sending the query to the node closest to the data?

For storage that supports this, it happens later on, during execution, no matter which Drillbit the client originally connected to.

jnturton avatar May 11 '25 02:05 jnturton