[Feature] Optimize load balancing for DataProxy
Description
When the SDK produces data, it selects a group of nodes to send to; when some of those nodes have problems, they are removed and new candidate nodes are selected. The expected behavior is:
a. reduce the impact on production so that the client is unaware of node changes;
b. allow a server node to be selected again after it recovers;
c. keep the load balanced across nodes.
Use case
No response
Are you willing to submit PR?
- [ ] Yes, I am willing to submit a PR!
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
@dockerzhang This is a good idea, I would like to do this.
Motivation
Based on my reading of the source code, the DataProxy SDK currently selects DataProxy nodes by round-robin polling (when sending messages in TCP mode) and by random selection (when sending messages in HTTP mode). The round-robin approach is not efficient enough, and random selection does not easily achieve load balance.
Changes
Use a consistent hashing algorithm instead of the original round-robin and random selection.
Mechanism Options
Consistent hash algorithm with a virtual node mechanism. Refer to the article for details on the algorithm.
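As a concrete illustration of the mechanism, below is a minimal Java sketch of a consistent-hash ring with virtual nodes. The class name `ConsistentHashRing`, the `virtualNum` parameter, and the MD5-based hash are assumptions for illustration, not existing InLong code.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.SortedMap;
import java.util.TreeMap;

/**
 * Minimal consistent-hash ring with virtual nodes (illustrative sketch only).
 */
public class ConsistentHashRing {

    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int virtualNum;

    public ConsistentHashRing(int virtualNum) {
        this.virtualNum = virtualNum;
    }

    /** Add a physical node by placing virtualNum replicas on the ring. */
    public synchronized void addNode(String node) {
        for (int i = 0; i < virtualNum; i++) {
            ring.put(hash(node + "#VN" + i), node);
        }
    }

    /** Remove a physical node together with all of its virtual replicas. */
    public synchronized void removeNode(String node) {
        for (int i = 0; i < virtualNum; i++) {
            ring.remove(hash(node + "#VN" + i));
        }
    }

    /** Pick the first node clockwise from the key's hash position. */
    public synchronized String select(String key) {
        if (ring.isEmpty()) {
            return null;
        }
        SortedMap<Long, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    /** MD5-based hash reduced to a long, to spread virtual nodes evenly on the ring. */
    private static long hash(String key) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(key.getBytes(StandardCharsets.UTF_8));
            return ((long) (digest[3] & 0xFF) << 24)
                    | ((long) (digest[2] & 0xFF) << 16)
                    | ((long) (digest[1] & 0xFF) << 8)
                    | (digest[0] & 0xFF);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Each physical node contributes `virtualNum` positions on the ring, so adding or removing a node only remaps the keys that fell on those positions, which is what keeps the impact on production small.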
Design
Based on my reading of the source code, the following functions and classes need to be modified:
- org.apache.inlong.sdk.dataproxy.network.ClientMgr.getClientByRoundRobin(): this function obtains a DataProxy node by round-robin polling (a possible hash-based replacement is sketched after this list).
- org.apache.inlong.sdk.dataproxy.http.InternalHttpSender.sendMessageWithHostInfo(List<String> bodies, String groupId, String streamId, long dt, long timeout, TimeUnit timeUnit): this function selects a DataProxy node by picking a HostInfo at random.
- The fields of the DataProxy node class need to be updated to carry virtual node information.
- The hash ring and virtual nodes need to be completed on the DataProxy side, and the strategy for acquiring DataProxy nodes on the SDK side must be updated at the same time.
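As referenced in the list above, here is a hedged sketch of how the SDK-side selection could use the ring instead of round-robin/random. The selector class, its method names, the virtual-node count of 100, and the `groupId + streamId` key are illustrative assumptions built on the `ConsistentHashRing` sketch above, not the actual ClientMgr or InternalHttpSender API.

```java
import java.util.List;

/**
 * Hypothetical replacement for round-robin / random selection on the SDK side.
 * Class and method names are illustrative only.
 */
public class HashBasedHostSelector {

    // Virtual-node count of 100 is an arbitrary example value.
    private volatile ConsistentHashRing ring = new ConsistentHashRing(100);

    /** Rebuild the ring whenever the set of connected DataProxy hosts changes. */
    public void updateHosts(List<String> connectedHosts) {
        ConsistentHashRing newRing = new ConsistentHashRing(100);
        for (String host : connectedHosts) {
            newRing.addNode(host);
        }
        ring = newRing; // swap the reference so callers never see a half-built ring
    }

    /**
     * Pick the DataProxy host for a message. Hashing a stable key such as
     * groupId + streamId keeps one stream pinned to one node while it is alive.
     */
    public String selectHost(String groupId, String streamId) {
        return ring.select(groupId + "#" + streamId);
    }
}
```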
In point 4, it is said that "The hash ring and virtual nodes need to be completed on the DataProxy side". Here, does "DataProxy node" mean the nodes that have already established connections, or all DataProxy nodes in a cluster? What's more, the SDK obtains DataProxy node information by interacting with the Manager. Could you describe the updating strategy of point 4 more specifically?
- The nodes joining the hash ring should be the nodes that have established connections; after all, it is impossible to send data to nodes without an established connection.
- When I looked at the source code before, I didn't notice that the SDK and DataProxy interact through the Manager; I thought they interacted directly. The hash ring should then be maintained by the end that keeps the node information, which leaves two cases:
  (1) The Manager maintains the node information. The update strategy is to notify the Manager when a DataProxy node changes; the Manager then removes the node from the hash ring and migrates its data according to the consistent hash algorithm, or redirects requests. During this process, node requests from the SDK are redirected to the new node, so the client is unaware of the change.
  (2) The DataProxy side maintains the node information. This is similar to (1); it only changes where the consistent hashing algorithm is implemented.
- Updating strategy of point 4:
  (1) When a DataProxy node changes, first generate virtual_num virtual nodes for it and add them to the table of real and virtual nodes.
  (2) Add all virtual nodes to the hash ring. Because the SDK-to-DataProxy path only writes and does not read (if there is no misunderstanding, it should be like this), a node request arriving from the SDK while the hash ring is being modified only needs to be redirected to the new node.
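To make this updating strategy concrete, here is a hedged sketch of how membership changes could be applied to the ring, reusing the `ConsistentHashRing` sketch above. The `NodeChangeHandler` class and its `onNodeUp`/`onNodeDown` callbacks are hypothetical; how the Manager or DataProxy side actually delivers such notifications is not specified here.

```java
/**
 * Hypothetical handler for DataProxy membership changes, following the strategy
 * described above. The callbacks and their trigger source are assumptions, not
 * existing InLong APIs.
 */
public class NodeChangeHandler {

    private final ConsistentHashRing ring;

    public NodeChangeHandler(int virtualNum) {
        this.ring = new ConsistentHashRing(virtualNum);
    }

    /** A node comes online: place its virtual_num virtual replicas on the ring. */
    public void onNodeUp(String node) {
        ring.addNode(node);
    }

    /**
     * A node goes offline: remove its virtual replicas. Since SDK -> DataProxy
     * traffic is write-only, there is nothing to re-read; later select(key)
     * calls simply land on the next node clockwise, so the client does not notice.
     */
    public void onNodeDown(String node) {
        ring.removeNode(node);
    }
}
```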