longhorn-manager
grpc-proxy: refactor proxy client
https://github.com/longhorn/longhorn/issues/3967
Test result #1283
Client-pool:
- Full - https://ci.longhorn.io/job/private/job/longhorn-tests-regression/1364/
- Full - https://ci.longhorn.io/job/private/job/longhorn-tests-regression/1368/
- Core - https://ci.longhorn.io/job/private/job/longhorn-tests-regression/1363/
- Core - https://ci.longhorn.io/job/private/job/longhorn-tests-regression/1367/
- Full - https://ci.longhorn.io/job/private/job/longhorn-tests-regression/1362/
- Full - https://ci.longhorn.io/job/private/job/longhorn-tests-regression/1366/
- Core - https://ci.longhorn.io/job/private/job/longhorn-tests-regression/1361/
- Core - https://ci.longhorn.io/job/private/job/longhorn-tests-regression/1365/
- In the previous design, we were distributing the gRPC calls across multiple instance manager pods
Not sure if you are referring to the gRPC process communication? The new proxy server is meant to replace the manager's engine binary calls. Communication with the gRPC process server stays the same.
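For context, here is a rough Go sketch of the two call paths being compared. The type and function names are hypothetical and do not reflect the exact longhorn-manager / longhorn-engine API; the point is only that the same operation moves from a binary invocation to a gRPC call against the proxy in the instance-manager pod.

```go
// Illustrative sketch only: these type and function names are hypothetical and
// do not reflect the exact longhorn-manager / longhorn-engine API.
package proxyexample

import "os/exec"

// Before the proxy: the manager shells out to the engine binary for each call.
func volumeInfoViaBinary(engineBinary, engineURL string) ([]byte, error) {
	// Something along the lines of `longhorn --url <engine-endpoint> info`.
	return exec.Command(engineBinary, "--url", engineURL, "info").Output()
}

// After the proxy: the manager keeps a gRPC connection to the proxy service in
// the instance-manager pod and issues the same operation as an RPC.
type proxyClient interface {
	VolumeGet(engineAddress string) (string, error) // hypothetical signature
}

func volumeInfoViaProxy(c proxyClient, engineAddress string) (string, error) {
	return c.VolumeGet(engineAddress)
}
```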
Will it be a problem for an HA distributed system when that single instance manager is not available?
I think this should be fine.
Assuming the IMs on a node are unavailable, for the corresponding longhorn manager pod (on the same node):
- First, the engine/replica instance starting/stopping is exactly the same as the previous version.
- Then for most of the engine calls, the longhorn manager pod has no way to handle them before or after the proxy since the IMs on this node are unavailable.
- As for some backup/file-related calls, the longhorn manager pods talked with the replicas directly before introducing the proxy, whereas after introducing the proxy the communication is done by the proxy server. I guess this part is the main cause of your question, right? However, when diving into the previous implementation, we can find that the longhorn-manager pod has to talk with the engine before sending requests to the replicas. In other words, if the IM (or engine IM) is unavailable, there is no way to handle these kinds of calls, before or after the proxy (see the sketch below).
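To make the last point concrete, here is a hedged sketch of that call chain. These are not the real longhorn-manager types, just an illustration of why the request must reach the engine inside the instance-manager pod first, regardless of which side of the proxy it starts on.

```go
// Hypothetical sketch of the call chain above; these are not the real
// longhorn-manager types, just an illustration.
package proxychain

import "errors"

// engineClient represents a connection to the engine process, which runs
// inside the (engine) instance-manager pod.
type engineClient interface {
	// The engine fans this request out to its replicas.
	BackupRestoreStatus() (map[string]string, error)
}

// dialEngine fails when the instance-manager pod hosting the engine is down.
func dialEngine(imAddress string) (engineClient, error) {
	// Real code would establish the gRPC/HTTP connection here.
	return nil, errors.New("instance manager unreachable: " + imAddress)
}

// Whether the longhorn manager performs this itself (old path) or asks the
// proxy server to do it (new path), the request still has to reach the engine
// first, so an unavailable IM blocks this class of calls in both designs.
func backupRestoreStatus(imAddress string) (map[string]string, error) {
	c, err := dialEngine(imAddress)
	if err != nil {
		return nil, err
	}
	return c.BackupRestoreStatus()
}
```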
I will keep this as a draft after discussing it with @c3y1huang, and revisit it after https://github.com/longhorn/longhorn/issues/4038.
Thanks @shuo-wu for the explanation! Looks like there is a problem with backup get/delete operations:
- Before the proxy, when the instance manager on the node goes down, the backup volume controller can still read and delete backups.
- After the proxy, this is not the case.
I am not sure if there are other cases
Will the backup CRs be taken over by other nodes in this case? If not, this is probably one issue we need to resolve.
From the segregation point of view, the new criterion is that any volume (replica-related) operations should pass through the proxy (engine) only.
What we should really expect is that the backup-related CRs will eventually be handled when the IM pod comes back, or, as @shuo-wu mentioned, transferred to another manager to handle. IIRC, the transfer only happens when the node is not ready or the manager pod on that node is unavailable.
Yeah, the backup CR will not be transferred to other nodes because the manager pod is still running.
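For illustration, a minimal sketch of that transfer rule as described here. This is not the actual backup-controller code, and the fields are assumptions made for the example.

```go
// Simplified sketch of the transfer rule discussed here; not the actual
// backup-controller code, and the fields are assumptions for illustration.
package ownership

// nodeState is a minimal stand-in for what a controller knows about the node
// currently owning a backup CR.
type nodeState struct {
	Ready            bool // node condition is Ready
	ManagerAvailable bool // longhorn-manager pod on the node is running
}

// shouldTakeOver returns true only when the current owner can no longer serve
// the CR, i.e. its node is not ready or its manager pod is unavailable. An
// unavailable instance-manager pod alone does not trigger a transfer, which is
// why the backup CR can get stuck in the scenario above.
func shouldTakeOver(owner nodeState) bool {
	return !owner.Ready || !owner.ManagerAvailable
}
```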
@shuo-wu will this be a concern for us?
YES. If the instance manager pods on some nodes are unavailable due to issues like CPU resource exhaustion or wrong tolerations, the backup CRs on the corresponding nodes cannot be refreshed or removed. I am wondering whether we really need to ask the proxy to handle all kinds of backup commands. Some APIs like BackupList or BackupDelete are totally unrelated to the StorageNetwork or engine/replica processes and could be handled by the longhorn managers directly.
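As a rough illustration of that suggestion, metadata-only backup operations could stay in the manager and talk to the backup target directly. The `backupTarget` interface below is hypothetical, not the real backupstore API.

```go
// Hedged sketch of that idea: handle metadata-only backup commands in the
// longhorn manager against the backup target itself, without the proxy/engine.
// The backupTarget interface is hypothetical, not the real backupstore API.
package backupdirect

// backupTarget abstracts a remote backup store (S3, NFS, ...).
type backupTarget interface {
	List(volumeName string) ([]string, error)
	Delete(backupURL string) error
}

// Listing and deleting backups only touch backup-store metadata, so they do
// not depend on the StorageNetwork or on any engine/replica process.
func listBackups(t backupTarget, volumeName string) ([]string, error) {
	return t.List(volumeName)
}

func deleteBackup(t backupTarget, backupURL string) error {
	return t.Delete(backupURL)
}
```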
Closing this first, as we currently have no requirement for using a proxy client pool.