incubator-livy icon indicating copy to clipboard operation
incubator-livy copied to clipboard

[LIVY-11] Enable HA support

Open RogPodge opened this issue 5 years ago • 25 comments

What changes were proposed in this pull request?

This pull request enables Active/Passive HA in Livy through the use of Zookeeper. The CuratorElectorService class coordinates leadership election between multiple Livy instances

https://issues.apache.org/jira/browse/LIVY-11

How was this patch tested?

unit tests manual verification using 2 Livy instances

RogPodge avatar Aug 24 '19 00:08 RogPodge

Codecov Report

Merging #212 into master will decrease coverage by 0.47%. The diff coverage is 38.09%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master     #212      +/-   ##
============================================
- Coverage     68.12%   67.64%   -0.48%     
- Complexity      960      980      +20     
============================================
  Files           104      106       +2     
  Lines          5946     6068     +122     
  Branches        899      911      +12     
============================================
+ Hits           4051     4105      +54     
- Misses         1314     1378      +64     
- Partials        581      585       +4     
Impacted Files Coverage Δ Complexity Δ
...g/apache/livy/server/DomainRedirectionFilter.scala 0.00% <0.00%> (ø) 0.00 <0.00> (?)
...main/scala/org/apache/livy/server/LivyServer.scala 31.72% <29.41%> (-1.31%) 15.00 <5.00> (+4.00) :arrow_down:
...org/apache/livy/server/CuratorElectorService.scala 50.00% <50.00%> (ø) 10.00 <10.00> (?)
...cala/org/apache/livy/sessions/SessionManager.scala 83.33% <75.00%> (+1.51%) 28.00 <1.00> (+1.00)
...rver/src/main/scala/org/apache/livy/LivyConf.scala 96.22% <100.00%> (+0.09%) 21.00 <0.00> (ø)
...va/org/apache/livy/client/http/LivyConnection.java 82.27% <0.00%> (+0.22%) 15.00% <0.00%> (ø%)
...c/main/java/org/apache/livy/LivyClientBuilder.java 84.37% <0.00%> (+1.32%) 20.00% <0.00%> (+3.00%)
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update f4ab5ef...eb44f46. Read the comment docs.

codecov-io avatar Aug 27 '19 01:08 codecov-io

How does this pr related to pr #189 ?

yantzu avatar Aug 27 '19 03:08 yantzu

This pr tackles a different feature from the one in pr 189. We're mainly concerned with enabling active/standby HA in systems that can Livy Server instances on multiple nodes

RogPodge avatar Aug 27 '19 04:08 RogPodge

Do you need a proxy-like thing to find the latest active node?

OpenOpened avatar Aug 27 '19 08:08 OpenOpened

No its handled by apache curator. All the latest active node (the leader) is recorded in the Zookeeper State Store and is then known by all other servers who are coordinated via that state store

RogPodge avatar Aug 27 '19 18:08 RogPodge

Nice !

Tagar avatar Aug 27 '19 19:08 Tagar

How do the clients know the latest active node? Do they need to ask apache curator? If so, do we need to update the client code?

When people use a restful client(e.g. curl) to communicate with Livy, should they query zk every time?

yiheng avatar Aug 29 '19 04:08 yiheng

Another question is if the current leader crash and a new leader is elected, should the session state in the old leader be restored in the new leader?

yiheng avatar Aug 29 '19 06:08 yiheng

Another question is if the current leader crash and a new leader is elected, should the session state in the old leader be restored in the new leader?

The code does not do the state synchronization related to the Session State. Is it a good choice to put it in zk?

OpenOpened avatar Aug 29 '19 06:08 OpenOpened

How do the clients know the latest active node? Do they need to ask apache curator? If so, do we need to update the client code?

When people use a restful client(e.g. curl) to communicate with Livy, should they query zk every time?

#189 implements Livy Server discovery to provide the ability to get current Livy Server address by client to use it in REST requests. There is Java/Scala API for it. If someone uses not Java world client then he should ask ZK to get the address.

o-shevchenko avatar Aug 29 '19 11:08 o-shevchenko

When HA mode is enabled, the following happens.

Once new Livy Server starts, it uses Curator to connect to the Zookeeper state store to try to acquire leadership. The process will wait until it is able to acquire leadership. The leader latch library will automatically ensure that only 1 Livy Server among those that are connect to that state store will be elected Leader at any point.

Once it acquires leadership, it starts the Livy Web Server in the same way that it's currently handled, recovering session state if it is configured and all other associated configurations.

The client ideally doesn't need to do anything with respect to contacting the zookeeper server, as ideally you have a load balancer of some sort failover controller to eventually link the client to the actively running Livy instance. This solution is primarily concerned with ensuring that given a group of Livy Servers deployed in HA mode, exactly one of them will be active and running at any one time.

RogPodge avatar Aug 29 '19 19:08 RogPodge

Another question is if the current leader crash and a new leader is elected, should the session state in the old leader be restored in the new leader?

The code does not do the state synchronization related to the Session State. Is it a good choice to put it in zk?

This is already implemented previously. You can set livy.server.recovery.state-store to "zookeeper" and livy.server.recovery.state-store.url to zookeeper quorum. This PR mainly concerns the leader election part for the active-passive Livy Server.

shanyu avatar Sep 04 '19 17:09 shanyu

How do the clients know the latest active node? Do they need to ask apache curator? If so, do we need to update the client code? When people use a restful client(e.g. curl) to communicate with Livy, should they query zk every time?

#189 implements Livy Server discovery to provide the ability to get current Livy Server address by client to use it in REST requests. There is Java/Scala API for it. If someone uses not Java world client then he should ask ZK to get the address.

Another solution is we can put all active/standby services behind a load balancer. The client only needs to communicate with that load balancer. Then there's no need to change the client-side code. The server switch should be transparent to the client.

yiheng avatar Sep 19 '19 12:09 yiheng

Can someone help get this feature approved and checked in?

@jerryshao

RogPodge avatar Sep 23 '19 21:09 RogPodge

How do you do state transfer from old server to newly elected livy server? How do you do server discovery? How do you do session recovery? I was wondering such a simple solution is good enough. Also there's no UT and IT for the code, I don't think it is ready to merge.

jerryshao avatar Sep 24 '19 02:09 jerryshao

State transfer and session recovery are handled by the pre-existing livy.server.recovery.mode and livy.server.recovery.state-store components, provided they are configured to a Zookeeper instance.

Server discovery is decentralized though, would it be better to have an explicit configuration? I feel like having that wouldn't give any user-side benefits, given that there is no external command line tool to check the status of Livy instances.

RogPodge avatar Sep 24 '19 18:09 RogPodge

Livy HA with redirection:

In this pull request we are adding Active/Standby High Availability (HA) to Livy Server. The goal of HA is to ensure that amoung a group of Livy Servers, only 1 is Active at any given time while the others are on Standby. Our changes are as following:

Leveraging Zookeeper to handle Leader Election: In order to achieve HA, we need a service to handle keeping track of the current leader. To do this, we created a CuratorElectorService class which will register the Livy Server instance into the Leader election group and keep track of the leader. This class leverages the Apache Curator LeaderLatch Library to keep track of the current leader. This class is also responsible for restarting the Livy Webserver when it transitions from Standby state to Active. This restart is where the Livy Webserver performs session recovery.

Redirecting to the active webserver: With the CuratorElectorService class, Livy Server can now determine whether it is Active or on Standby. If it is active, it proceeds as usual. However, if it is on standby, it will use the configuration in livy.conf to determine the endpoint to redirect the request to.

Creating configurations to enable HA: We added user configurations to enable HA. These include specifying whether or not HA is enabled, the endpoints of the Livy Server instances used for HA, and the zookeeper instances used for leader election.

Splitting Livy's start() function to init() and start(): Currently, Livy's start() function is overloaded to both initalize all of its components (the webserver, its servlets, and session managers) and then safely start them. This PR separates the start() function into an init() component and a start() components. This is to allow us to restart the session managers and trigger session recovery without having to safely tear down and rebuild the Livy Webserver every time.

@jerryshao can I get some eyes for a code review/suggestions?

RogPodge avatar Mar 11 '20 00:03 RogPodge

@jerryshao small ping for a review

RogPodge avatar Mar 13 '20 22:03 RogPodge

@jerryshao Another ping

RogPodge avatar Mar 16 '20 18:03 RogPodge

@jerryshao pinging again

RogPodge avatar Mar 18 '20 23:03 RogPodge

@jerryshao another ping

RogPodge avatar Mar 23 '20 17:03 RogPodge

@jerryshao

RogPodge avatar Mar 27 '20 18:03 RogPodge

@jerryshao

RogPodge avatar Apr 05 '20 22:04 RogPodge

@jerryshao

RogPodge avatar Apr 27 '20 23:04 RogPodge

Taking over the pull request with @RogPodge 's permission. A bug involving query omission on failover was fixed, along with a bit of minor refactoring and renaming.

I'm not too familiar with the code review structure of this repo, but can someone take a look at the PR and its changes? (Pinging @jerryshao based on prior messages as well)

jameschen1519 avatar Jul 23 '20 10:07 jameschen1519