
Implement HA solution for Azkaban Web Server

Open mukund-thakur opened this issue 9 years ago • 22 comments

Currently the Azkaban web server is a single point of failure (SPOF), which is a very serious problem. We should invest in using ZooKeeper to make the Azkaban web server work in active/standby mode.

mukund-thakur avatar Mar 23 '17 12:03 mukund-thakur

Hi and thank you for the feedback. Could you provide more information about your proposal? Is this to replace the backend relational database with a key-value store (ZooKeeper), or to use ZooKeeper as a distributed lock? If possible, please provide examples of where and how this has been implemented. Thanks.

li-afaris avatar Mar 23 '17 17:03 li-afaris

+1 on the HA web server setup. Currently I have our Azkaban implementation set up on 2 separate servers, with the MySQL database replicating from one to the other. I have installed the web server on both, but can only have the JVM up on the primary: if it is also up on the secondary, it will corrupt the database. It would be great if we could have multiple web server instances online with a load balancer in front.

devangorder avatar Mar 23 '17 20:03 devangorder

Yes, we would like to scale out web servers too.

We are actively researching ways to do that, e.g. using a message queue such as Kafka to hold the runnable flow queue instead of keeping this state in the web server.

Ideas are welcome.

HappyRay avatar Mar 24 '17 20:03 HappyRay

@HappyRay Is there a complete design document for Azkaban? It would really help us build a good design for making Azkaban HA.

mukund-thakur avatar Mar 27 '17 07:03 mukund-thakur

Hi Mukund - it is a good idea to have a design doc. As part of our next round of work we are going to work on two things that should directly answer this question:

  1. As Ray mentioned, we are working on scaling out the web server. As an initial step we will just move state out of the web server and provide a distributed scheduler. This way you can run multiple web servers, effectively removing the web server as a SPOF.
  2. We will also be working on improved documentation. Both should be ready and available in open source by mid July or so.

ameyamk avatar Mar 27 '17 15:03 ameyamk

I have two ideas to solve this.

IDEA 1: We put all the in-memory state of the Azkaban web server (runnableFlows etc.) in some data store (DS). When the active node goes down, an event is sent so that the passive node loads everything into memory from the DS. We can use the leader election algorithm provided by the ZooKeeper recipes to send the event to the passive node.

IDEA 2: We put all the in-memory state of the Azkaban web server (runnableFlows etc.) in some data store (DS). There will be multiple web servers that read from and write to this DS directly rather than using memory. Clients will connect to a load balancer that sits in front of all the web servers.

Choice of data store (DS): We can evaluate MySQL, Couchbase, and Kafka (as suggested by @HappyRay) and decide which one to use.
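
To make IDEA 2 concrete, here is a minimal sketch, not Azkaban code. It assumes a hypothetical `runnable_flows` table and uses Python's built-in `sqlite3` as a stand-in for whichever data store is chosen. The `WebServer` class and its method names are made up for illustration. The point is only that a server keeps no flow state of its own, so any instance behind the load balancer can answer any request.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open the shared data store (sqlite3 is only a stand-in for the real DS)."""
    conn = sqlite3.connect(path)
    # Hypothetical schema, invented for this sketch.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS runnable_flows ("
        " exec_id INTEGER PRIMARY KEY,"
        " flow_name TEXT NOT NULL,"
        " status TEXT NOT NULL)"
    )
    return conn


class WebServer:
    """A stateless web server: every read and write goes to the shared store."""

    def __init__(self, name, conn):
        self.name = name
        self.conn = conn  # no in-memory copy of the flow state is kept

    def submit_flow(self, flow_name):
        # The 'with' block commits the insert, or rolls it back on error.
        with self.conn:
            cur = self.conn.execute(
                "INSERT INTO runnable_flows (flow_name, status) "
                "VALUES (?, 'QUEUED')",
                (flow_name,),
            )
        return cur.lastrowid

    def flow_status(self, exec_id):
        row = self.conn.execute(
            "SELECT status FROM runnable_flows WHERE exec_id = ?",
            (exec_id,),
        ).fetchone()
        return row[0] if row else None


# Two servers share one store. For brevity they share one connection here;
# in a real deployment each server would open its own.
conn = open_store()
server_a = WebServer("web-a", conn)
server_b = WebServer("web-b", conn)

exec_id = server_a.submit_flow("daily_report")
status_seen_by_b = server_b.flow_status(exec_id)
```

A flow submitted through `server_a` is visible through `server_b` because neither holds it in memory. IDEA 1 would need the same externalized state plus a leader election step on top.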

mukund-thakur avatar Mar 28 '17 14:03 mukund-thakur

Idea 2 is what we are implementing, with MySQL continuing to be the data store. We can replace it with something else if it proves to be a performance bottleneck.

This should be ready and in open source by mid July.

Does that sound good?

hreview avatar Mar 28 '17 16:03 hreview

This sounds great.

mukund-thakur avatar Mar 29 '17 19:03 mukund-thakur

When can this be released?

yxydde avatar May 03 '17 03:05 yxydde

Likely around August 2017. Roadmap here: https://github.com/azkaban/azkaban/wiki/Azkaban-4.x--Roadmap

ameyamk avatar May 03 '17 18:05 ameyamk

Hi. I have a question about the HA implementation. Do you plan to restart jobs that are in 'Running' status once HA is in place? Currently, running jobs move to the Failed state when the server is manually restarted after a crash. Also, suppose I have a periodic scheduled job that runs every minute and the current time is 7:00. My Azkaban server crashes and restarts at 7:10. The job instances between 7:00 and 7:10 will be missed. Do you plan to relaunch these instances as well after HA?
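
To pin down which instances are meant, here is a small sketch of the arithmetic. It assumes the 7:00 instance did run before the crash and that the schedule is a fixed period. The function name `missed_fire_times` is hypothetical and not part of Azkaban. Whether such instances should be relaunched is a policy question this sketch does not answer.

```python
from datetime import datetime, timedelta


def missed_fire_times(last_fired, restarted_at, period):
    """List the scheduled times after last_fired, up to and including restarted_at."""
    if period <= timedelta(0):
        raise ValueError("period must be positive")
    missed = []
    next_time = last_fired + period
    while next_time <= restarted_at:
        missed.append(next_time)
        next_time += period
    return missed


# The scenario from the comment above: a job every minute, crash after the
# 7:00 run, restart at 7:10. The date itself is arbitrary.
last_fired = datetime(2017, 5, 26, 7, 0)
restarted_at = datetime(2017, 5, 26, 7, 10)
missed = missed_fire_times(last_fired, restarted_at, timedelta(minutes=1))
```

Under these assumptions `missed` holds the ten instances from 7:01 through 7:10.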

goelrajat avatar May 26 '17 11:05 goelrajat

Would stickiness be a factor that needs to be taken into account as well? I have a user here who is deployed behind an AWS ELB and as a result loses sessions (the client IP changes in the headers). Could X-Forwarded-For be a workaround?
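
As a rough illustration of the X-Forwarded-For idea, the sketch below picks the client address from the header when the server sits behind a known number of trusted proxies. The function `client_ip` and its `trusted_proxies` parameter are hypothetical. Whether Azkaban's session check can be pointed at this header is an assumption this thread does not confirm.

```python
def client_ip(remote_addr, x_forwarded_for, trusted_proxies=1):
    """Return the address to treat as the client for session checks.

    Each proxy appends the address it received the request from, so with N
    trusted proxies in front, the N-th entry from the right is the first one
    that was not supplied by the client itself.
    """
    if trusted_proxies < 1 or not x_forwarded_for:
        return remote_addr
    hops = [hop.strip() for hop in x_forwarded_for.split(",") if hop.strip()]
    if len(hops) < trusted_proxies:
        # Header is shorter than expected; fall back to the socket address.
        return remote_addr
    return hops[-trusted_proxies]


# One load balancer in front; 10.0.0.5 is the balancer's own address.
direct = client_ip("10.0.0.5", None)
behind_lb = client_ip("10.0.0.5", "203.0.113.7")
spoof_attempt = client_ip("10.0.0.5", "1.2.3.4, 203.0.113.7")
```

Taking the entry from the right rather than the left matters because entries to the left of the trusted proxies can be set by the client.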

sellers avatar Feb 20 '18 18:02 sellers

Any news here? The roadmap says Azkaban 4.0 should have HA web servers and should have been released in Q2 2017, but so far only 3.xx versions are available.

steverding avatar Jun 18 '18 13:06 steverding

I am looking forward to the HA to be released

happyapple668 avatar Sep 12 '18 11:09 happyapple668

I am looking forward to the HA to be released

gao634209276 avatar Dec 26 '18 08:12 gao634209276

We'll definitely prioritize the web server HA work. The first step, removing the cache on the web server, is already implemented but not enabled yet. Once the web server becomes stateless, we can proceed with the next step of bringing up multiple web servers. This needs to be carefully designed and tested though. Thanks for your patience, and please allow us more time.

jamiesjc avatar Jan 02 '19 20:01 jamiesjc

Any update here? What is the expected time for the Azkaban HA release?

@jamiesjc @hreview

avi-0107 avatar Mar 13 '19 12:03 avi-0107

@ameyamk Any update? Please let us know.

praxnet avatar Dec 17 '19 07:12 praxnet

Any news, guys?

oonashvili avatar Jan 13 '20 08:01 oonashvili

The Azkaban web server is now our single point of failure, and HA is more than desirable. We tried to run 2 instances behind a VIP, and all jobs get duplicated :/
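
One general way to keep two active instances from launching the same scheduled run twice is to have each instance claim the run in the shared database before dispatching it, and let a uniqueness constraint pick the winner. The sketch below shows that technique only and is not how Azkaban behaves. The `fired_triggers` table and the `try_claim` function are hypothetical, and `sqlite3` stands in for the shared MySQL database.

```python
import sqlite3


def open_store():
    """Create the stand-in shared store with a hypothetical claims table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE fired_triggers ("
        " schedule_id INTEGER NOT NULL,"
        " fire_time TEXT NOT NULL,"
        " claimed_by TEXT NOT NULL,"
        " PRIMARY KEY (schedule_id, fire_time))"
    )
    return conn


def try_claim(conn, schedule_id, fire_time, instance):
    """Return True only for the first instance to claim this scheduled run."""
    try:
        # The 'with' block commits on success and rolls back on error.
        with conn:
            conn.execute(
                "INSERT INTO fired_triggers (schedule_id, fire_time, claimed_by) "
                "VALUES (?, ?, ?)",
                (schedule_id, fire_time, instance),
            )
        return True
    except sqlite3.IntegrityError:
        # Another instance already inserted the same (schedule_id, fire_time).
        return False


conn = open_store()
# Both instances wake up for the same 14:00 run of schedule 1.
claim_a = try_claim(conn, 1, "2020-01-30T14:00", "web-a")
claim_b = try_claim(conn, 1, "2020-01-30T14:00", "web-b")
# A different fire time is a separate run and can be claimed again.
claim_next = try_claim(conn, 1, "2020-01-30T14:01", "web-b")
```

Only the instance whose claim succeeds would go on to dispatch the job. The other sees the primary key conflict and skips that run.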

rafilkmp3 avatar Jan 30 '20 14:01 rafilkmp3

Any news?

lchqlchq avatar Apr 26 '21 09:04 lchqlchq

seems no news

sansna avatar Sep 10 '21 06:09 sansna