TRAFODION-2940 In HA env, one node lose network, when recover, trafci can't use

Open mashengchen opened this issue 7 years ago • 17 comments

When a node loses its network connection for a long time and the network then recovers, the ZooKeeper session expires. At that point, check whether the current dcsmaster is still the leader. If it is not, unbind this node's floating IP, reset dcsmaster to its initial state, and rerun the DCS master.
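
As a rough illustration of the sequence described above, here is a minimal Java sketch. Every name in it (SessionExpiryRecoverySketch, MasterNode, onSessionExpired, RecordingNode) is hypothetical and is not taken from the DCS code base. The description does not say what happens when the node is still the leader, so the sketch assumes it is simply left alone.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; none of these names come from the DCS code base. */
final class SessionExpiryRecoverySketch {

    /** Stand-in for the node-local operations named in the description. */
    interface MasterNode {
        boolean isCurrentLeader();   // assumed to be answered by a fresh leader lookup
        void unbindFloatingIp();
        void resetToInitialState();
        void restartMaster();
    }

    /** Called when the ZooKeeper session is reported as expired. */
    static void onSessionExpired(MasterNode node) {
        if (node.isCurrentLeader()) {
            // Assumption: a node that is still the leader keeps its floating IP.
            return;
        }
        // Not the leader any more: release the IP first so only one node holds it.
        node.unbindFloatingIp();
        node.resetToInitialState();
        node.restartMaster();
    }

    /** Test double that records which operations ran, in order. */
    static final class RecordingNode implements MasterNode {
        final boolean leader;
        final List<String> calls = new ArrayList<>();

        RecordingNode(boolean leader) { this.leader = leader; }

        public boolean isCurrentLeader() { return leader; }
        public void unbindFloatingIp() { calls.add("unbindFloatingIp"); }
        public void resetToInitialState() { calls.add("resetToInitialState"); }
        public void restartMaster() { calls.add("restartMaster"); }
    }

    public static void main(String[] args) {
        RecordingNode staleMaster = new RecordingNode(false);
        onSessionExpired(staleMaster);
        System.out.println(staleMaster.calls);

        RecordingNode leader = new RecordingNode(true);
        onSessionExpired(leader);
        System.out.println(leader.calls);
    }
}
```

The ordering is the point of the sketch: the stale master gives up the floating IP before it resets and restarts, so it cannot rejoin while still holding an address that the new leader has already bound.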

mashengchen avatar Jan 31 '18 10:01 mashengchen

Check Test Started: https://jenkins.esgyn.com/job/Check-PR-master/2399/

Traf-Jenkins avatar Jan 31 '18 10:01 Traf-Jenkins

Test Passed. https://jenkins.esgyn.com/job/Check-PR-master/2399/

Traf-Jenkins avatar Jan 31 '18 13:01 Traf-Jenkins

I hope some DCS experts can review this. @mashengchen, please invite the appropriate DCS experts; I don't understand these changes well enough.

moscowgentalman avatar Feb 07 '18 03:02 moscowgentalman

@arvind-narain @kevinxu021 can you help take a look?

mashengchen avatar Feb 08 '18 03:02 mashengchen

Can you please describe the original issue in more detail in the JIRA? The scenario with 2 floating IPs is unclear to me. Let's hold off on the merge until we understand the original issue; that will help us understand the changes.

hegdean avatar Feb 09 '18 19:02 hegdean

  1. Configure the HA environment.
  2. Run sqstart.
  3. Use iptables to cut off the master node's network (iptables -I INPUT -s hostname -j DROP).
  4. Sleep for 300 seconds.
  5. The backup-master takes over the master role.
  6. Restore the network by deleting the rule (iptables -D INPUT -s hostname -j DROP).
  7. pdsh $MY_NODES ifconfig|grep 23400 returns 2 results: one from the old dcsmaster that went down and one from the backup-master. The old dcsmaster is still in its while loop when the network recovers, so there are 2 working dcsmasters.

mashengchen avatar Feb 11 '18 09:02 mashengchen

@hegdean, are you happy with this change now? Should I merge it?

DaveBirdsall avatar Feb 15 '18 00:02 DaveBirdsall

@svarnau yes, I have run the test, and it solved the duplicate IP problem.

mashengchen avatar Mar 06 '18 09:03 mashengchen

Check Test Started: https://jenkins.esgyn.com/job/Check-PR-master/2475/

Traf-Jenkins avatar Mar 13 '18 08:03 Traf-Jenkins

Test Passed. https://jenkins.esgyn.com/job/Check-PR-master/2475/

Traf-Jenkins avatar Mar 13 '18 13:03 Traf-Jenkins

Check Test Started: https://jenkins.esgyn.com/job/Check-PR-master/2691/

Traf-Jenkins avatar May 30 '18 06:05 Traf-Jenkins

Test Passed. https://jenkins.esgyn.com/job/Check-PR-master/2691/

Traf-Jenkins avatar May 30 '18 09:05 Traf-Jenkins

@arvind can u please review this

hegdean avatar May 31 '18 00:05 hegdean

I think the wrong Arvind was tagged :)

arvind avatar May 31 '18 01:05 arvind

@arvind-narain can u please review

hegdean avatar May 31 '18 01:05 hegdean

Check Test Started: https://jenkins.esgyn.com/job/Check-PR-master/2887/

Traf-Jenkins avatar Jul 27 '18 12:07 Traf-Jenkins

Test Failed. https://jenkins.esgyn.com/job/Check-PR-master/2887/

Traf-Jenkins avatar Jul 27 '18 16:07 Traf-Jenkins