AgentBaker

/opt/azure/containers/provision.sh needs to respect DNS failover mechanism

Open alexxiongxiong opened this issue 2 years ago • 1 comments

Hi team, I am testing DNS failover during AKS creation. I set an unreachable IP as the primary DNS server for the AKS cluster; the secondary DNS server is reachable.

The CSE on the VMSS then failed to provision.

Error messages: Enable failed: failed to execute command: command terminated with exit status=124 [stdout] { "ExitCode": "124", "Output": "ookup deaaks901-deaaks-c4ab-hu74r11b.hcp.eastasia.azmk8s.io\n++ '[' 60 -eq 100 ']'\n++ sleep 1\n++ for i in $(seq 1 $retries)\n++ timeout 10 nslookup deaaks901-deaaks-c4ab-hu74r11b.hcp.eastasia.azmk8s.io\n++ '[' 61 -eq 100 ']'\n++ sleep 1\n++ for i in $(seq 1 $retries)\n++ timeout 10 nslookup deaaks901-deaaks-c4ab-hu74r11b.hcp.eastasia.azmk8s.io\n++'[' 79 -eq 100 ']'\n++ sleep 1\n++ for i in $(seq 1 $retries)\n++ timeout 10 nslookup deaaks901-deaaks-c4ab-hu74r11b.hcp.eastasia.azmk8s.io\n++ 

After some investigation, we found the issue may be related to the script /opt/azure/containers/provision.sh.

It seems that the timeout setting in this script doesn't respect the DNS failover mechanism: based on our tests, nslookup needs 15 seconds to complete the failover, but the script gives each nslookup attempt only 10 seconds (timeout 10 in the trace above).
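To make the interaction concrete, here is a minimal sketch of a retry loop with a per-attempt timeout, modelled on the `for i in $(seq 1 $retries)` / `timeout 10 nslookup ...` / `sleep 1` pattern visible in the trace. The function name `retry_with_timeout` and its argument order are assumptions for illustration, not the actual helper in provision.sh.

```shell
# Hypothetical helper (name and argument order are assumptions),
# modelled on the loop visible in the error trace.
# Usage: retry_with_timeout <retries> <sleep_between> <per_try_timeout> <cmd...>
retry_with_timeout() {
    local retries=$1
    local wait_sleep=$2
    local per_try_timeout=$3
    shift 3
    local i
    for i in $(seq 1 "$retries"); do
        # GNU timeout kills the command after per_try_timeout seconds
        # and then exits with status 124.
        if timeout "$per_try_timeout" "$@"; then
            return 0
        fi
        # Give up after the last attempt instead of sleeping again.
        if [ "$i" -eq "$retries" ]; then
            return 1
        fi
        sleep "$wait_sleep"
    done
    return 1
}
```

With a call such as `retry_with_timeout 100 1 10 nslookup "$API_SERVER_NAME"` (where `API_SERVER_NAME` is a placeholder variable), every attempt would be killed at 10 seconds if the resolver needs 15 seconds to fall back to the secondary server, so no number of retries helps. Raising the per-attempt value to 20, as requested in this issue, would let a single attempt run long enough to reach the secondary server.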

test command: time nslookup google.com
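For anyone who wants to repeat the measurement in a script rather than with `time`, a small sketch follows. The helper name `elapsed_seconds` is an assumption for illustration; it reports whole seconds only and discards the command's output and exit status.

```shell
# Hypothetical helper: print how many whole seconds a command took.
# Output and exit status of the command itself are discarded.
elapsed_seconds() {
    local start end
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo $(( end - start ))
}
```

For example, `elapsed_seconds nslookup google.com` on a node whose primary DNS server is unreachable should print roughly the failover time, which can then be compared against the per-attempt timeout used by the provisioning script.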

Could you please extend the nslookup timeout in this script to 20 seconds? It would be a useful enhancement for this project. Thank you!

alexxiongxiong May 10 '23 09:05

we adjusted the params - https://github.com/Azure/AgentBaker/blob/b8a0752b678aef5c7d2c472f96659294cf7a730b/parts/linux/cloud-init/artifacts/cse_main.sh#L320

does that work for you?

alexeldeib Jul 12 '23 12:07

I believe this was fixed, feel free to re-open if not

cameronmeissner Jun 28 '24 19:06