ovn-kubernetes icon indicating copy to clipboard operation
ovn-kubernetes copied to clipboard

Load balancer failed to be created

Open cathy-zhou opened this issue 10 months ago • 12 comments

Load balancer failed to be created because it took long time to initialize nodeTracker in case there were lots of nodes. As the result, service controller handler go routine timed out after 20 seconds and returned error.

This commit increases HandlerSyncTimeout to 3 minutes, and it would be a fatal error if service controller fails to start.

- What this PR does and why is it needed

- Special notes for reviewers

- How to verify it

- Description for the changelog

cathy-zhou avatar Apr 10 '24 23:04 cathy-zhou

Deploy Preview for subtle-torrone-bb0c84 ready!

Name Link
Latest commit cb208b23e5c6ff728d95267f81afc3a2e2579cbf
Latest deploy log https://app.netlify.com/sites/subtle-torrone-bb0c84/deploys/66171f0497247c000872c8d3
Deploy Preview https://deploy-preview-4270--subtle-torrone-bb0c84.netlify.app
Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

netlify[bot] avatar Apr 10 '24 23:04 netlify[bot]

/lgtm Let's rerun CI though /retest-failed (might not work)

ricky-rav avatar Apr 16 '24 19:04 ricky-rav

Oops, something went wrong:

Must have admin rights to Repository.

ovn-robot avatar Apr 16 '24 20:04 ovn-robot

I'm OK with bumping the timeout, but would it be more prudent to run pprof in your setup and find out what is taking so long?

@winsopc has kindly helped me to profile and found out that svcCreateOrUpdateTemplateVar() funciton is the bottleneck. Note that if we disable svc template, nodeTracker initialization of the same zone only took 2 seconds compared to 55 seconds if svc template is enabled. @trozet

cathy-zhou avatar May 02 '24 23:05 cathy-zhou

Coverage Status

coverage: 52.422%. remained the same when pulling d979321b5bbac43583a579bc2ffdfa7a78e9a680 on cathy-zhou:SDN-2653 into 12c5f08dff40bc0c74983b39a3e3786e5a18c7c8 on ovn-org:master.

coveralls avatar May 03 '24 00:05 coveralls

thanks @cathy-zhou @winsopc . Do either of you happen to have the pprof file or graph output you can share? I'm guessing the problem is this function:

func forEachNBTemplateInMaps(templateMaps []TemplateMap, callback func(nbTemplate *nbdb.ChassisTemplateVar) bool) {
	// First flatten the maps into *nbdb.ChassisTemplateVar records:
	flattened := ChassisTemplateVarMap{}
	for _, templateMap := range templateMaps {
		for _, template := range templateMap {
			for chassisName, templateValue := range template.Value {
				nbTemplate, found := flattened[chassisName]
				if !found {
					nbTemplate = &nbdb.ChassisTemplateVar{
						Chassis:   chassisName,
						Variables: map[string]string{},
					}
					flattened[chassisName] = nbTemplate
				}
				nbTemplate.Variables[template.Name] = templateValue
			}
		}
	}

	// Now walk the flattened records:
	for _, nbTemplate := range flattened {
		if !callback(nbTemplate) {
			break
		}
	}
}

@dceara FYI

trozet avatar May 03 '24 13:05 trozet

So the code @ricky-rav added skipped the service add stuff during RequestFullSync on initial start up, but it did not skip c.syncNodeInfos(nodeInfos): https://github.com/ovn-org/ovn-kubernetes/blob/master/go-controller/pkg/ovn/controller/services/services_controller.go#L527

This ends up calling svcCreateOrUpdateTemplateVar for all nodeInfos: https://github.com/ovn-org/ovn-kubernetes/blob/master/go-controller/pkg/ovn/controller/services/services_controller.go#L511

Which calls svcCreateOrUpdateTemplateVarOps, and then forEachNBTemplateInMaps.

I think the fix here is probably to skip svcCreateOrUpdateTemplateVar in syncNodeInfos, and just let the normal service Add take care of ensuring the load balancer templates are correct in OVN. @dceara wdyt?

In the meantime @cathy-zhou @winsopc could you try removing lines 506-514, from syncNodeInfos and see if it solves your issue?

trozet avatar May 03 '24 14:05 trozet

https://drive.google.com/file/d/1SxyNlqJ-93mDv4mFvVzJNZ0JP0h4eKNL/view?usp=sharing @trozet

winsopc avatar May 08 '24 18:05 winsopc

@winsopc thanks. I think that proves the theory. Can you or @cathy-zhou try removing those lines I mentioned and see how it goes?

trozet avatar May 09 '24 21:05 trozet

@trozet with the lines removed, nodeTracker initialization of the same zone took less than 2 seconds. but that is for experimental only, right? as it would fail to create template for nodeIPs, and nodePort service would fail.

cathy-zhou avatar May 09 '24 21:05 cathy-zhou

So the code @ricky-rav added skipped the service add stuff during RequestFullSync on initial start up, but it did not skip c.syncNodeInfos(nodeInfos): https://github.com/ovn-org/ovn-kubernetes/blob/master/go-controller/pkg/ovn/controller/services/services_controller.go#L527

This ends up calling svcCreateOrUpdateTemplateVar for all nodeInfos: https://github.com/ovn-org/ovn-kubernetes/blob/master/go-controller/pkg/ovn/controller/services/services_controller.go#L511

Which calls svcCreateOrUpdateTemplateVarOps, and then forEachNBTemplateInMaps.

I think the fix here is probably to skip svcCreateOrUpdateTemplateVar in syncNodeInfos, and just let the normal service Add take care of ensuring the load balancer templates are correct in OVN. @dceara wdyt?

@trozet Doesn't this create a problem when adding nodes to a live cluster?

We need to create the Chassis_Template_Vars mappings for the new node. Currently that happens because: node_tracked.updateNodeInfo() -> nt.resyncFn(nt.getZoneNodes()) -> SvcController.RequestFullSync()

We probably need to add code to do a partial sync, only for updated/added/deleted nodes. Instead of a full sync.

dceara avatar May 13 '24 14:05 dceara

good point @dceara. We still need new node add to work, but we can conditionalize the full sync on the startup lock. So I think you can do this @cathy-zhou :

+++ b/go-controller/pkg/ovn/controller/services/services_controller.go
@@ -503,14 +503,18 @@ func (c *Controller) syncNodeInfos(nodeInfos []nodeInfo) {
                }
        }
 
-       // Sync the nodeIP template values to the DB.
-       nodeIPTemplates := []TemplateMap{
-               c.nodeIPv4Templates.AsTemplateMap(),
-               c.nodeIPv6Templates.AsTemplateMap(),
-       }
-       if err := svcCreateOrUpdateTemplateVar(c.nbClient, nodeIPTemplates); err != nil {
-               klog.Errorf("Could not sync node IP templates")
-               return
+       c.startupDoneLock.RLock()
+       defer c.startupDoneLock.RUnlock()
+       if c.startupDone {
+               // Sync the nodeIP template values to the DB.
+               nodeIPTemplates := []TemplateMap{
+                       c.nodeIPv4Templates.AsTemplateMap(),
+                       c.nodeIPv6Templates.AsTemplateMap(),
+               }
+               if err := svcCreateOrUpdateTemplateVar(c.nbClient, nodeIPTemplates); err != nil {
+                       klog.Errorf("Could not sync node IP templates")
+                       return
+               }
        }
 }
 

trozet avatar May 13 '24 15:05 trozet

issue is no longer relevant because we now found several issues with service template enabled and downstream is now moving back to disable service template. Close this PR for now.

cathy-zhou avatar May 20 '24 21:05 cathy-zhou

I'll submit a new PR that includes the fix I mentioned.

trozet avatar May 21 '24 13:05 trozet