ovn-kubernetes
Load balancer failed to be created
The load balancer failed to be created because nodeTracker took a long time to initialize when there were many nodes. As a result, the service controller handler goroutine timed out after 20 seconds and returned an error.
This commit increases HandlerSyncTimeout to 3 minutes and makes it a fatal error if the service controller fails to start.
/lgtm
Let's rerun CI though
/retest-failed (might not work)
Oops, something went wrong: Must have admin rights to Repository.
I'm OK with bumping the timeout, but would it be more prudent to run pprof in your setup and find out what is taking so long?
@winsopc kindly helped me profile this and found that the svcCreateOrUpdateTemplateVar() function is the bottleneck. Note that with the service template disabled, nodeTracker initialization of the same zone took only 2 seconds, compared to 55 seconds with the service template enabled. @trozet
coverage: 52.422%. remained the same when pulling d979321b5bbac43583a579bc2ffdfa7a78e9a680 on cathy-zhou:SDN-2653 into 12c5f08dff40bc0c74983b39a3e3786e5a18c7c8 on ovn-org:master.
thanks @cathy-zhou @winsopc . Do either of you happen to have the pprof file or graph output you can share? I'm guessing the problem is this function:
func forEachNBTemplateInMaps(templateMaps []TemplateMap, callback func(nbTemplate *nbdb.ChassisTemplateVar) bool) {
	// First flatten the maps into *nbdb.ChassisTemplateVar records:
	flattened := ChassisTemplateVarMap{}
	for _, templateMap := range templateMaps {
		for _, template := range templateMap {
			for chassisName, templateValue := range template.Value {
				nbTemplate, found := flattened[chassisName]
				if !found {
					nbTemplate = &nbdb.ChassisTemplateVar{
						Chassis:   chassisName,
						Variables: map[string]string{},
					}
					flattened[chassisName] = nbTemplate
				}
				nbTemplate.Variables[template.Name] = templateValue
			}
		}
	}
	// Now walk the flattened records:
	for _, nbTemplate := range flattened {
		if !callback(nbTemplate) {
			break
		}
	}
}
@dceara FYI
So the code @ricky-rav added skipped the service add stuff during RequestFullSync on initial start up, but it did not skip c.syncNodeInfos(nodeInfos): https://github.com/ovn-org/ovn-kubernetes/blob/master/go-controller/pkg/ovn/controller/services/services_controller.go#L527
This ends up calling svcCreateOrUpdateTemplateVar for all nodeInfos: https://github.com/ovn-org/ovn-kubernetes/blob/master/go-controller/pkg/ovn/controller/services/services_controller.go#L511
Which calls svcCreateOrUpdateTemplateVarOps, and then forEachNBTemplateInMaps.
I think the fix here is probably to skip svcCreateOrUpdateTemplateVar in syncNodeInfos, and just let the normal service Add take care of ensuring the load balancer templates are correct in OVN. @dceara wdyt?
In the meantime @cathy-zhou @winsopc could you try removing lines 506-514, from syncNodeInfos and see if it solves your issue?
https://drive.google.com/file/d/1SxyNlqJ-93mDv4mFvVzJNZ0JP0h4eKNL/view?usp=sharing @trozet
@winsopc thanks. I think that proves the theory. Can you or @cathy-zhou try removing those lines I mentioned and see how it goes?
@trozet with those lines removed, nodeTracker initialization of the same zone took less than 2 seconds. But that is for experiment only, right? It would fail to create the templates for node IPs, and NodePort services would fail.
> I think the fix here is probably to skip svcCreateOrUpdateTemplateVar in syncNodeInfos, and just let the normal service Add take care of ensuring the load balancer templates are correct in OVN. [...]
@trozet Doesn't this create a problem when adding nodes to a live cluster?
We need to create the Chassis_Template_Vars mappings for the new node. Currently that happens because:
node_tracker.updateNodeInfo() -> nt.resyncFn(nt.getZoneNodes()) -> SvcController.RequestFullSync()
We probably need to add code to do a partial sync, only for updated/added/deleted nodes. Instead of a full sync.
Good point @dceara. We still need new node adds to work, but we can conditionalize the template sync on the startup state. So I think you can do this @cathy-zhou:
+++ b/go-controller/pkg/ovn/controller/services/services_controller.go
@@ -503,14 +503,18 @@ func (c *Controller) syncNodeInfos(nodeInfos []nodeInfo) {
 		}
 	}
 
-	// Sync the nodeIP template values to the DB.
-	nodeIPTemplates := []TemplateMap{
-		c.nodeIPv4Templates.AsTemplateMap(),
-		c.nodeIPv6Templates.AsTemplateMap(),
-	}
-	if err := svcCreateOrUpdateTemplateVar(c.nbClient, nodeIPTemplates); err != nil {
-		klog.Errorf("Could not sync node IP templates")
-		return
+	c.startupDoneLock.RLock()
+	defer c.startupDoneLock.RUnlock()
+	if c.startupDone {
+		// Sync the nodeIP template values to the DB.
+		nodeIPTemplates := []TemplateMap{
+			c.nodeIPv4Templates.AsTemplateMap(),
+			c.nodeIPv6Templates.AsTemplateMap(),
+		}
+		if err := svcCreateOrUpdateTemplateVar(c.nbClient, nodeIPTemplates); err != nil {
+			klog.Errorf("Could not sync node IP templates")
+			return
+		}
 	}
 }
This issue is no longer relevant: we have since found several problems with the service template enabled, and downstream is moving back to disabling the service template. Closing this PR for now.
I'll submit a new PR that includes the fix I mentioned.