Adding null check for VTEPInfo and IPaddress
Description
Missing both IPv4 and IPv6 VTEP information panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation] github.com/projectcalico/calico/felix/calc.l3rrNodeInfo.AddressAsCIDR
/go/src/github.com/projectcalico/calico/felix/calc/l3_route_resolver.go:166
Todos
- [ ] Tests
- [ ] Documentation
- [ ] Release note
Release Note
Adding nil checks
Reminder for the reviewer
Make sure that this PR has the correct labels and milestone set.
Every PR needs one docs-* label.
docs-pr-required: This change requires a change to the documentation that has not been completed yet.docs-completed: This change has all necessary documentation completed.docs-not-required: This change has no user-facing impact and requires no docs.
Every PR needs one release-note-* label.
release-note-required: This PR has user-facing changes. Most PRs should have this label.release-note-not-required: This PR has no user-facing changes.
Other optional labels:
cherry-pick-candidate: This PR should be cherry-picked to an earlier release. For bug fixes only.needs-operator-pr: This PR is related to install and requires a corresponding change to the operator.
@ajgupta42 Please could you add more information:
- the stack trace of the panic that you saw
- any ideas about why this is happening in your cluster - bearing in mind that we've missed this for some reason in our existing testing.
@nelljerram It was failing on AsCIDR, so added some nil checks to prevent seg fault https://github.com/projectcalico/calico/pull/11028/files#diff-dfd7a82ed33eb6795753c3c8c92bddf7470585efe073b284384c4a11a557e8f8R171
IP address was blank for a node which caused this issue.
Thanks @ajgupta42 but I don't fully understand yet.
- Can you provide the full stack trace? That will help me understand the scenario overall.
- What exactly do you mean by "IP address was blank for a node"?
Thanks @nelljerram for the quick response, highly appreciated. Please find the full stack trace
goroutine 582 [runningl:
github.com/projectcalico/calico/felix/calc.13rrNodelnfo.AddressesAsCIDRs(f{0xa, Oxa, Ox20, 0x18}, {(0xa, Oxa, 0x20, 0x0}, Ox1a}, (ОxО,...])...)
/go/src/github.com/projectcalico/calico/felix/calc/13_route_resolver.go:166 +0x309
github.com/projectcalico/calico/felix/calc.(*L3RouteResolver).onNodeUpdate(0xc00103e990, {0xc000bd352d, Oxf), Oxc001319978)
/go/src/github.com/projectcalico/calico/felix/calc/13_route_resolver.go:578 +0xe25
github.com/projectcalico/calico/felix/calc.(*L3RouteResolver).OnResourceUpdate(0xc00103e990, ii(0x509faa0, 0xc001410630}, (0x488d780, 0xc000a036c0}, {0xc001164900,
0x9}, Ox0, OxO}, Ox1})
/go/src/github.com/projectcalico/calico/felix/calc/13_route_resolver.go:442 +0xae5
github.com/projectcalico/calico/felix/dispatcher.updateHandlers.DispatchToAlI(..)
/go/src/github.com/projectcalico/calico/felix/dispatcher/dispatcher.go:45
github.com/projectcalico/calico/felix/dispatcher.(*Dispatcher).OnUpdate(0xc000072f20, (l{0x509faaO, 0xc001410630}, {0x488d780,
Oxc000a036c0}, (0xc001164900, 0x9}, 0x0,
OxO}, 0x1))
/go/src/github.com/projectcalico/calico/felix/dispatcher/dispatcher.go:105 +0x395
github.com/projectcalico/calico/felix/dispatcher.(*Dispatcher).OnUpdates(...)
/go/src/github.com/projectcalico/calico/felix/dispatcher/dispatcher.go:80
github.com/projectcalico/calico/felix/calc. (*CalcGraph).OnUpdates(...)
/go/src/github.com/projectcalico/calico/felix/calc/calc_graph.go:137
github.com/projectcalico/calico/felix/calc.(*AsyncCalcGraph).loop(0xc0006a5500)
/go/src/github.com/projectcalico/calico/felix/calc/async_calc_graph.go:151 +0x75c
created by github.com/projectcalico/calico/felix/calc.(*AsyncCalcGraph).Start in goroutine 1 /go/src/github.com/projectcalico/calico/felix/calc/async_calc_graph.go:259 +0xf8
What exactly do you mean by "IP address was blank for a node"?
> This was an edge-case scenario. Due to a bug, the node’s ExternalIP became blank (it wasn’t set to <none>). We had to manually clean it up & set to <none>. I believe we can handle this with a nil check. Please let me know if I’m missing anything or if there are other edge cases we should cover.
@ajgupta42 Thank you for your last response. I'm afraid I'm still a little unclear about the scenario, though, and I also realized that we should have a test for it - which would be an excellent way of clarifying the scenario. It sounds like an FV test would be appropriate for this situation, so could you look at adding an FV test? Please let me know if you need more info about how our FV tests work.
@ajgupta42 You've added UTs - thank you - but unfortunately they don't help to explain the overall scenario, and that's why I suggested an FV test.
This PR is stale because it has been open for 60 days with no activity.