terraform-provider-vsphere
Race condition in WaitForGuestNet?
Terraform Version
0.12.28
vSphere Provider Version
v1.18.3
Affected Resource(s)
- vsphere_virtual_machine
Terraform Configuration Files
N/A
Debug Output
This debug output was obtained by running a forked version with extra debug logging added. The only changes in the fork are the additional log statements, which you can see here: https://github.com/Itxaka/terraform-provider-vsphere/blob/v1.18.3.debug/vsphere/internal/helper/virtualmachine/virtual_machine_helper.go#L275
Output from a failed run: https://gist.github.com/Itxaka/007124e6f6be589a7279ed39cd8eb0dd
Panic Output
None
Expected Behavior
WaitForGuestNet behaves correctly when the gateway is obtained later than the IP address.
Actual Behavior
If the IP info arrives before the gateway info, the routability check never completes because of how the code is written. The ArrayOfGuestNicInfo branch obtains the IP info and tries to compare its mask to the gateway mask, but both gateways are nil because that info has not arrived yet. That stalls the whole WaitForGuestNet processing: the check only fires when there is a change, and there won't be any more changes to ArrayOfGuestNicInfo. It just stays in limbo after one failed mask comparison.
Let's go step by step.
- 1 - WaitForGuestNet is run and launches a client.PropertyCollector().wait()[0] in order to detect changes in the VM's guest.net and guest.ipStack properties, so as to detect IP/route changes.
- 2 - For each detected change, it goes into a switch statement[1], and the change is handled differently depending on whether it is an ArrayOfGuestStackInfo or an ArrayOfGuestNicInfo.
- 3A - For ArrayOfGuestStackInfo[2], we access the data at IpRouteConfig.IpRoute.Network to try to find either 0.0.0.0 or ::, and if it is found we parse IpRouteConfig.IpRoute.Network.Gateway.IpAddress to obtain the gateway address (ipv4/6)[3][4].
- 3B - For ArrayOfGuestNicInfo[5], we access the data at IpConfig.IpAddress to obtain the IP, and once we have it we get the mask for that IP[6] and try to make sure it is the same as the gateway mask[7].
- 4 - If the masks are the same, we return true, meaning that the IP is there AND routable; otherwise we keep waiting until it matches or we time out (5 minutes).
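To make the order dependence concrete, here is a minimal Go sketch of the flow described in the steps above. It is not the provider's code and does not use govmomi: the update type, the waitOrderDependent function and the /20 prefix length are assumptions made up for illustration, and a plain slice of updates stands in for the property collector. Only the IP and gateway values come from the log in this issue.

```go
package main

import (
	"fmt"
	"net"
)

// update is a hypothetical stand-in for one property change. Exactly one
// field is set: gateway for a guest.ipStack change, ipCIDR for a guest.net change.
type update struct {
	gateway string // default-route gateway, e.g. "10.164.80.1"
	ipCIDR  string // NIC address with an assumed prefix length, e.g. "10.164.93.152/20"
}

// waitOrderDependent models steps 1-4: the gateway is only recorded on a
// stack change, and the mask comparison only runs on a NIC change.
// Running out of updates stands in for the 5 minute timeout.
func waitOrderDependent(updates []update) bool {
	var v4gw net.IP // nil until a stack change delivers a gateway
	for _, u := range updates {
		switch {
		case u.gateway != "": // step 3A
			v4gw = net.ParseIP(u.gateway)
		case u.ipCIDR != "": // step 3B
			ip, ipNet, err := net.ParseCIDR(u.ipCIDR)
			if err != nil {
				continue
			}
			// Apply the IP's mask to both addresses and compare the results.
			// With v4gw still nil, Mask returns nil and Equal reports false.
			if ip.Mask(ipNet.Mask).Equal(v4gw.Mask(ipNet.Mask)) {
				return true // step 4
			}
		}
	}
	return false
}

func main() {
	nic := update{ipCIDR: "10.164.93.152/20"}
	stack := update{gateway: "10.164.80.1"}

	fmt.Println("stack then NIC:", waitOrderDependent([]update{stack, nic}))
	fmt.Println("NIC then stack:", waitOrderDependent([]update{nic, stack}))
}
```

In this model the same two updates succeed when the stack change comes first and fail when the NIC change comes first, because nothing re-runs the comparison once the gateway finally shows up.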
Now the problem, according to the logs above, is that step 3A does not obtain a gateway at first, so the values of v4gw and v6gw are empty. Step 3B then runs and fails because those values are empty, but it never retries... because why would it? client.PropertyCollector().wait() only fires when there is a change, and it reruns a given path only if the type of the change matches. The values that would trigger the check again do not change anymore, so that path is never executed again, leading to the timeout.
This is perfectly reflected in the log:
ArrayOfGuestNicInfo is run first but the ipv4/ipv6 gateways are empty:
2020-07-08T16:24:47.300+0200 [DEBUG] plugin.terraform-provider-vsphere_v1.18.3_x4: 2020/07/08 16:24:47 [DEBUG]["itxaka-master-14"] IP "10.164.93.152" checked against gateways: ipv4 -> "
ArrayOfGuestStackInfo is run afterwards and fills in the proper ipv4 gateway, but the check that should compare the masks has already run once and will not run again:
2020-07-08T16:25:17.058+0200 [DEBUG] plugin.terraform-provider-vsphere_v1.18.3_x4: 2020/07/08 16:25:17 [DEBUG]["itxaka-master-14"] Got ipv4 gateway: "10.164.80.1"
[0] hashicorp/terraform-provider-vsphere:vsphere/internal/helper/virtualmachine/[email protected]#L292
[1] hashicorp/terraform-provider-vsphere:vsphere/internal/helper/virtualmachine/[email protected]#L298
[2] hashicorp/terraform-provider-vsphere:vsphere/internal/helper/virtualmachine/[email protected]#L312
[3] hashicorp/terraform-provider-vsphere:vsphere/internal/helper/virtualmachine/[email protected]#L305
[4] hashicorp/terraform-provider-vsphere:vsphere/internal/helper/virtualmachine/[email protected]#L307
[5] hashicorp/terraform-provider-vsphere:vsphere/internal/helper/virtualmachine/[email protected]#L312
[6] hashicorp/terraform-provider-vsphere:vsphere/internal/helper/virtualmachine/[email protected]#L328
[7] hashicorp/terraform-provider-vsphere:vsphere/internal/helper/virtualmachine/[email protected]#L332
I'm guessing that the mask check should live outside of the wait, in a different wait that... waits for either of the gateway vars to be filled AND the IP values. Otherwise, checking values without knowing whether they are ever going to be filled is just asking for trouble :D
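One possible reading of that suggestion, sketched in Go under stated assumptions: remember the last-seen gateway and the last-seen IPs, and re-run the mask comparison after every change, whichever kind it is. The guestUpdate and guestNetState types and the /20 prefix length are hypothetical and only for illustration; this is not the provider's code, it does not use govmomi, and it handles only a single gateway.

```go
package main

import (
	"fmt"
	"net"
)

// guestUpdate is a hypothetical stand-in for one property change.
type guestUpdate struct {
	gateway string   // set when guest.ipStack changes
	ipCIDRs []string // set when guest.net changes, each entry "ip/prefix"
}

// guestNetState keeps the last-seen values across updates so that the
// check no longer depends on the order in which they arrive.
type guestNetState struct {
	gateway net.IP
	ipCIDRs []string
}

// apply records whatever the update carries and then re-evaluates
// routability against everything seen so far.
func (s *guestNetState) apply(u guestUpdate) bool {
	if u.gateway != "" {
		s.gateway = net.ParseIP(u.gateway)
	}
	if u.ipCIDRs != nil {
		s.ipCIDRs = u.ipCIDRs
	}
	return s.routable()
}

// routable reports whether any known IP shares its masked network with
// the known gateway. It is false until both pieces of data have arrived.
func (s *guestNetState) routable() bool {
	if s.gateway == nil {
		return false
	}
	for _, c := range s.ipCIDRs {
		ip, ipNet, err := net.ParseCIDR(c)
		if err != nil {
			continue
		}
		if ip.Mask(ipNet.Mask).Equal(s.gateway.Mask(ipNet.Mask)) {
			return true
		}
	}
	return false
}

func main() {
	s := &guestNetState{}
	// Same order as in the failed run: IP info first, gateway later.
	fmt.Println(s.apply(guestUpdate{ipCIDRs: []string{"10.164.93.152/20"}}))
	fmt.Println(s.apply(guestUpdate{gateway: "10.164.80.1"}))
}
```

With the state kept between updates, the late gateway update is enough to complete the check, so the wait would end on the second change instead of running into the timeout.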
Steps to Reproduce
We can usually trigger this when our vCenter is launching several instances at the same time. Basically, it happens whenever the gateway info arrives later than the IP info.
Important Factoids
Nope
References
- #0000
Community Note
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment