Terraform's `kubectl wait --for=condition=ready pods ...` hangs
Describe the bug
- According to this Google-internal Doc, Online Boutique's default Terraform (in /main/terraform) sometimes hangs.
- More specifically, it's the "kubectl wait" condition inside main.tf that hangs.
To Reproduce
- I have not reproduced the issue myself yet.
- But we can try to reproduce this issue by running `terraform apply` (as seen in /terraform/README.md) multiple times.
Logs
- N/A
Screenshots
- N/A
Environment
- TBD
Additional context
- tpryan@ might be able to provide additional context.
Exposure
- This would affect anyone using Terraform to deploy Online Boutique.
@tpryan Is my assumption correct: you witnessed this issue when Cloud Build tried running this line?
Yes. But it also occurred during manual tests, i.e., when calling terraform directly rather than through the test script.
It's intermittent. And I can't tell what variables are changing. I have a build job that runs it every night. So in theory I will get logs for you to sift through eventually. :)
The configuration makes `kubectl wait` wait forever. In that case, if the cluster does not have sufficient resources, the command will indeed run forever. I suggest changing this and checking the returned status. Note that the `--timeout` parameter defines the waiting time for each pod, so in theory the command can wait for timeout * number_of_pods.
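To make the suggestion above concrete, here is a minimal Python sketch, not the project's actual fix. It assumes the wait is driven by a wrapper script; the helper names `worst_case_wait_s` and `run_with_deadline` are hypothetical. It computes the timeout * number_of_pods bound and puts one overall deadline on the command, so the caller always gets a status back to check.

```python
import subprocess
import sys
from typing import List, Optional


def worst_case_wait_s(per_pod_timeout_s: float, number_of_pods: int) -> float:
    """Upper bound described in the comment: timeout * number_of_pods."""
    return per_pod_timeout_s * number_of_pods


def run_with_deadline(cmd: List[str], deadline_s: float) -> Optional[int]:
    """Run cmd and return its exit status, or None if the deadline expired.

    subprocess.run kills the child process when the timeout expires, so a
    stuck command cannot block the caller forever.
    """
    try:
        return subprocess.run(cmd, timeout=deadline_s).returncode
    except subprocess.TimeoutExpired:
        return None


if __name__ == "__main__":
    # Stand-in for a slow command (assumption: in practice this would be
    # the kubectl wait invocation from main.tf).
    slow_cmd = [sys.executable, "-c", "import time; time.sleep(5)"]
    status = run_with_deadline(slow_cmd, deadline_s=0.2)
    print("timed out" if status is None else f"exit status {status}")
```

A caller would treat `None` or any non-zero status as a failed deployment instead of waiting indefinitely.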
@NimJay I suggest closing this item, since there is no clear reproduction path. As an option, we can add a timeout argument to the relevant Terraform configurations so that tests do not hang forever.
Issue Disappeared
- I took a quick look at the last 100 "build histories" of the DeployStack tests.
- Each test attempts to run Online Boutique's Terraform. The Cloud Build Triggers are configured by /.deploystack/test.yaml and /.deploystack/test.
- 3 of the 100 builds failed, but those 3 failures are not related to the `kubectl wait` issues reported in this issue.
- It should be safe to close this issue now (as per @minherz's suggestion).