fix(layer): controller retries indefinitely when layer is in error
The 0.4 release introduces a regression in the retry behavior of TerraformLayer.
We're hitting the same problem with burrito 0.6.4-0.6.5. The controller logs show:
time="2025-04-28T09:49:08Z" level=info msg="run ops-dvs-oaas-machines-int-fr1-plan-nkscd has not reach retry limit, retrying..."
I also tried setting BURRITO_CONTROLLER_TERRAFORMMAXRETRIES=3, but it had no effect: pods keep being recreated forever. The failing terragrunt layer exits with rc=1.
Could we re-open this issue @corrieriluca? I'll also try to dig on my side, but if it already rings a bell for you, that could help 🙏
I did a bit of digging into the issue:
- The BURRITO_CONTROLLER_TERRAFORMMAXRETRIES parameter, which defines the maximum number of runner pod retries per TerraformRun object, is working.
The problem is that the layer controller indefinitely creates a new TerraformRun object once the maximum number of runner pods is reached, which leads to new pods being created.
The intended behavior was to apply an exponential backoff timer when creating subsequent TerraformRun objects after they end up in "failing" status, but I can't find its implementation.
We also have to take into account that controller-runtime now implements exponential backoff out of the box when a reconciliation loop returns an error; hence r.Config.Controller.Timers.OnError is no longer taken into account when a reconciliation loop returns ctrl.Result{RequeueAfter: r.Config.Controller.Timers.OnError}, err.