fix(layer): controller retries indefinitely when layer is in error

Open AlanLonguet opened this issue 1 year ago • 2 comments

The 0.4 release introduces a regression in the retry behavior of TerraformLayer.

AlanLonguet avatar Jun 03 '24 12:06 AlanLonguet

We're hitting this with burrito 0.6.4-0.6.5. The controller logs show:

time="2025-04-28T09:49:08Z" level=info msg="run ops-dvs-oaas-machines-int-fr1-plan-nkscd has not reach retry limit, retrying..."

I also tried setting BURRITO_CONTROLLER_TERRAFORMMAXRETRIES=3, but it had no effect: pods keep being recreated forever. The failing terragrunt layer exits with rc=1.

Could we re-open this issue @corrieriluca? I'll also try to dig on my side, but if it already rings a bell for you, that would help 🙏

michael-todorovic avatar Apr 28 '25 10:04 michael-todorovic

I did a bit of digging into the issue:

  • The BURRITO_CONTROLLER_TERRAFORMMAXRETRIES parameter, which defines the maximum number of runner pod retries per TerraformRun object, is working.

The problem is that the layer controller indefinitely creates a new TerraformRun object each time the maximum number of runner pods is reached, which leads to the creation of new pods.

The intended behavior was to apply an exponential backoff timer when creating subsequent TerraformRun objects once they end up in "failing" status, but I can't find its implementation.

We also have to take into account that controller-runtime now implements exponential backoff out of the box when a reconciliation loop returns an error. As a result, r.Config.Controller.Timers.OnError is no longer taken into account when a reconciliation loop returns ctrl.Result{RequeueAfter: r.Config.Controller.Timers.OnError}, err: the result is ignored whenever the error is non-nil.

LucasMrqes avatar Apr 28 '25 10:04 LucasMrqes