trafficcontrol icon indicating copy to clipboard operation
trafficcontrol copied to clipboard

Fix t3c timeouts and retries

Open rob05c opened this issue 2 years ago • 3 comments

Fixes bugs in t3c of both timeouts and retries to Traffic Ops. Config values of the Traffic Ops client weren't getting set correctly, resulting in no retries and timeouts far longer than configured.

This also adds info logging, which should make it easier to see if timeouts or retries somehow get the wrong value in the future.

Which Traffic Control components are affected by this PR?

  • Traffic Control Cache Config (t3c, formerly ORT)

What is the best way to verify this PR?

Run t3c, verify config is generated and applied as expected. Run t3c against a Traffic Ops somehow set up or configured to not respond or to initially respond with errors, verify failures are retried and timeouts are as configured. You can also partially verify, by observing the new timeout and retry info logs.

If this is a bugfix, which Traffic Control versions contained the bug?

Examples:

  • 6.0.0
  • 6.1.0
  • 7.0.0 (RC0)
  • 7.0.0 (RC1)
  • 7.0.0 (RC2)

PR submission checklist

  • ~[x] This PR has tests~ no tests, bugs are in the applications themselves configuring the library, not any library, so tests would require integration tests with an extensive ability to configure bad and slow replies in Traffic Ops, which don't exist and would be a large development effort.
  • ~[x] This PR has documentation~ no docs, no interface change
  • [x] This PR has a CHANGELOG.md entry
  • [x] This PR DOES NOT FIX A SERIOUS SECURITY VULNERABILITY (see the Apache Software Foundation's security guidelines for details)

rob05c avatar Aug 04 '22 20:08 rob05c

So, exponential backoff retrying is certainly better than non-exponential, but it's really important for there to be jitter in the sleep intervals as well. Otherwise, clients will all retry at the same time still. For example, if a particular sleep window is 1 minute long, the client should sleep some random about between 0 and 1 minute before retrying. This prevents all retries from happening right at that 1 minute mark.

Also, since we're running t3c every 1 - 2 minutes already without retries, I think it would make sense for the longest sleep window to be something like 4 or 8 minutes. We should also provide the ability to cap the window to some maximum like 10 minutes.

rawlinp avatar Aug 05 '22 18:08 rawlinp

Otherwise, clients will all retry at the same time still

t3c runs should already be dispersed. The ATC cache-config itself doesn't do that, users need to make their script do that, but I think the Ansible in the repo has an example, and it's not difficult.

Also, since we're running t3c every 1 - 2 minutes already without retries, I think it would make sense for the longest sleep window to be something like 4 or 8 minutes. We should also provide the ability to cap the window to some maximum like 10 minutes.

Users can run t3c as often as they want, the ATC cache-config doesn't really mandate or recommend a frequency. I think we generally suggest somewhere between 1-15mins, depending on the user's scale and TO resources. But I'm not sure even that recommendation is documented anywhere.

But you make a good point, the backoff should be configurable. I agree, but I was trying to keep this PR small, was planning to do that in a subsequent PR. But I can add it to this PR if you feel it's necessary.

rob05c avatar Aug 05 '22 18:08 rob05c

t3c runs should already be dispersed. The ATC cache-config itself doesn't do that, users need to make their script do that, but I think the Ansible in the repo has an example, and it's not difficult.

I know that the runs themselves are already dispersed, but I'm saying that in the situation where retries are necessary (e.g. TODB is overloaded), the existing dispersion is not necessarily going to be enough. By adding random jitter, it's like expanding the dispersion window dynamically, as opposed to having to edit the cron job frequency in the middle of an outage scenario.

For example, let's say the dispersion window is big enough to where only 100 out of 3000 clients total are making requests to TO at any given point in time, but they all encounter an error and need to retry. Since they all started at the same time, they're all going to retry at the same time (even though it is only 100 out of 3000), which is not ideal. If TODB couldn't handle the load of 100 concurrent requests for some reason, then it still wouldn't be able to handle these 100 concurrent retries.

However, by giving the clients random jitter within that sleep interval, as long as the maximum sleep interval is longer than the normal cron frequency, we'd actually be making the requests more dispersed in an outage situation, without having to change cron frequencies, which is what we want. That's why I think making the "default" retries include up to a final 4-8 minute sleep, along with random jitter, can really help us out.

rawlinp avatar Aug 05 '22 21:08 rawlinp