
Add jitter to HTTP retry behavior

Open jszwedko opened this issue 3 years ago • 10 comments

Spun off from https://github.com/timberio/vector/issues/3998#issuecomment-715612698

Bruce observed that for 429 (Too Many Requests) responses we are retrying these as fast as we can. I think it is more typical for clients to back off under these conditions.

Additionally, if a Retry-After header is present, we could back off until that time.

We could also consider doing this for 500s or other signals.
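
For illustration, a minimal sketch (in Rust, but not Vector's actual retry code) of what honoring `Retry-After` could look like; `retry_delay` and its fallback argument are hypothetical names:

```rust
use std::time::Duration;

/// Hypothetical helper: decide how long to wait before retrying a 429.
/// Prefers the server's `Retry-After` value (seconds form) when present,
/// otherwise falls back to a caller-supplied backoff delay.
/// (The HTTP-date form of `Retry-After` is ignored here for brevity.)
fn retry_delay(retry_after_header: Option<&str>, fallback: Duration) -> Duration {
    retry_after_header
        .and_then(|v| v.trim().parse::<u64>().ok())
        .map(Duration::from_secs)
        .unwrap_or(fallback)
}

fn main() {
    // With a `Retry-After: 30` response header, wait 30s; otherwise use the fallback.
    assert_eq!(retry_delay(Some("30"), Duration::from_secs(1)), Duration::from_secs(30));
    assert_eq!(retry_delay(None, Duration::from_secs(1)), Duration::from_secs(1));
}
```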

jszwedko avatar Oct 23 '20 22:10 jszwedko

We did attempt to implement Fibonacci backoffs as part of https://github.com/timberio/vector/pull/1006, but it's definitely worth investigating why we did not observe that behavior. It's possible that 429s don't follow that path since they're a client error?

lukesteensen avatar Oct 23 '20 22:10 lukesteensen

I have been unable to simply reproduce a lack of backoff. An individual request that encounters an HTTP 429 or 503 response is retried after a delay, and the delay increases according to a Fibonacci sequence as specified by #1006. However, each additional request is tried immediately, even while previous requests are still being delayed. This explains behavior that may look like a "no backoff" situation.
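
For concreteness, a minimal sketch of a per-request Fibonacci backoff (hypothetical type, not Vector's actual implementation); because each request owns its own state, a brand-new request still starts at the base delay:

```rust
use std::time::Duration;

/// Illustrative per-request Fibonacci backoff, assuming each in-flight
/// request owns its own `FibonacciBackoff` instance.
struct FibonacciBackoff {
    current: u64, // milliseconds
    next: u64,
}

impl FibonacciBackoff {
    fn new(base_ms: u64) -> Self {
        Self { current: base_ms, next: base_ms }
    }

    /// Returns the next delay and advances the sequence (base, base, 2*base, 3*base, ...).
    fn next_delay(&mut self) -> Duration {
        let delay = Duration::from_millis(self.current);
        let sum = self.current + self.next;
        self.current = self.next;
        self.next = sum;
        delay
    }
}

fn main() {
    // Request A retries with growing delays...
    let mut a = FibonacciBackoff::new(1000);
    let delays: Vec<_> = (0..4).map(|_| a.next_delay().as_millis()).collect();
    assert_eq!(delays, vec![1000, 1000, 2000, 3000]);

    // ...while a brand-new request B starts over at the base delay,
    // which is why many fresh requests can look like "no backoff".
    let mut b = FibonacciBackoff::new(1000);
    assert_eq!(b.next_delay().as_millis(), 1000);
}
```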

I am not clear whether there is a good path to improving this situation without causing bigger problems, or even whether this behavior is undesirable. For example, we could set a flag (somewhere?) when we hit the first 429 and not try any more requests until one of them goes through, but that fully stops our deliveries for stretches of time, which could kill the delivery rate.

I recommend closing this and reopening if our behavior is causing problems/complaints.

bruceg avatar Nov 12 '20 23:11 bruceg

Aha, I see, that makes sense. I agree that modifying that behavior is likely to be an invasive change.

Maybe we could add some jitter so they at least don't all fire off at the same time? It seems like it is a fixed back-off: https://github.com/timberio/vector/pull/1006/files#diff-6ede802cfbd8c6cafa00862e178bad867ac0960838e589bd584dfdacbfe5021aR53

jszwedko avatar Nov 17 '20 21:11 jszwedko

I wouldn't argue with adding jitter, but that would be a different issue. I also question if that would help any in real usage, as each request has its own backoff time and the generation of the requests would be dependent on data from a source, which would not come in all at the same time.

bruceg avatar Nov 18 '20 20:11 bruceg

> I also question if that would help any in real usage, as each request has its own backoff time and the generation of the requests would be dependent on data from a source, which would not come in all at the same time.

True, though if a large amount of buffered data is split up into N requests that start at close to the same time, those requests could end up retrying at the same intervals (especially if it is the number of concurrent requests itself that is causing the upstream failure). We could see this if, for example, we have a number of sources that scrape on the same interval and send to the same sinks.

https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/#Jitter has some motivating content.
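
A minimal sketch of the "full jitter" strategy described in that article (using the `rand` crate, 0.8 API; the function name and parameters are illustrative, not Vector's API):

```rust
use std::time::Duration;
use rand::Rng;

/// "Full jitter": instead of sleeping for the full capped exponential backoff,
/// sleep for a random duration between zero and that backoff.
fn full_jitter_delay(base: Duration, cap: Duration, attempt: u32) -> Duration {
    let backoff = base
        .checked_mul(2u32.saturating_pow(attempt))
        .unwrap_or(cap)
        .min(cap);
    let millis = rand::thread_rng().gen_range(0..=backoff.as_millis() as u64);
    Duration::from_millis(millis)
}

fn main() {
    // Two requests on their third retry no longer wake up at exactly the same time.
    let d1 = full_jitter_delay(Duration::from_millis(500), Duration::from_secs(30), 3);
    let d2 = full_jitter_delay(Duration::from_millis(500), Duration::from_secs(30), 3);
    println!("retry delays: {:?} vs {:?}", d1, d2);
}
```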

I agree though, this is a separate issue. I'm happy to close this one and open one for adding jitter.

jszwedko avatar Nov 18 '20 22:11 jszwedko

Noting that this has been idle for a year now, so it is low priority.

bruceg avatar Apr 16 '21 19:04 bruceg

Hi, I think I am experiencing something related to this when an agent cannot connect to the HTTP sink. The Vector agent then starts logging Request timed out messages, which fills up the logs and causes disks to fill.

In a scenario like this, such as a network partition, I think it would be better to have a global backoff for all requests to an affected sink.
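
A rough sketch of what I mean by a global backoff, assuming a shared pause deadline per sink (illustrative only, not Vector's internals):

```rust
use std::sync::{Arc, Mutex};
use std::time::{Duration, Instant};

/// Shared "pause" deadline for a sink, so that every request (not only the
/// one that failed) waits out the backoff.
#[derive(Clone, Default)]
struct SinkBackoffGate {
    pause_until: Arc<Mutex<Option<Instant>>>,
}

impl SinkBackoffGate {
    /// How long a request should wait before being sent, if the sink is paused.
    fn remaining_pause(&self) -> Option<Duration> {
        let guard = self.pause_until.lock().unwrap();
        (*guard).and_then(|deadline| deadline.checked_duration_since(Instant::now()))
    }

    /// Called when any request to the sink fails (timeout, 429, ...).
    fn pause_for(&self, backoff: Duration) {
        let mut guard = self.pause_until.lock().unwrap();
        let new_deadline = Instant::now() + backoff;
        if (*guard).map_or(true, |existing| existing < new_deadline) {
            *guard = Some(new_deadline);
        }
    }
}

fn main() {
    let gate = SinkBackoffGate::default();
    gate.pause_for(Duration::from_secs(5));
    // Every pending request to this sink now observes roughly the same pause.
    println!("wait before sending: {:?}", gate.remaining_pause());
}
```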

HTTP sink config:

[sinks.http]
compression = "gzip"
encoding = "json"
inputs = ["journald"]
type = "http"
uri = "https://my.log.server:5001"

Example logs:

Sep 01 08:39:40 my.host vector[25037]: Sep 01 08:39:40.841  WARN sink{component_id=http component_kind="sink" component_type=http component_name=http}:request{request_id=118}: vector::sinks::util::retries: Request timed out. If this happens often while the events are actually reaching their destination, try decreasing `batch.max_bytes` and/or using `compression` if applicable. Alternatively `request.timeout_secs` can be increased.
Sep 01 08:39:41 my.host vector[25037]: Sep 01 08:39:41.842  WARN sink{component_id=http component_kind="sink" component_type=http component_name=http}:request{request_id=44}: vector::sinks::util::retries: Request timed out. If this happens often while the events are actually reaching their destination, try decreasing `batch.max_bytes` and/or using `compression` if applicable. Alternatively `request.timeout_secs` can be increased.
Sep 01 08:39:42 my.host vector[25037]: Sep 01 08:39:42.187  WARN sink{component_id=http component_kind="sink" component_type=http component_name=http}:request{request_id=19}: vector::sinks::util::retries: Request timed out. If this happens often while the events are actually reaching their destination, try decreasing `batch.max_bytes` and/or using `compression` if applicable. Alternatively `request.timeout_secs` can be increased.
Sep 01 08:39:42 my.host vector[25037]: Sep 01 08:39:42.455  WARN sink{component_id=http component_kind="sink" component_type=http component_name=http}:request{request_id=90}: vector::sinks::util::retries: Request timed out. If this happens often while the events are actually reaching their destination, try decreasing `batch.max_bytes` and/or using `compression` if applicable. Alternatively `request.timeout_secs` can be increased.
Sep 01 08:39:42 my.host vector[25037]: Sep 01 08:39:42.455  WARN sink{component_id=http component_kind="sink" component_type=http component_name=http}:request{request_id=119}: vector::sinks::util::retries: Request timed out. If this happens often while the events are actually reaching their destination, try decreasing `batch.max_bytes` and/or using `compression` if applicable. Alternatively `request.timeout_secs` can be increased.
Sep 01 08:39:42 my.host vector[25037]: Sep 01 08:39:42.455  WARN sink{component_id=http component_kind="sink" component_type=http component_name=http}:request{request_id=65}: vector::sinks::util::retries: Request timed out. If this happens often while the events are actually reaching their destination, try decreasing `batch.max_bytes` and/or using `compression` if applicable. Alternatively `request.timeout_secs` can be increased.

gsalisbury avatar Aug 31 '21 22:08 gsalisbury

Hi, any updates? I am experiencing the same issue.

LinTechSo avatar Aug 06 '22 10:08 LinTechSo

Hi @LinTechSo !

No updates yet. It seems like the correct path here is to add some jitter to the retries. We'd be happy to accept a PR for this.

jszwedko avatar Aug 12 '22 18:08 jszwedko

We have seen issues related to this. The use case is AWS service (ELB, NFW, etc.) logs being written to S3. These logs can be 25 MB each, and we wire them up to send SQS notifications, then have Vector listening to those and pulling down and processing the log files to be sent to Loki. Loki has per-stream rate limiting built in, and sends 429 response codes to Vector when it receives too many requests too quickly. Vector's adaptive concurrency (on by default) makes this the norm for this case (processing a large batch of messages), since it slowly ramps up the number of concurrent requests and doesn't treat 429s as an indicator to use less.

My understanding of what is happening: Loki is configured with a 3 MB/s rate limit, which means a 25 MB batch could be processed in under 10 seconds. However, Vector's adaptive concurrency tries to push all of the data through at once, gets mostly 429s back, then retries all of the 429'd requests at the same time, and this happens again and again. We have had to disable adaptive request concurrency for these use cases, which is unfortunate since it means we are now managing the settings manually.

I would love to see some jitter added. Or, perhaps even better from my perspective, the adaptive concurrency mechanism could become more aware of 429s, which are a standard indicator that less data should be sent. I think lowering concurrency in that case would be appropriate.
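
To illustrate the suggestion, a rough AIMD-style sketch (not Vector's actual adaptive concurrency implementation) in which 429s reduce the concurrency limit instead of only being retried:

```rust
/// Illustrative additive-increase/multiplicative-decrease concurrency limit.
struct ConcurrencyController {
    limit: usize,
    max_limit: usize,
}

impl ConcurrencyController {
    fn new(max_limit: usize) -> Self {
        Self { limit: 1, max_limit }
    }

    /// Additive increase on success.
    fn on_success(&mut self) {
        self.limit = (self.limit + 1).min(self.max_limit);
    }

    /// Multiplicative decrease on 429 (or other overload signals).
    fn on_too_many_requests(&mut self) {
        self.limit = (self.limit / 2).max(1);
    }
}

fn main() {
    let mut ctrl = ConcurrencyController::new(16);
    for _ in 0..10 {
        ctrl.on_success();
    }
    assert_eq!(ctrl.limit, 11);
    ctrl.on_too_many_requests(); // a 429 should cut concurrency, not just trigger a retry
    assert_eq!(ctrl.limit, 5);
}
```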

fletcm avatar Jan 04 '23 20:01 fletcm