
backoff/retry policy does not influence behavior for most kinesis errors

Open lavie opened this issue 4 years ago • 0 comments

It seems that the backoff/retry configuration only applies to connection errors (e.g. no network), in which case libbeat's backoff policy kicks in before retrying. However, since Connect and Close don't really make sense for Kinesis (their implementations in the client are just stubs that never fail), every other kind of failure results in an immediate retry with no delay.
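For context, the streams client's Connect/Close look roughly like the sketch below (names and the exact libbeat interface are my assumptions, not copied from the repo). Because Connect never returns an error, libbeat's connection-level backoff has nothing to react to, and a failed batch is simply handed straight back for retry:

```go
// Illustrative sketch only: a libbeat output client whose Connect/Close
// are no-ops. Struct/package names are hypothetical.
package streams

import "github.com/elastic/beats/libbeat/publisher"

type client struct{} // stream name, AWS client, etc. omitted

func (c *client) Connect() error { return nil } // stub: nothing to connect
func (c *client) Close() error   { return nil } // stub: nothing to close
func (c *client) String() string { return "kinesis_streams" }

func (c *client) Publish(batch publisher.Batch) error {
	// Kinesis PutRecords errors (throttling, AccessDenied, per-record
	// failures) surface here, after Connect has already "succeeded",
	// so the configured backoff never delays the retry.
	batch.Retry() // immediate retry, no backoff
	return nil
}
```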

So, here are a few examples of cases where the Kinesis streams client ends up in an infinite, CPU-bound retry loop (an error-classification sketch follows the list):

  1. When the stream's throughput is exceeded, the failed records are retried immediately and, in most cases, just produce more throughput errors (because there is no backoff).
  2. When the stream's IAM permissions are missing, Kinesis returns a permission-denied error, and the records are retried immediately and indefinitely, eventually triggering AWS API rate limiting.
  3. When Kinesis is called too frequently (e.g. because of the above), the error is a rate limit, and the client again retries immediately, exacerbating the problem.
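The three cases call for different handling: throughput and rate-limit errors are worth retrying with a delay, while a permission error will never succeed no matter how often it is retried. A hedged sketch of how the output could classify them (the helper and the exact set of codes treated as retryable are my assumptions; the constants come from aws-sdk-go):

```go
// Sketch: classify Kinesis errors so retryable ones can be backed off
// and permanent ones (e.g. missing IAM permissions) can be dropped or
// logged instead of retried forever. The exact set of retryable codes
// is a judgment call, not taken from the project.
import (
	"github.com/aws/aws-sdk-go/aws/awserr"
	"github.com/aws/aws-sdk-go/service/kinesis"
)

func isRetryableKinesisError(err error) bool {
	aerr, ok := err.(awserr.Error)
	if !ok {
		return false
	}
	switch aerr.Code() {
	case kinesis.ErrCodeProvisionedThroughputExceededException, // case 1
		kinesis.ErrCodeLimitExceededException,
		"ThrottlingException": // case 3
		return true
	case "AccessDeniedException": // case 2: retrying will never succeed
		return false
	}
	return false
}
```

Note that for PutRecords, throttling can also show up per record in the response (FailedRecordCount plus an ErrorCode on each failed entry) rather than as a top-level error, so both paths would need to be checked.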

It's unclear at which level this problem should be solved:

  1. The output level (add some retrying around PutRecords; see the sketch after this list).
  2. The AWS SDK level (it knows how to back off in certain circumstances).
  3. The publisher.
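For option 1, a minimal sketch of what backoff around PutRecords could look like inside the output (function name, parameters, and the jitter scheme are assumptions; isRetryableKinesisError is the helper sketched above):

```go
// Sketch of option 1: exponential backoff with jitter around PutRecords.
// Not the project's actual code.
import (
	"math/rand"
	"time"

	"github.com/aws/aws-sdk-go/service/kinesis"
	"github.com/aws/aws-sdk-go/service/kinesis/kinesisiface"
)

func putRecordsWithBackoff(svc kinesisiface.KinesisAPI, input *kinesis.PutRecordsInput, maxRetries int) (*kinesis.PutRecordsOutput, error) {
	backoff := 200 * time.Millisecond
	var out *kinesis.PutRecordsOutput
	var err error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		out, err = svc.PutRecords(input)
		// Success with no per-record failures: done.
		if err == nil && (out.FailedRecordCount == nil || *out.FailedRecordCount == 0) {
			return out, nil
		}
		// Give up on errors that retrying cannot fix (e.g. AccessDenied).
		if err != nil && !isRetryableKinesisError(err) {
			return out, err
		}
		// Back off with jitter before the next attempt. A real
		// implementation would also resend only the failed records;
		// resending the whole input here keeps the sketch short but
		// would duplicate already-accepted records.
		time.Sleep(backoff + time.Duration(rand.Int63n(int64(backoff/2))))
		backoff *= 2
	}
	return out, err
}
```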

To recreate this problem, create a Kinesis stream and deny the PutRecords permission on it to everyone, then feed a single input event to Filebeat and watch the CPU go to 100% and stay there.

lavie · Sep 25 '19 11:09