
Elasticsearch partial bulk failures

Open binarylogic opened this issue 6 years ago • 15 comments

I would like to think about partial failures when ingesting data into Elasticsearch and whether we even want to handle this scenario. Options include:

  1. Don't do anything.
  2. Collect individual failed items and retry. This will probably require error code matching.
  3. Provide an option to dead-letter the items.

This is not urgent but I wanted to get it up for discussion and posterity.

binarylogic avatar Mar 14 '19 17:03 binarylogic

I spent some time on this as part of #139, and I think doing it right will require some rework of the util HttpSink. Right now, we don't have a good way to parse the body of an HTTP response and inspect it in our retry logic. One option would be to introduce some kind of response mapper we pass in. Another would be to strip down the inner sink, using it only as a service and pulling things like timeouts and retries up to the level of the outer sinks. The latter would mesh well with making those things configurable per-sink, so it might be the most promising route to explore.
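(As a rough illustration of the response-mapper idea, and not Vector's actual API, something along these lines would let each sink teach the retry layer how to read its response bodies; every name and signature below is hypothetical.)

```rust
// Hypothetical sketch of the "response mapper" idea: each sink supplies a
// mapper that inspects the HTTP response body and tells the retry layer what
// to do. All names and signatures here are illustrative, not Vector's API.

/// What the retry layer should do with a completed request.
pub enum RetryAction {
    /// Retry the whole request, with a human-readable reason.
    Retry(String),
    /// Drop the request and log the reason.
    DontRetry(String),
    /// The request fully succeeded.
    Successful,
}

/// Sink-specific hook so response formats like the Elasticsearch bulk
/// response can influence retry decisions.
pub trait ResponseMapper: Send + Sync {
    fn map(&self, status: u16, body: &[u8]) -> RetryAction;
}
```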

lukesteensen avatar Mar 21 '19 14:03 lukesteensen

Another would be to strip down the inner sink, just using it as a service and pulling up things like timeouts and retries to the level of the outer sinks.

I like that and agree.

To add more context, this is definitely a later-version feature and I don't think we should get cute with it. The other issue is that the response body can be very large depending on how much data was flushed, which can also cause performance and memory issues. I think a good first step is to check the root-level "errors" key and then just retry the entire request. To prevent duplicate data, users can generate their own deterministic IDs, which ES will reject as duplicates on retried requests.
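(A minimal sketch of that first step, assuming serde and serde_json; only the top-level errors flag is deserialized, so the potentially large per-item results are never materialized. This is not Vector's actual code.)

```rust
use serde::Deserialize;

/// Only the top-level `errors` flag of the bulk response is deserialized;
/// serde skips the per-item results, so a large body is not materialized.
#[derive(Deserialize)]
struct BulkResponse {
    errors: bool,
}

/// Returns true if the whole bulk request should be retried.
/// Sketch only: real logic would also look at HTTP status codes and
/// distinguish retriable item errors (e.g. 429) from permanent ones.
fn should_retry_bulk(body: &[u8]) -> bool {
    serde_json::from_slice::<BulkResponse>(body)
        .map(|resp| resp.errors)
        .unwrap_or(false)
}
```

With something like this in place, a first iteration could simply feed the response body to the check above and resubmit the whole batch, relying on deterministic IDs to avoid duplicates.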

binarylogic avatar Mar 21 '19 14:03 binarylogic

Noting that a user ran into this again today: https://discord.com/channels/742820443487993987/746070591097798688/775478035738525707

Also noting that @fanatid attempted something like what @lukesteensen mentioned, retrying partial failures, in the #2755 sink, though it looks like we'll be reverting and reintroducing that handling later.

@binarylogic I'd tag this as a have: must unless I'm missing something. It seems like a fairly common failure mode for ES to reject individual records, and right now these events just fall on the floor without any indication in the logs. I agree that just retrying the whole request would be better than nothing, given that we are using the index action and id conflicts would be handled.

jszwedko avatar Nov 09 '20 22:11 jszwedko

Yeah, there are a few reasons I marked it as should:

  1. Logstash does not retry partial failures.
  2. The status code returned by Elasticsearch is 201.
  3. The most common reason for partial failures is events that violate the schema, which can be ignored via the ignore_malformed elasticsearch setting.
  4. We can't retry the entire request unless we also set the _id field to avoid inserting duplicate data.

Given the above, I'm wondering if partial retries are the best path forward as opposed to a dead-letter sink. There will be circumstances where partial retries will never succeed.
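(For illustration only: a deterministic _id can be derived by hashing the encoded document and putting the digest in the bulk action metadata, so a full-request retry replaces the same document rather than duplicating it. The sketch below assumes the sha2 and hex crates and is not Vector's implementation.)

```rust
use sha2::{Digest, Sha256};

/// Build the two NDJSON lines for one document in an Elasticsearch bulk
/// request, using a content hash as `_id`. Re-sending the same document on a
/// full retry then overwrites the existing copy instead of duplicating it.
/// Sketch only; a real encoder would handle actions, escaping, etc.
fn bulk_lines(index: &str, doc_json: &str) -> String {
    let id = hex::encode(Sha256::digest(doc_json.as_bytes()));
    format!(
        "{{\"index\":{{\"_index\":\"{}\",\"_id\":\"{}\"}}}}\n{}\n",
        index, id, doc_json
    )
}

fn main() {
    let doc = r#"{"message":"hello","host":"web-1"}"#;
    // The same document always yields the same _id, so retrying the whole
    // bulk request cannot create duplicates.
    print!("{}", bulk_lines("logs-2020.11.10", doc));
}
```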

binarylogic avatar Nov 09 '20 22:11 binarylogic

Those are fair reasons; there are certainly cases where partial retries will never succeed, and it would be good to have dead-letter support.

However, in the case mentioned in Discord, none of the inserts in the request were successful and would never succeed on a retry, yet we were continuing to process events without any note of it. Depending on the source the events are coming from, this could require some work on the user's part to replay messages once they notice that none are making it into ES.

At the least, I think we should be logging and reporting a metric for failed inserts into ES.

jszwedko avatar Nov 10 '20 15:11 jszwedko

At the least, I think we should be logging and reporting a metric for failed inserts into ES.

Agree, and @bruceg implemented logging for partial failures in https://github.com/timberio/vector/pull/2185.

binarylogic avatar Nov 10 '20 15:11 binarylogic

It seems like that was dropped along the way because it isn't in master:

https://github.com/timberio/vector/blob/a789e42159cc8a617da0c2f8f7df8f5de91fca06/src/sinks/elasticsearch.rs#L300-L326

EDIT: never mind, I see it was just moved :smile:

I'll try this out as it seemed like the user did not see any errors.

jszwedko avatar Nov 10 '20 15:11 jszwedko

Yeah, and clearly it would be nice to have a test for this if there's a way.

binarylogic avatar Nov 10 '20 15:11 binarylogic

Indeed, I was mistaken; apologies for the goose chase. I do see errors:

Nov 10 16:11:52.532  INFO vector::app: Log level is enabled. level="info"
Nov 10 16:11:52.534  INFO vector::app: Loading configs. path=["/home/jesse/workspace/vector-configs/elasticsearch.toml"]
Nov 10 16:11:52.560  INFO vector::sources::stdin: Capturing STDIN.
Nov 10 16:11:52.593  INFO vector::topology: Running healthchecks.
Nov 10 16:11:52.596  INFO vector::topology: Starting source. name="stdin"
Nov 10 16:11:52.596  INFO vector::topology: Starting sink. name="elasticsearch"
Nov 10 16:11:52.596  INFO vector: Vector has started. version="0.11.0" git_version="v0.9.0-1298-g76bb25c" released="Tue, 10 Nov 2020 16:02:23 +0000" arch="x86_64"
Nov 10 16:11:52.602  INFO vector::topology::builder: Healthcheck: Passed.
Nov 10 16:11:53.816 ERROR sink{component_kind="sink" component_name=elasticsearch component_type=elasticsearch}:request{request_id=0}: vector::sinks::util::retries: Not retriable; dropping the request. reason="error type: illegal_argument_exception, reason: only write ops with an op_type of create are allowed in data streams"
Nov 10 16:11:54.754 ERROR sink{component_kind="sink" component_name=elasticsearch component_type=elasticsearch}:request{request_id=1}: vector::sinks::util::retries: Not retriable; dropping the request. reason="error type: illegal_argument_exception, reason: only write ops with an op_type of create are allowed in data streams"
Nov 10 16:11:55.873 ERROR sink{component_kind="sink" component_name=elasticsearch component_type=elasticsearch}:request{request_id=2}: vector::sinks::util::retries: Not retriable; dropping the request. reason="error type: illegal_argument_exception, reason: only write ops with an op_type of create are allowed in data streams"
Nov 10 16:11:56.801 ERROR sink{component_kind="sink" component_name=elasticsearch component_type=elasticsearch}:request{request_id=3}: vector::sinks::util::retries: Not retriable; dropping the request. reason="error type: illegal_argument_exception, reason: only write ops with an op_type of create are allowed in data streams"
Nov 10 16:11:57.895 ERROR sink{component_kind="sink" component_name=elasticsearch component_type=elasticsearch}:request{request_id=4}: vector::sinks::util::retries: Not retriable; dropping the request. reason="error type: illegal_argument_exception, reason: only write ops with an op_type of create are allowed in data streams"

jszwedko avatar Nov 10 '20 16:11 jszwedko

We are trying to use Vector as a service that reads data from Kafka and pushes it into Elasticsearch, but right now Vector does not support handling mapping conflicts/errors. We don't want to lose errored-out events. There should be a config option for a DLQ, which would let us manipulate the errored-out events with transforms and reprocess them.

narendraingale2 avatar Jul 05 '21 03:07 narendraingale2

If partial retries are too hard to implement for now, I would like to have an option for just retrying the full request (which is completely safe when setting the _id field), if that would help get this implemented more quickly.

Thor77 avatar Mar 01 '22 15:03 Thor77

Is there any progress here?

ojh3636 avatar May 16 '22 11:05 ojh3636

Is there any progress here?

Not yet, unfortunately, but it is on our radar.

jszwedko avatar May 16 '22 23:05 jszwedko

@jszwedko what is the current location of this issue on your radar? :)

zamazan4ik avatar Sep 16 '22 17:09 zamazan4ik

@jszwedko what is the current location of this issue on your radar? :)

This may be something we tackle in Q4. We'll likely start with the suggested approach of just retrying the whole payload, and do something more sophisticated later to only retry failed events.

jszwedko avatar Sep 16 '22 19:09 jszwedko