Elasticsearch partial bulk failures
I would like to think about partial failures when ingesting data into Elasticsearch and whether we even want to handle this scenario. Options include:
- Don't do anything.
- Collect individual failed items and retry. This will probably require error code matching (see the sketch after this list).
- Provide an option to dead-letter the items.
This is not urgent but I wanted to get it up for discussion and posterity.
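For the second option, a hedged sketch (in Rust, for illustration only) of the kind of error-type matching it would involve; the classification below is an assumption, not a vetted retry policy:

```rust
/// Rough illustration of per-item error-type matching for bulk failures.
/// `error_type` is the item's `error.type` string and `status` is the
/// per-item status from the bulk response. The set chosen here is only
/// an example, not an exhaustive policy.
fn is_retriable_bulk_error(error_type: &str, status: u16) -> bool {
    match (error_type, status) {
        // Backpressure from a full bulk queue; retrying later can succeed.
        ("es_rejected_execution_exception", _) | (_, 429) => true,
        // Schema/mapping violations will fail the same way on every attempt.
        ("mapper_parsing_exception", _)
        | ("strict_dynamic_mapping_exception", _)
        | ("illegal_argument_exception", _) => false,
        // Transient cluster-side unavailability is generally retriable.
        (_, 503) => true,
        _ => false,
    }
}
```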
I spent some time on this as part of #139, and I think to do it right will require some rework of the util `HttpSink`. Right now, we don't have a good way to parse the body of an HTTP response and inspect it in our retry logic. One option would be to introduce some kind of response mapper we pass in. Another would be to strip down the inner sink, just using it as a service and pulling up things like timeouts and retries to the level of the outer sinks. The latter would mesh well with making those things configurable per-sink, so that might be the most promising route to explore.
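To illustrate the response-mapper idea, a minimal sketch; none of the names below exist in the codebase, and this is not the actual `HttpSink` interface:

```rust
use http::Response;

/// Hypothetical hook the outer sink could hand to the shared HTTP service
/// so retry logic can look inside the response body instead of only at
/// the status code.
trait ResponseMapper: Send + Sync {
    /// Decide whether the request should be retried based on the full
    /// response, including its body.
    fn should_retry(&self, response: &Response<Vec<u8>>) -> bool;
}

/// An Elasticsearch-specific mapper could deserialize the bulk response
/// and report a retry when the top-level `errors` flag is set.
struct EsBulkResponseMapper;

impl ResponseMapper for EsBulkResponseMapper {
    fn should_retry(&self, response: &Response<Vec<u8>>) -> bool {
        serde_json::from_slice::<serde_json::Value>(response.body())
            .ok()
            .and_then(|body| body.get("errors").and_then(|e| e.as_bool()))
            .unwrap_or(false)
    }
}
```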
> Another would be to strip down the inner sink, just using it as a service and pulling up things like timeouts and retries to the level of the outer sinks.
I like that and agree
To add more context, this is definitely a later-version feature and I don't think we should get cute with it. The other issue is that the response body can be very large depending on how much data was flushed, which can also cause performance and memory issues. I think a good first step is to check the root-level `errors` key and then just retry the entire request. To prevent duplicate data, users can generate their own deterministic IDs, which ES will reject on duplicate requests.
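As a rough sketch of that first step (assuming the serde_json crate): deserialize only the root-level `errors` flag, which avoids materializing the potentially large `items` array, and signal a whole-request retry when it is set.

```rust
use serde::Deserialize;

/// Only the field we care about; serde skips the potentially huge
/// `items` array, so per-item results are never materialized.
#[derive(Deserialize)]
struct BulkResponseSummary {
    errors: bool,
}

/// Returns true when the bulk response reports at least one failed item
/// and the whole request should be resubmitted.
fn should_retry_whole_request(body: &[u8]) -> bool {
    match serde_json::from_slice::<BulkResponseSummary>(body) {
        Ok(summary) => summary.errors,
        // If the body can't be parsed, fall back to status-code-based
        // retry logic rather than deciding here.
        Err(_) => false,
    }
}
```

With something like this, the retry path stays simple: resubmit everything and rely on deterministic `_id`s to turn already-indexed documents into harmless conflicts.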
Noting that a user ran into this again today: https://discord.com/channels/742820443487993987/746070591097798688/775478035738525707
Also noting that @fanatid attempted something like @lukesteensen mentioned with retrying partial failures in the #2755 sink, though it looks like we'll be reverting and reintroducing that handling later.
@binarylogic I'd tag this as a have: must unless I'm missing something. I feel like this is a fairly common failure mode for ES to reject individual records and, right now, these events just fall on the floor without even any indication in the logs. I agree that just retrying the whole request would be better than nothing given we are using the index action and id conflicts would be handled.
Yeah, there are a few reasons I marked it as should:
- Logstash does not retry partial failures.
- The status code returned by Elasticsearch is 201.
- The most common reason for partial failures is events that violate the schema, which can be ignored via the `ignore_malformed` Elasticsearch setting.
- We can't retry the entire request unless we also set the `_id` field to avoid inserting duplicate data.
Given the above, I'm wondering if partial retries are the best path forward as opposed to a dead-letter sink. There will be circumstances where partial retries will never succeed.
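To make the dead-letter alternative concrete, a hypothetical sketch (it does not mirror any existing Vector internals): pick out which items in the original batch were rejected, so the corresponding events could be routed to a dead-letter sink instead of being dropped.

```rust
use serde_json::Value;

/// Indexes of events in the original batch that Elasticsearch rejected,
/// extracted from the bulk response `items` array. A dead-letter sink
/// could then receive exactly those events.
fn failed_item_indexes(bulk_response: &Value) -> Vec<usize> {
    bulk_response["items"]
        .as_array()
        .map(|items| {
            items
                .iter()
                .enumerate()
                .filter(|(_, item)| {
                    // Each item is keyed by the action, e.g. "index" or "create".
                    item.as_object()
                        .and_then(|obj| obj.values().next())
                        .and_then(|action| action.get("error"))
                        .is_some()
                })
                .map(|(i, _)| i)
                .collect()
        })
        .unwrap_or_default()
}
```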
Those are fair reasons; there are certainly cases where partial retries will never succeed, and it would be good to have dead-letter support.
However, in the case mentioned in Discord, none of the inserts in the request were successful, and would never succeed on a retry, and yet we were continuing to process events without note. Depending on the source the events are coming from, this could require some work on the user's part to replay messages once they notice that none were making it into ES.
At the least, I think we should be logging and reporting a metric for failed inserts into ES.
> At the least, I think we should be logging and reporting a metric for failed inserts into ES.
Agree, and @bruceg implemented logging for partial failures in https://github.com/timberio/vector/pull/2185.
It seems like that was dropped along the way because it isn't in master:
https://github.com/timberio/vector/blob/a789e42159cc8a617da0c2f8f7df8f5de91fca06/src/sinks/elasticsearch.rs#L300-L326
EDIT never mind, I see it was just moved :smile:
I'll try this out as it seemed like the user did not see any errors.
Yeah, and clearly it would be nice to have a test for this if there's a way.
Indeed, I was mistaken, apologies for the goose chase. I do see errors:
```
Nov 10 16:11:52.532 INFO vector::app: Log level is enabled. level="info"
Nov 10 16:11:52.534 INFO vector::app: Loading configs. path=["/home/jesse/workspace/vector-configs/elasticsearch.toml"]
Nov 10 16:11:52.560 INFO vector::sources::stdin: Capturing STDIN.
Nov 10 16:11:52.593 INFO vector::topology: Running healthchecks.
Nov 10 16:11:52.596 INFO vector::topology: Starting source. name="stdin"
Nov 10 16:11:52.596 INFO vector::topology: Starting sink. name="elasticsearch"
Nov 10 16:11:52.596 INFO vector: Vector has started. version="0.11.0" git_version="v0.9.0-1298-g76bb25c" released="Tue, 10 Nov 2020 16:02:23 +0000" arch="x86_64"
Nov 10 16:11:52.602 INFO vector::topology::builder: Healthcheck: Passed.
Nov 10 16:11:53.816 ERROR sink{component_kind="sink" component_name=elasticsearch component_type=elasticsearch}:request{request_id=0}: vector::sinks::util::retries: Not retriable; dropping the request. reason="error type: illegal_argument_exception, reason: only write ops with an op_type of create are allowed in data streams"
Nov 10 16:11:54.754 ERROR sink{component_kind="sink" component_name=elasticsearch component_type=elasticsearch}:request{request_id=1}: vector::sinks::util::retries: Not retriable; dropping the request. reason="error type: illegal_argument_exception, reason: only write ops with an op_type of create are allowed in data streams"
Nov 10 16:11:55.873 ERROR sink{component_kind="sink" component_name=elasticsearch component_type=elasticsearch}:request{request_id=2}: vector::sinks::util::retries: Not retriable; dropping the request. reason="error type: illegal_argument_exception, reason: only write ops with an op_type of create are allowed in data streams"
Nov 10 16:11:56.801 ERROR sink{component_kind="sink" component_name=elasticsearch component_type=elasticsearch}:request{request_id=3}: vector::sinks::util::retries: Not retriable; dropping the request. reason="error type: illegal_argument_exception, reason: only write ops with an op_type of create are allowed in data streams"
Nov 10 16:11:57.895 ERROR sink{component_kind="sink" component_name=elasticsearch component_type=elasticsearch}:request{request_id=4}: vector::sinks::util::retries: Not retriable; dropping the request. reason="error type: illegal_argument_exception, reason: only write ops with an op_type of create are allowed in data streams"
```
We were trying to use Vector as a server which reads data from Kafka and pushes it into Elasticsearch, but right now Vector does not support mapping conflicts/error handling. We don't want to lose errored-out events. There should be a config option to set up a DLQ so that errored-out events can be manipulated with transforms and reprocessed.
If partial retries are too hard to implement for now, I would like to have an option to just retry the full request (which is completely safe when setting the `_id` field), if that would help to get this implemented more quickly.
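As an illustration of the deterministic `_id` idea (a sketch only, assuming the sha2 and hex crates; which fields identify an event is up to the user):

```rust
use sha2::{Digest, Sha256};

/// Derive a stable document id from the fields that identify an event.
/// The choice of fields here (timestamp, host, message) is an example,
/// not a recommendation.
fn deterministic_id(timestamp: &str, host: &str, message: &str) -> String {
    let mut hasher = Sha256::new();
    hasher.update(timestamp.as_bytes());
    hasher.update(host.as_bytes());
    hasher.update(message.as_bytes());
    hex::encode(hasher.finalize())
}
```

With ids derived like this, retrying a whole bulk request at worst produces version conflicts for documents that already made it in, rather than duplicates.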
Is there any progress here?
> Is there any progress here?
Not yet, unfortunately, but it is on our radar.
@jszwedko what is the current location of this issue on your radar? :)
> @jszwedko what is the current location of this issue on your radar? :)
This may be something we tackle in Q4. We'll likely start with the suggested approach of retrying the whole payload and do something more sophisticated later to retry only the failed events.