vector icon indicating copy to clipboard operation
vector copied to clipboard

Vector sources acknowledge discarded events

Open jszwedko opened this issue 3 years ago • 17 comments

When end-to-end acknowledgments are enabled, Vector sources (where possible) are configured to wait until events are successful processed by associated sinks before acknowledging to the client connected to the source.

However, if an event is discarded by the configured topology, the source acknowledges this event as successfully processed. This was the intended implementation, but it is becoming clearer that this is unexpected behavior. As a Vector user, I would not expect these events to be marked as acknowledged.

Vector can discard events early when:

  • A transform errors and drops the event (like remap when drop_on_err = true)
  • A component output is not used as an input to another component (e.g. an unconsumed route)

Instead, these should result in negative acknowledgements to the source.

Prompted by https://discord.com/channels/742820443487993987/746070591097798688/964167175923433522

Related: https://github.com/vectordotdev/vector/issues/12109

jszwedko avatar Apr 14 '22 19:04 jszwedko

We could fail to boot a Vector config that has an unconsumed output for a source with acks enabled.

jszwedko avatar Apr 14 '22 19:04 jszwedko

Another idea courtesy of @binarylogic: we force all possible component errors to be handled explicitly when ack's are enabled.

Expanding that some, I think it could look like: all components that can drop events would have a on_error: (reject|discard|route) config option that would let users configure the acking behavior when an error occurs. If:

  • reject it would reject the event and cause a negative ack in the source
  • discard it would be discarded and ack'd
  • route it would be routed to a error output

jszwedko avatar Apr 14 '22 19:04 jszwedko

cc/ @hhromic for any thoughts since your discord conversation prompted this discussion 🙂

jszwedko avatar Apr 14 '22 19:04 jszwedko

reject it would reject the event and cause a negative ack in the source

What's a negative ack?

spencergilbert avatar Apr 14 '22 19:04 spencergilbert

reject it would reject the event and cause a negative ack in the source

What's a negative ack?

Ah, for HTTP, this would be something like returning a 400.

For sources like kafka there is no "negative ack" so it just wouldn't ack.

jszwedko avatar Apr 14 '22 19:04 jszwedko

An alternative UX could be to have the blackhole sink have an option for rejecting data (or a new sink) that would allow us just to always have an error named output that the user would be responsible for routing to a blackhole sink to discard or reject when end-to-end acks are enabled.

jszwedko avatar Apr 14 '22 19:04 jszwedko

cc/ @hhromic for any thoughts since your discord conversation prompted this discussion 🙂

Hello all, first of all, thanks for considering the opinion of users. It is highly appreciated :) Not every software package we use can be praised with the same :P

While checking the documentation again, I stumbled upon the actual place where acknowledgement is properly defined:

For a source that supports end-to-end acknowledgements and is connected to a sink that has the acknowledgements option enabled, this will cause it to wait for all connected sinks to either mark the event as delivered, or to persist the events to a disk buffer, if configured on the sink, before acknowledging receipt of the event. If a sink signals the event was rejected and the source can provide an error response (i.e. through an HTTP error code), then the source will provide an appropriate error to the sending agent.

More importantly, it also clearly states what happens with dropped events:

Some transforms will drop events as part of their normal operation. For example, the dedupe transform will drop events that duplicate another recent event in order to reduce data volume. When this happens, the event will be considered as having been delivered with no error.

Therefore, in terms of documentation/behaviour, Vector seems to just be operating as designed :). I wouldn't classify this issue as a bug but an enhancement or feature request instead.

I was previously reading this other documentation page, which seems not so complete as above, and thus raising my questions in Discord: https://vector.dev/docs/about/under-the-hood/architecture/end-to-end-acknowledgements/

That being said, I think good enhancements as proposed are:

  • https://github.com/vectordotdev/vector/issues/12217#issuecomment-1099544483 : This makes a lot of sense because enabling acks should also imply the user wanting more strict/controlled topologies. On the other hand, it might become annoying for users that know what they are doing. Especially when debugging.
  • https://github.com/vectordotdev/vector/issues/12217#issuecomment-1099551164 : I feel this suggestion can devolve into "configuration options creep". For the remap transform we already have reroute_dropped, drop_on_abort and drop_on_error. So I would instead simplify on managing the behaviour using these three. For the route transform we also now have the _unmatched output that can be routed. See next.
  • https://github.com/vectordotdev/vector/issues/12217#issuecomment-1099563743 : I also think this is a good idea. The user would then need to explicitly define a sink that "rejects messages for ack", i.e. a "rejector" sink where the user can redirect all messages that should be rejected upon desired conditions. This can couple very well with the existing remap and route options (see above). More importantly, this would force the user to make more explicit/clear topologies.

So in summary, I think the way to go would be to make sure that transforms that can drop messages to implement a dropped output like the remap transform and then provide a sink that can be used to wire up these outputs. The user then can have full control of what to reject and what to discard depending on how the topology is constructed.

hhromic avatar Apr 18 '22 11:04 hhromic

Thank you for your valuable input.

Therefore, in terms of documentation/behaviour, Vector seems to just be operating as designed :). I wouldn't classify this issue as a bug but an enhancement or feature request instead.

For clarification, there are two classes of discarded events: those that are discarded as a normal part of the transform's behavior, like described for dedupe or filter or the like, and those that are discarded as a result of an error. It is the latter that are primarily at issue here while the documentation describes the former.

bruceg avatar Apr 18 '22 15:04 bruceg

@bruceg ah that's a good clarification indeed.

Then it looks like we should better define dropped as normal transform operation. For example, in the route transform, events sent to the _unmatched output are dropped as normal? Same in the remap transform, events sent to the *.dropped output (for example via abort) are dropped as normal? In the case of remap is tricky because the programmer currently has no way to express a dropped as normal intention.

hhromic avatar Apr 18 '22 16:04 hhromic

in the route transform, events sent to the _unmatched output are dropped as normal? Same in the remap transform, events sent to the *.dropped output (for example via abort) are dropped as normal?

I think these are where one of @jszwedko's suggestions would come into play, where we would require all outputs to be consumed by another component at a configuration level, in which case they are not dropped at all.

bruceg avatar Apr 18 '22 16:04 bruceg

Yes, indeed. Especially the suggestion about adding configurable ack behaviour to the blackhole sink. In this way the user can explicitly configure how the acks should be done via topology construction.

hhromic avatar Apr 18 '22 17:04 hhromic

Just updating the status. I realize the lines are blurry here, but I think this is a bug given normal user expectations for this feature.

jszwedko avatar Apr 19 '22 18:04 jszwedko

I caught this error and the consuming from Kafka totally stopped:

2022-08-26T03:18:28.849503Z ERROR source{component_kind="source" component_id=kafka_in_be component_type=kafka component_name=kafka_in_be}: vector::internal_events::kafka: Event received a negative acknowledgment, topic has been stopped. error_code="negative_acknowledgement" error_type="acknowledgment_failed" stage="sending" topic="log.java_standard_log" partition=1 offset=12277332826

I build from newest master version these days, does it related to this bug?

honganan avatar Aug 26 '22 06:08 honganan

Hi @honganan ! What you see there is a change to have Vector avoid acknowledging the event in Kafka if it failed to successfully process it. This change was made to avoid data loss by letting operators intervene to fix the issue that caused the failure.

jszwedko avatar Aug 26 '22 15:08 jszwedko

Hi @honganan ! What you see there is a change to have Vector avoid acknowledging the event in Kafka if it failed to successfully process it. This change was made to avoid data loss by letting operators intervene to fix the issue that caused the failure.

After reconsidering this behavior in light of a newly codified UX principle some planned work on error handling, we've decided to revert it for v0.24.0.

jszwedko avatar Aug 29 '22 19:08 jszwedko

Hi @honganan ! What you see there is a change to have Vector avoid acknowledging the event in Kafka if it failed to successfully process it. This change was made to avoid data loss by letting operators intervene to fix the issue that caused the failure.

After reconsidering this behavior in light of a newly codified UX principle some planned work on error handling, we've decided to revert it for v0.24.0.

That would be good for UX.

honganan avatar Aug 30 '22 02:08 honganan

weird workaround i found for this, use reroute_dropped: true on the remap that drops events, then create an invalid sink that rejects everything, such as:

error:
  type: file
  path: ""  # intentionally invalid
  encoding:
    codec: json
  acknowledgements:
    enabled: true
  inputs:
    - events_parsed.dropped

this will reject all acks (with 500 internal server error for http source)

frankh avatar Aug 02 '24 16:08 frankh