commanded
Skipped event with concurrency enabled
I have the following setup:
- PostgreSQL is used as the event store, with 150 events in the events table.
- A TestHandler is created with concurrency: 5.
- TestHandler is made to crash at event 40.
- TestHandler is registered in the supervisor with restart: :temporary.
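The setup above might look roughly like the following handler (a sketch only: it assumes Commanded's `concurrency` option and the `:event_number` metadata key, and the `MyApp` module names are hypothetical):

```elixir
# Sketch of the reproduction setup, assuming a project using Commanded
# with the PostgreSQL EventStore adapter. Module names are hypothetical.
defmodule MyApp.TestHandler do
  use Commanded.Event.Handler,
    application: MyApp.Application,
    name: "TestHandler",
    concurrency: 5

  # Deliberately crash when the 40th event arrives.
  def handle(_event, %{event_number: 40}), do: raise("boom at event 40")
  def handle(_event, _metadata), do: :ok
end

# Registered in the application supervisor with restart: :temporary,
# so a crashed handler is not restarted automatically, e.g.:
#
#   Supervisor.child_spec({MyApp.TestHandler, []}, restart: :temporary)
```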
Observed behavior:
- error is logged
- 2 pids died
- the handler entry in the subscriptions table has last_seen = 150
- when restarting the app/handler, everything runs fine, but the failed event is skipped
What is the course of action to avoid skipping events? What happens if I deploy while the handlers are processing events? A deploy may also kill pids, and events are loaded based on the last_seen value from the event store subscriptions table.
Are you able to reproduce this easily? Could you share an example or test case?
Are you sure the effect of running TestHandler wasn't applied? Maybe the event was handled by one of the other 4 instances of your handler?
Here is an example project that replicates what you are seeing:
https://github.com/drteeth/commanded_issue_556
Run the tests to see the failure.
My question is how should this work? Should the remaining handlers stop? Should the instance that failed start from 5 when it starts back up?
I'm thinking the behavior for the partition of the failed handler should be the same as for a single-threaded handler, e.g. it does not process any further events. Other handlers can continue processing the events corresponding to their partitions.
When the code issue is fixed and the failed partition handler restarted, it should resume from where it left off.
Can you confirm for me that the failure in the handler is permanent? Meaning this is not a temporary failure, and no amount of retries is going to fix it? Am I assuming correctly here?
Other handlers can continue processing the events corresponding to their partitions.
What would this mean though?
If I had 2 concurrent handlers, the first processing odd events, the second even ones, when the first one encounters an error and ultimately dies, should the second one still only process even events? Should it take over from the failed one? Presumably not, as it would also die.
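The odd/even split in this example can be modelled directly (a toy illustration of partitioning by event number, not Commanded's actual routing):

```elixir
# Toy model of the odd/even example: two partitions, events routed by
# rem/2. Illustrative only, not Commanded's routing implementation.
concurrency = 2
partition_for = fn event_number -> rem(event_number, concurrency) end

Enum.group_by(1..6, partition_for)
# => %{0 => [2, 4, 6], 1 => [1, 3, 5]}
```

If the odd partition dies on one of its events, the even partition could in principle keep going, but taking over the odd events would just hit the same poison event.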
The event store tracks the last seen event per subscription, not per partition. This is mostly to support dynamic partition sizing where you can adjust the concurrency over time.
If an event cannot be processed within one partition then the last seen event checkpoint should not move past that event. On restart the same problematic event should be retried, along with any later events that may already have been processed by other partitions. This is the at-least-once guarantee which can mean events may be processed more than once.
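The checkpoint rule described here can be sketched as follows: the subscription's checkpoint may only advance to the lowest event number that every partition has fully processed (a toy model of the rule, not the event store's implementation):

```elixir
# Toy model of the at-least-once checkpoint rule: with partition 0
# stuck at event 40 and partition 1 already at event 150, the
# subscription checkpoint must not move past the failed event.
partition_positions = %{0 => 39, 1 => 150}

checkpoint = partition_positions |> Map.values() |> Enum.min()
# checkpoint == 39: the last event every partition has processed.

# On restart, delivery resumes at checkpoint + 1, so events 40..150
# are redelivered; events already handled by partition 1 are seen
# again, which is exactly the at-least-once guarantee.
resume_from = checkpoint + 1
# resume_from == 40
```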
My example uses the in-memory store. I think it should use the PG adapter to be worth anything. I'll try to update it to see if that changes things.
Updated my example to use PG adapter and to expect the subscription to not pass the failed event.
@slashdotdash here's a failing test for the issue: https://github.com/drteeth/commanded/commit/e43c89925dec0bc5798f8e7cdb305db301f2702c
Can you confirm for me that the failure in the handler is permanent? Meaning this is not a temporary failure, and no amount of retries is going to fix it? Am I assuming correctly here?
Yes, I confirm the failure is permanent, meaning it would need a code change to process the event successfully.
Other handlers can continue processing the events corresponding to their partitions.
What would this mean though?
If I had 2 concurrent handlers, the first processing odd events, the second even ones, when the first one encounters an error and ultimately dies, should the second one still only process even events? Should it take over from the failed one? Presumably not, as it would also die.
It is probably a nice to have to continue processing the partitions where it can.
Sorry for the late reply, you're moving too fast.
We ran into this as well, but from the thought experiment of 'what if we wanted to change a handler's concurrency from 2 to 4?'. We figured that one would have to 'barrier' at some event number, i.e. tell all partitions to stop at some future event number, wait for any slower handlers to catch up, and then toggle the whole handler off and back on again with the new concurrency config. When looking through the code we saw that there was probably more to do, as examined in this issue.
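The 'barrier' idea can be sketched as a simple catch-up check before restarting with the new concurrency (illustrative only; this is not an existing Commanded feature):

```elixir
# Toy model of the barrier scheme: pick a barrier event number, let
# every partition run up to it, and only restart with the new
# concurrency once the slowest partition has caught up.
barrier = 100
positions = %{0 => 97, 1 => 100}  # hypothetical per-partition progress

caught_up? = Enum.all?(positions, fn {_partition, pos} -> pos >= barrier end)
# caught_up? == false here: partition 0 is still behind the barrier,
# so the handler cannot yet be stopped and restarted with concurrency: 4.
```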
If the partitioning key is the same, is it possible to just start the event handler at the minimum of all the handlers' 'last seen' events? Then partitioned handlers that receive old events would be able to skip over them as normal, because of their own 'last seen' number.
Edit: Oh, I thought that concurrency created multiple entries in the subscriptions table, each with their own last_seen. I assumed something along those lines in my comments above.
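For illustration only, the min-of-last-seen scheme suggested above looks like this under the per-partition last_seen assumption (which, per the edit, Commanded does not actually make):

```elixir
# Toy model of the suggestion: resume the whole handler from the
# minimum of all partitions' last_seen, and let each partition skip
# redelivered events it has already recorded. Assumes hypothetical
# per-partition checkpoints, which the real subscriptions table lacks.
last_seen = %{0 => 39, 1 => 150}

resume_from = last_seen |> Map.values() |> Enum.min() |> Kernel.+(1)
# resume_from == 40

# A partition skips any redelivered event at or below its own checkpoint:
skip? = fn event_number, partition ->
  event_number <= Map.fetch!(last_seen, partition)
end
# skip?.(100, 1) == true  (partition 1 already saw event 100)
# skip?.(40, 0)  == false (partition 0 must reprocess event 40)
```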