electric icon indicating copy to clipboard operation
electric copied to clipboard

fix: Ensure shape cleanup after consumer unexpected shutdown

Open msfstef opened this issue 8 months ago • 1 comments

Closes https://github.com/electric-sql/electric/issues/2597

The infinite loop would occur in the case where a consumer would not longer exist but the ShapeStatus table still had an entry for the corresponding shape.

After going through the code and consulting with @icehaunter as well, the issue was that there are cases where the internal cleanup run by the consumer itself in the terminate callback would not always occur - the only surefire way to go about this is to monitor the consumers and trigger cleanups when unexpected shutdowns occur.

I initially added two failing tests covering this situation, which do indeed get fixed by introducing this monitoring + cleanup of shape consumers in the ShapeCache.

Technically we should be able to get rid of the cleanups in the terminate callback in the consumer now, but I figured that whenever the consumer can cleanup after itself it's less load on the shape cache process, and since the shape cache will only attempt a cleanup if the ETS entry is still present (i.e. the consumer failed to clean itself up), I decided to keep both in. Also for the sake of fewer changes and simpler review, I can create a new PR to clean up the unnecessary code and fix tests appropriately.

I introduced an asynchronous publication cleanup call as well, as there is no reason to block the shape cache process which is a central one waiting for the publication update.

I also cleaned up a few things to make use of the behaviours we've defined and fixed some dialyzer/type issues in the process.

Finally, I added a 50ms wait when retrying await_snapshot_start - we should no longer get into this infinite loop situation since this fix is trying to address the root cause, but in case there's more edge cases we at least avoid busy waiting and taking up too much CPU time. Not adding finite retries yet however as we should never need them - let the request time out.

msfstef avatar Apr 29 '25 14:04 msfstef

Based on a discussion with @robacourt I've created a separate ShapeCleaner process that handles the cleanups, and opened a followup PR https://github.com/electric-sql/electric/pull/2663 that removes the cleanup logic from everywhere else.

msfstef avatar Apr 30 '25 14:04 msfstef

@robacourt @icehaunter if you can also review https://github.com/electric-sql/electric/pull/2663 along with this PR - can be merged separately or I can merge them together before going to main

msfstef avatar May 05 '25 10:05 msfstef

@msfstef looks like we've been accidentally working on similar problems and arrived at similar solutions https://github.com/electric-sql/electric/pull/2672. I'm happy to combine them once this is ready

magnetised avatar May 05 '25 15:05 magnetised

Closing in favour of https://github.com/electric-sql/electric/pull/2672 - please do have a look at that one instead

msfstef avatar May 12 '25 16:05 msfstef