feat(electric): Increase the ping heartbeat to 20 seconds from 5 seconds
After some discussion with @kevin-dp we are under the assumption that 5 seconds of ping timeout is quite low. Other popular websocket libraries like https://socket.io use a ping interval of 25 seconds and a ping timeout of 20 seconds. The timeout default changed from 5 seconds to 20 seconds over the years.
On top of this change, we noticed that the scheduler can sometimes undershoot. That is, if we schedule after 5000 ms, we could then obtain a last_msg_diff of 4999.7 ms, which wouldn't match the > operator. Because of that, a small margin error of 500 microseconds is added.
That would prevent obtaining uneven ping durations -> 20s, 20s, 40s, 20s, 40s,... (That's because if it doesn't match, we will schedule for another 20 seconds)
On top of this change, we noticed that the scheduler can sometimes undershoot. That is, if we schedule after 5000 ms, we could then obtain a
last_msg_diffof 4999.7 ms, which wouldn't match the>operator.
This is very suspicious. Are you sure it's not something else? Like, for example, an incoming message that's processed right after a new ping has been scheduled, causing the last_time_msg field to be updated. Timers can fire with a slight delay but they should never fire early.
By the way, taking a step back and squinting our eyes, we don't need to know the exact time difference. We know that the {:timeout, :ping_timer} message is processed once every 20 seconds. In order to decide whether to disconnect the client, we just need to know if any messages have arrived within that interval, either regular messages or a pong.
So instead of introducing an arbitrary 500-microsend timing margin, I would instead set the last_msg_time field to an atom like :ping_received and match on that in the handle_info() handler.
@icehaunter any objections?
@alco If I log the diff this is what I get. Most ones will be a bit over the target duration, but it can get a bit short sometimes. But as you say, it might be something else.
@davidmartos96 @icehaunter Here's an alternative approach that, instead of adding a margin of error to timer firings, keeps a separate timestamp for sent pings. Every time a ping interval elapses, we check if any messages have arrived from the client after the previous ping had been sent. This way it doesn't matter whether the timer fires slightly earlier or slightly later.
@alco That looks easier, thanks! Should I pull those changes in, or can you commit on this branch?
A bit on topic with the pings, we found an odd behavior with the inactive websocket disconnection. I reported it the other day on Discord, but I don't know if you guys have it tracked on Linear. https://discord.com/channels/933657521581858818/1240698235262337103/1240698237992833065
@icehaunter any comments?
@alco no comments, I prefer your approach, GJ to both of you
@davidmartos96 I have pulled your changes into my branch and opened a new PR based on that - https://github.com/electric-sql/electric/pull/1334. Thanks for the contribution!
A bit on topic with the pings, we found an odd behavior with the inactive websocket disconnection. I reported it the other day on Discord, but I don't know if you guys have it tracked on Linear. https://discord.com/channels/933657521581858818/1240698235262337103/1240698237992833065
I have filed this as a GH issue here - https://github.com/electric-sql/electric/issues/1335.