skyfall

Periods of lag observed on 7th November 2024

Open godfrey-altmetric opened this issue 1 year ago • 2 comments

Hi @mackuba.

We've been running a connection to the bsky.network firehose with this gem for a few weeks now, and our consumer manages to keep up with the firehose events in near real-time. As of yesterday, 7th Nov 2024, we noticed at least 2 periods where our consumer started lagging, falling behind with a peak delay of ~9600s/2.7hrs:

[Screenshot 2024-11-08 at 14:23:05]

So far today we have not observed any delay, so this may have been a one-time event, as Bluesky have noted increased activity in the last few days since the election. However, we've been advised by Bluesky that they did not detect any significant delays in the relays or the firehose on their side.

Is this something that you, or other users of the gem, have also noticed?

Update: Are there perhaps any external API dependencies (besides the main firehose) that are called within the gem that could help explain this?
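For reference, a delay figure like the one quoted above can be derived by comparing each event's own timestamp with the consumer's clock. The following is only a minimal sketch: `consumer_lag` and `format_lag` are hypothetical helpers, not part of the gem, the timestamps are made up for illustration, and it assumes each event carries an ISO 8601 timestamp.

```ruby
require "time"

# Hypothetical helper (not part of skyfall): seconds between the moment an
# event was emitted and the moment the consumer processed it.
# Assumes event_time is an ISO 8601 string.
def consumer_lag(event_time, now: Time.now)
  now - Time.iso8601(event_time)
end

# Hypothetical helper: formats a lag in seconds and hours for a log line.
def format_lag(seconds)
  "#{seconds.round}s (#{(seconds / 3600.0).round(1)}h)"
end

# Illustrative values only, not taken from real firehose data.
processed_at = Time.iso8601("2024-11-07T17:00:00Z")
lag = consumer_lag("2024-11-07T14:20:00Z", now: processed_at)
puts format_lag(lag)
```

Logging this value periodically makes it easier to tell a consumer that is processing events too slowly (lag grows gradually) from one that has stopped receiving events altogether.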

godfrey-altmetric avatar Nov 08 '24 14:11 godfrey-altmetric

Hey, sorry I forgot to reply…

I did actually have some issues on that day, and someone else did too, see thread here: https://bsky.app/profile/mackuba.eu/post/3laeiuzz2fw26 - but I don't know if they also use this library (although their avatar says "Ruby" :)

Generally there have been tons of issues in the past couple of weeks, but I think everyone had those, since the relay as a whole was crashing… I'm hoping that things will get better now. I generally didn't have many disconnection issues with this setup until the Brazil wave in September, and then only on days with very high traffic (though now that means all days…).

Two things that could help:

  1. There's an optional "heartbeat" feature that I'd been testing for months on a branch and that was released in 0.4. It monitors whether any new events have arrived in some time and forces a reconnect if not. For me this generally solves or at least counteracts most issues that the normal auto-reconnect doesn't handle, except whatever has been happening with the relay this week, but hopefully that's fixed now. It needs to be enabled using the check_heartbeat flag, see the latest docs here: https://github.com/mackuba/skyfall?tab=readme-ov-file#reconnection-logic. I think this would have reconnected in this case, since if I read this chart correctly, you just weren't receiving anything since ~14:20.

  2. Switching to a Jetstream source, support added in 0.5 - Jetstream uses less bandwidth and I think it's generally been more stable.

I don't know exactly what the source of this issue is, where the events just stop coming but the socket doesn't disconnect: whether it's something in the server's implementation, or in one of the client libraries I'm using, or some incompatibility between the two, or just something that websockets do sometimes… But it's been happening occasionally, which is why I added this heartbeat thing. During the spring and summer it was happening so rarely that it was hard to test, because sometimes it wasn't triggered for a month or more. It only started happening more often with the increase in traffic in the autumn.
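To illustrate the general idea behind a heartbeat check (this is not the gem's actual implementation, just a plain-Ruby sketch of the technique), something like the following tracks when the last event arrived and signals a stall once the silence exceeds a timeout. The `StallWatchdog` class, its method names and the injectable `clock` are all hypothetical.

```ruby
# Hypothetical watchdog, not part of skyfall: detects a connection that is
# still open but has stopped delivering events.
class StallWatchdog
  attr_reader :last_event_at

  # timeout: seconds of silence tolerated before the stream counts as stalled.
  # clock:   callable returning the current time in seconds (injectable so the
  #          logic can be tested without sleeping).
  def initialize(timeout:, clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @timeout = timeout
    @clock = clock
    @last_event_at = @clock.call
  end

  # Call this from the message handler every time an event arrives.
  def touch
    @last_event_at = @clock.call
  end

  # Seconds elapsed since the last event.
  def silence
    @clock.call - @last_event_at
  end

  def stalled?
    silence > @timeout
  end

  # Call this periodically from a timer. If the stream has stalled, the timer
  # is reset and the block (e.g. a reconnect) is run. Returns true when the
  # block was triggered, false otherwise.
  def check
    return false unless stalled?
    touch
    yield if block_given?
    true
  end
end
```

In a real consumer, `touch` would be called from the message callback and `check` from a periodic timer, with the block closing and reopening the websocket. Resetting the timer inside `check` avoids triggering a second reconnect while the first one is still in progress.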

mackuba avatar Nov 23 '24 16:11 mackuba

No worries @mackuba, thanks for the above info. It does correlate with what we observed, but since that day we've not seen anything of the same magnitude in terms of delays and disconnections.

godfrey-altmetric avatar Nov 25 '24 09:11 godfrey-altmetric