graphql-ruby
[PRO] Subscription Topics dashboard index appears to be out of date
Describe the bug
Sometimes when I click on a topic that I see listed in the dashboard, the list of subscriptions is blank. Then when I navigate back out to the index and refresh, that subscription is gone. It seems like the act of viewing the topic refreshes the data or something?
https://github.com/user-attachments/assets/a34872f6-1c47-4bf0-808c-e0772d8cc90f
I'm pretty sure these subscriptions aren't just expiring on their own while I'm looking at the page, because I can go back several pages and click on one to reproduce the behavior.
NOTE: we're using the Ably integration.
Versions
graphql version: graphql-pro (1.29.2)
rails (or other framework): 7.2.1
Steps to reproduce
- Click on various topics in the subscription index.
- Sometimes when you land on the topic page, it will be blank and have no subscriptions listed.
Expected behavior
The topic should not be listed if it no longer has any subscribers.
Actual behavior
The topic is appearing in the list despite having no subscribers (I think!)
Additional context
This may not be a huge problem, but I want to have confidence that the system is working properly and it's not attempting to send messages to old/unused topics.
Also, a question: what does the Subscribed? column on the topic details page indicate?
Hey, thanks for the detailed report and sorry for the confusing experience here.
Yes, loading a topic runs a still_subscribed? check which can cause the data to be cleaned up. The subscription is supposed to clean itself up, but obviously it isn't -- something is being left behind. I've heard one other report of this happening. Unfortunately we never nailed down exactly why this data was being left behind (I would like to!), but we did produce a script for cleaning up this leftover data: https://gist.github.com/rmosolgo/26a47ce75321b34683b46aa2b630cd53. The other report involved Pusher, not Ably, so I think the problem must be in GraphQL-Pro somewhere 😖
Subscribed? checks the transport service (Ably, in this case) to see if it has active connections for this topic. It should pretty much always be "YES", but it's helpful confirmation when debugging.
As far as the data issue goes, you could give that script a try to clean up the stale data. I'd also love to figure out how it's getting left behind. If you're interested, maybe we could add some logging to the subscription system to try to track it down.
Hey, appreciate the response. Follow-up questions:
- Is this check performed prior to the server doing work to build and send triggered events?
- Should I run that cleanup script once a day or something?
Also, I'm willing to add logging to help you debug -- just let me know how to proceed.
- Yes, this check is performed before handling any triggers. (The goal there is to avoid sending updates when nobody is listening... Besides leftover data, this might happen from expired data in Redis or partly-finished data changes in another thread.)
- Yes, you could run it periodically to clean up leftover data. Daily wouldn't hurt; the leftover data doesn't interfere with subscriptions running -- the only downside I can think of is that Redis would fill with junk, and it looks like you've got plenty of room there, assuming a few GB of storage.
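If it helps, here's one way to run it on a schedule -- just a sketch, assuming you wrap the gist's logic in something callable (CleanupStaleSubscriptionData is a made-up name); invoke it from cron, Heroku Scheduler, or whatever you already use:

# lib/tasks/graphql_subscriptions.rake
namespace :graphql_subscriptions do
  desc "Clean up leftover GraphQL-Pro subscription data in Redis"
  task cleanup_stale: :environment do
    # CleanupStaleSubscriptionData is a hypothetical wrapper around the gist's cleanup logic;
    # rename it to whatever you call that code in your app.
    CleanupStaleSubscriptionData.call
  end
end

Then a daily rake graphql_subscriptions:cleanup_stale keeps the junk from piling up.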
Thanks for your willingness to help track this down -- I'll write up a little patch soon and follow up here.
Not super urgent on the debugging btw, I've added the cleanup script to our project.
Question though, could you explain the difference between the cleanup script and stale_ttl_s?
Derp, I should have mentioned it -- I think a long stale_ttl_s would accomplish the same thing. I suggested the script because if data was already written to Redis without a TTL, the script will clean it up; configuring stale_ttl_s should keep it from accumulating in the future. You just have to set it to a value big enough that you're sure the client really isn't still using the subscription.
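For reference, stale_ttl_s goes wherever you configure the subscriptions plugin on your schema -- roughly like this (MySchema, the Redis/Ably setup, and the 12-hour value are placeholders; match them to your actual configuration):

class MySchema < GraphQL::Schema
  # stale_ttl_s puts a TTL on the Redis keys behind each subscription,
  # so abandoned entries expire on their own instead of accumulating.
  use GraphQL::Pro::AblySubscriptions,
    redis: Redis.new(url: ENV["REDIS_URL"]),
    ably: Ably::Rest.new(key: ENV["ABLY_API_KEY"]),
    stale_ttl_s: 60 * 60 * 12 # expire entries that haven't been touched in 12 hours
end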
The interesting thing is that we had stale_ttl_s set to 12 hours (60 * 60 * 12). And in the video, you can see by comparing the timestamps that there are entries that are older than 12 hours still present.
Hypothesis:
- We are setting stale_ttl_s.
- Imagine that, for whatever reason, the presence webhook doesn't work as expected.
- The subscription key in Redis expires.
- The gql:sub:alltopics subscription list is not recomputed, because there was no event/opportunity to recompute it (??).
This may be why we're seeing stale entries on the dashboard.
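For what it's worth, a rough console check along these lines could confirm whether topic sets are pointing at expired subscription keys. The key patterns are guesses extrapolated from gql:sub:alltopics, not the library's documented layout, so adjust them to whatever GraphQL-Pro actually writes in your Redis:

# Run from a Rails console; the gql:sub:topic: and gql:sub: prefixes are assumptions.
redis = Redis.new(url: ENV["REDIS_URL"])

redis.smembers("gql:sub:alltopics").each do |topic|
  subscription_ids = redis.smembers("gql:sub:topic:#{topic}")
  missing = subscription_ids.reject { |id| redis.exists?("gql:sub:#{id}") }
  if subscription_ids.empty? || missing.any?
    puts "#{topic}: #{subscription_ids.size} listed, #{missing.size} with no subscription record"
  end
end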
Ok, interesting ... Yes, that seems like it could happen. I had written it off in my previous debugging because there was no evidence in application logs or Pusher's logs that the webhook had failed. But you said "doesn't work as expected," and maybe that's it.
I'm remembering a time when I added Sidekiq to an application and we had to change all of our after_save hooks to after_commit because otherwise, Sidekiq jobs would enqueue, start, and finish before the save process had actually finished ... and that would also "not work as expected" 😅 Switching to after_commit meant the job didn't get enqueued until the data was really committed, and that made the web process/jobs process dance work as expected.
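To make that concrete, the fix looked something like this (model and job names are just for illustration):

class Order < ApplicationRecord
  # With after_save, the job is enqueued while the transaction is still open,
  # so a fast worker can run before the commit and see stale or missing data:
  # after_save { SyncOrderJob.perform_later(id) }

  # With after_commit, the job isn't enqueued until the data is actually
  # committed, so the worker always sees the finished write:
  after_commit { SyncOrderJob.perform_later(id) }
end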
Maybe the same thing is possible here: Ably is sending the webhook, the app is receiving it, and code is running without an error, but it's just not working.
Here's a patch to debug delete_subscription. It adds a couple of Redis round-trips, but eyeballing the scale of your current subscription setup, that should be alright. If I'm mistaken and you think the performance could be a problem, let me know and I can work to integrate this logging data into the main delete_subscription Lua script. Here it is:
debug_graphql_subscriptions.rb
# config/initializers/debug_graphql_subscriptions.rb
# frozen_string_literal: true

module DebugDeleteSubscription
  def delete_subscription(subscription_id, subscription: nil)
    # Gather some data before deleting:
    sub_key = subscription_key(subscription_id)
    sub_data = with_redis { |r| r.get(sub_key) }
    if sub_data && sub_data != ""
      sub_existed = true
      data = JSON.parse(sub_data)
      topics = data["topics"]
    else
      sub_existed = false
      topics = nil
    end

    # Do the delete operation:
    super

    # Log the previous state and the current state:
    Rails.logger.info {
      # If the subscription data was already gone, there's no topic to inspect;
      # the topic_* fields will be blank in that log line.
      topic = topics && topics.first
      topic_still_exists = nil
      topic_remaining = nil
      if topic
        key_for_topic = topic_key(topic)
        with_redis do |r|
          topic_still_exists = r.exists?(key_for_topic)
          topic_remaining = topic_still_exists ? r.scard(key_for_topic) : -1
        end
      end
      "Gql Sub Del: id=#{subscription_id} existed=#{sub_existed} topic=#{topic} topic_remaining=#{topic_remaining} topic_exists=#{topic_still_exists}"
    }
  end
end

GraphQL::Pro::Subscriptions::RedisStorage.prepend(DebugDeleteSubscription)
I'm hoping this will appear in logs right near the info about the incoming webhook request, which will indicate how that turned out.
I'd expect those logs to all have existed=true, but if there's a race condition, they might have existed=false, which would mean that the presence webhook got there before (?!) the Redis state was ready. I'm not sure how that could happen.
Then, I'd expect any lines with topic_remaining=-1 to also have topic_exists=false. I wouldn't expect any to have topic_remaining=0, because the super call is supposed to remove any topic keys that have zero members. (But somehow that is happening.)
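To illustrate with made-up values, healthy deletions should log lines like

Gql Sub Del: id=abc123 existed=true topic=orderUpdated:42 topic_remaining=3 topic_exists=true

(other subscribers remain on the topic), or topic_remaining=-1 with topic_exists=false (that was the last subscriber and the topic key is gone). The suspect cases are lines with existed=false, or with topic_exists=true and topic_remaining=0.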
Let me know what you think about putting that in production and then checking some output after a little while.