
[PRO] Subscription Topics dashboard index appears to be out of date

Open bolandrm opened this issue 8 months ago • 9 comments

Describe the bug

Sometimes when I click on a topic that I see listed in the dashboard, the list of subscriptions is blank. Then when I navigate back out to the index and refresh, that topic is gone from the index. It seems like the act of viewing the topic refreshes (or cleans up) the data somehow?

https://github.com/user-attachments/assets/a34872f6-1c47-4bf0-808c-e0772d8cc90f

I am pretty sure these aren't just cycling out on their own while I'm looking at the page, because I can go back several pages and click on one to reproduce the behavior.

NOTE: we're using the Ably integration.

Versions

graphql version: graphql-pro (1.29.2)
rails (or other framework): 7.2.1

Steps to reproduce

  1. Click on various topics in the subscription index.
  2. Sometimes, when you land on the topic page, it will be blank, with no subscriptions listed.

Expected behavior

The topic should not be listed if there are no longer any subscribers.

Actual behavior

The topic is appearing in the list despite having no subscribers (I think!)

Additional context

This may not be a huge problem, but I want to have confidence that the system is working properly and it's not attempting to send messages to old/unused topics.

bolandrm avatar Feb 27 '25 19:02 bolandrm

Also, a question: what does the Subscribed? column on the topic details page indicate?

bolandrm avatar Feb 27 '25 19:02 bolandrm

Hey, thanks for the detailed report and sorry for the confusing experience here.

Yes, loading a topic runs a still_subscribed? check which can cause the data to be cleaned up. The subscription is supposed to clean itself up, but obviously it isn't -- something is being left behind. I've heard one other report of this happening. Unfortunately we never nailed down exactly why this data was being left behind (I would like to!), but we did produce a script for cleaning up this leftover data: https://gist.github.com/rmosolgo/26a47ce75321b34683b46aa2b630cd53. The other report involved Pusher, not Ably, so I think the problem must be in GraphQL-Pro somewhere 😖

Subscribed? checks the transport service (Ably, in this case) to see if it has active connections for this topic. It should pretty much always be "YES", but it's helpful confirmation when debugging.
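
For intuition, the check is roughly of this shape, sketched here with the ably gem's REST client. This is an illustration only, not GraphQL-Pro's internal code; the channel name and API key below are placeholders.

require "ably"

# Illustration only -- not GraphQL-Pro's internal implementation.
# Asks Ably whether a channel currently has any present members.
ably = Ably::Rest.new(key: ENV["ABLY_API_KEY"])           # placeholder credentials
channel = ably.channels.get("some-subscription-channel")  # placeholder channel name
present_members = channel.presence.get.items

puts(present_members.any? ? "Subscribed? YES" : "Subscribed? NO")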

As far as the data issue goes, you could give that script a try to clean up the stale data. I'd also love to figure out how it's getting left behind. If you're interested, maybe we could add some logging to the subscription system to try to track it down.

rmosolgo avatar Feb 28 '25 11:02 rmosolgo

Hey, appreciate the response. Follow-up questions:

  1. Is this check performed prior to the server doing work to build and send triggered events?
  2. Should I run this cleanup script once a day or something?

Also, I'm willing to add logging to help you debug; just let me know how to proceed.

bolandrm avatar Feb 28 '25 13:02 bolandrm

  1. Yes, this check is performed before handling any triggers. (The goal there is to avoid sending updates when nobody is listening... Besides leftover data, that can also happen because of expired data in Redis or partly-finished data changes in another thread.)
  2. Yes, you could run this periodically to clean up leftover data. Daily wouldn't hurt; the leftover data doesn't interfere with Subscriptions running, and the only downside I can think of is that Redis would fill with junk. But it looks like you've got plenty of room, assuming a few GB of storage. (See the scheduling sketch below.)
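
If you do want to automate it, here's a rough sketch of a daily run. The rake task name and the GraphqlSubscriptionCleanup wrapper are hypothetical placeholders for however you drop in the gist's script (which isn't reproduced here); the schedule uses the whenever gem, but plain cron or sidekiq-cron would work just as well.

# lib/tasks/graphql_subscription_cleanup.rake
# Hypothetical wrapper -- paste in (or require) the cleanup logic from the gist above.
namespace :graphql do
  desc "Remove leftover GraphQL-Pro subscription data from Redis"
  task cleanup_stale_subscriptions: :environment do
    GraphqlSubscriptionCleanup.run # placeholder entry point for the gist's script
  end
end

# config/schedule.rb (whenever gem) -- run it once a day
every 1.day, at: "4:30 am" do
  rake "graphql:cleanup_stale_subscriptions"
end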

Thanks for your willingness to help track this down -- I'll write up a little patch soon and follow up here.

rmosolgo avatar Mar 01 '25 13:03 rmosolgo

Not super urgent on the debugging, btw; I've added the cleanup script to our project.

Question though, could you explain the difference between the cleanup script and stale_ttl_s?

bolandrm avatar Mar 05 '25 12:03 bolandrm

Derp, I should have mentioned it -- I think a long stale_ttl_s would accomplish the same thing. I suggested the script because, if data was already written to Redis without a TTL, the script will clean it up, whereas configuring stale_ttl_s only keeps new data from accumulating in the future. You just have to set it to a big enough value that you're sure the client really isn't still using the subscription.
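
For reference, here's a minimal sketch of what that configuration looks like, assuming the usual GraphQL-Pro Ably setup; the Redis and Ably client construction is illustrative, so keep whatever you already have.

class MySchema < GraphQL::Schema
  use GraphQL::Pro::AblySubscriptions,
    redis: Redis.new(url: ENV["REDIS_URL"]),         # illustrative client setup
    ably: Ably::Rest.new(key: ENV["ABLY_API_KEY"]),  # illustrative client setup
    stale_ttl_s: 60 * 60 * 24                        # expire leftover rows after 24 hours
end

Pick a value comfortably longer than any real subscription's lifetime so you don't expire data a client is still using.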

rmosolgo avatar Mar 05 '25 21:03 rmosolgo

The interesting thing is that we had stale_ttl_s set to 12 hours (60 * 60 * 12). And in the video, you can see by comparing the timestamps that entries older than 12 hours are still present.

bolandrm avatar Mar 06 '25 12:03 bolandrm

Hypothesis:

  1. We are setting stale_ttl_s
  2. Imagine that, for whatever reason, the presence webhook doesn't work as expected.
  3. The subscription key in redis expires
  4. The gql:sub:alltopics subscription list is not recomputed, because there was no event/opportunity to recompute it (??).

This may be why we're seeing stale entries on the dashboard.
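
One low-risk way to check that hypothesis is to list the subscription-related keys and their TTLs from a Rails console and compare them against what the dashboard shows. A minimal read-only sketch, assuming the redis-rb gem and that everything lives under the gql:sub: prefix seen in the key name above (adjust the pattern if your keys differ):

redis = Redis.new(url: ENV["REDIS_URL"]) # or reuse the client your subscriptions are configured with

# List every subscription-related key with its type and TTL.
# ttl == -1 means "no expiration set"; ttl == -2 means the key is already gone.
redis.scan_each(match: "gql:sub:*") do |key|
  puts format("%-70s type=%-6s ttl=%s", key, redis.type(key), redis.ttl(key))
end

If the hypothesis is right, you'd expect to see topic entries hanging around after every subscription key they point to has expired.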

bolandrm avatar Mar 06 '25 14:03 bolandrm

Ok, interesting ... Yes, that seems like it could happen. I had written it off in my previous debugging because there was no evidence in application logs or Pusher's logs that the webhook had failed. But you said "doesn't work as expected," and maybe that's it.

I'm remembering a time when I added Sidekiq to an application and we had to change all of our after_save hooks to after_commit because otherwise, Sidekiq jobs would enqueue, start, and finish before the save process had actually finished ... and that would also "not work as expected" 😅 after_commit fired only once the data was actually committed, and that made the web process/jobs process dance work as expected.
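
For anyone following along, that pattern looks roughly like this (the model and job names are made up for the example):

class Order < ApplicationRecord
  # Problematic: enqueues inside the transaction, so the job can start
  # before the row is actually committed and visible to the worker.
  # after_save { OrderSyncJob.perform_later(id) }

  # Safer: enqueue only once the transaction has committed.
  after_commit :enqueue_sync, on: [:create, :update]

  private

  def enqueue_sync
    OrderSyncJob.perform_later(id)
  end
end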

Maybe the same thing is possible here: Ably is sending the webhook, the app is receiving it, and code is running without an error, but it's just not working.

Here's a patch to debug delete_subscription. It adds a couple of Redis round-trips, but eyeballing the scale of your current subscription setup, that should be alright. If I'm mistaken and you think the performance could be a problem, let me know and I can work to integrate this logging data into the main delete_subscription Lua script. Here it is:

debug_graphql_subscriptions.rb

# config/initializers/debug_graphql_subscriptions.rb
# frozen_string_literal: true

module DebugDeleteSubscription
  def delete_subscription(subscription_id, subscription: nil)
    # Gather some data before deleting:
    sub_key = subscription_key(subscription_id)
    sub_data = with_redis { |r| r.get(sub_key) }
    if sub_data && sub_data != ""
      sub_existed = true
      data = JSON.parse(sub_data)
      topics = data["topics"]
    else
      sub_existed = false
      topics = nil
    end

    # Do the delete operation:
    super

    # Log the previous state and the current state:
    Rails.logger.info {
      # Guard for the case where the subscription data was already gone
      # and there's no topic to inspect:
      topic = topics&.first
      topic_still_exists = false
      topic_remaining = -1
      if topic
        key = topic_key(topic)
        with_redis do |r|
          topic_still_exists = r.exists?(key)
          topic_remaining = topic_still_exists ? r.scard(key) : -1
        end
      end
      "Gql Sub Del: id=#{subscription_id} existed=#{sub_existed} topic=#{topic} topic_remaining=#{topic_remaining} topic_exists=#{topic_still_exists}"
    }
  end
end

GraphQL::Pro::Subscriptions::RedisStorage.prepend(DebugDeleteSubscription)

I'm hoping this will appear in logs right near the info about the incoming webhook request, which will indicate how that turned out.

I'd expect those logs to all have existed=true, but if there's a race condition, they might have existed=false, which would mean that the presence webhook got there before (?!) the Redis state was ready. I'm not sure how that could happen.

Then, I'd expect any lines with topic_remaining=-1 to also have topic_exists=false. I wouldn't expect any to have topic_remaining=0, because the super call is supposed to remove any topic keys that have zero members. (But somehow that is happening.)
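
To make that concrete, here's the kind of line the patch would produce (the ids and topic names below are made up for illustration, matching the format string in the patch above):

Gql Sub Del: id=abc123 existed=true topic=postUpdated:42 topic_remaining=-1 topic_exists=false
  (expected: the subscription was found and its topic was fully cleaned up)
Gql Sub Del: id=def456 existed=false topic= topic_remaining=-1 topic_exists=false
  (possible race: the webhook arrived before the subscription data was readable)
Gql Sub Del: id=ghi789 existed=true topic=postUpdated:7 topic_remaining=0 topic_exists=true
  (the "shouldn't happen" case: an empty topic key was left behind)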

Let me know what you think about putting that in production and then checking some output after a little while.

rmosolgo avatar Mar 07 '25 12:03 rmosolgo