graphql-ruby-fragment_cache
Fragment cache sometimes returns null despite cached data loading initially
Hi there,
ENV: We are currently using your gem in our Rails API, which passes data through to an Apollo Angular app. We originally had a pure Rails solution but got better performance results with your gem.
Issue: The problem is that sometimes the cache returns null. This doesn't happen locally, but it does happen when we use Redis as the cache store on staging and production. The cached data is returned most of the time but sometimes comes back null. We have investigated as much as we could: we first thought it might be the front end, but we have narrowed it down to the cache returning null.
How to reproduce (this only happens some of the time):
- load the site the first time to populate the cache (works)
- hard refresh and reload the site (network results show that the cache is working)
- wait a while (5-10 minutes) and load the page again (returns null); the cache is set to expire after 2 hours, and checking Redis shows that the key is still present when this failure occurs
Question: Do you have any ideas on what might be causing this or what we might be able to do? Thank you in advance.
Hi @RudiBoshoff!
That sounds really strange, because the gem does the read in a dead simple way:
FragmentCache.cache_store.read(cache_key).tap do |cached|
  return NIL_IN_CACHE if cached.nil? && FragmentCache.cache_store.exist?(cache_key)
end
In this case FragmentCache.cache_store is a Redis adapter, and if you can see that the key is still there, the value should load without any issues. Since you know the key, could you please call FragmentCache.cache_store.read(cache_key) from the console and make sure the value is not nil?
Reading happens here; it looks like it does nothing more than a nil check.
Closing this for now, feel free to reopen 🙂
FWIW: we're seeing similar behavior.
Isn't there a race condition?
1. FragmentCache.cache_store.read(cache_key) returns nil for an uncached value.
2. Someone else populates the cache key.
3. FragmentCache.cache_store.exist?(cache_key) returns true (even though the value wasn't cached at step 1).
Thoughts? @DmitryTsepelev
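The interleaving above can be simulated with a minimal in-memory store standing in for the Redis adapter (ToyStore and the key below are made up for illustration; only the read/exist? check mirrors the gem's logic quoted earlier):

```ruby
# Minimal in-memory stand-in for FragmentCache.cache_store (hypothetical,
# for illustration only).
class ToyStore
  def initialize
    @data = {}
  end

  def read(key)
    @data[key]
  end

  def write(key, value)
    @data[key] = value
  end

  def exist?(key)
    @data.key?(key)
  end
end

NIL_IN_CACHE = :nil_in_cache

store = ToyStore.new
key = "fragment:user:1"

# Step 1: the key is not cached yet, so read returns nil.
cached = store.read(key)

# Step 2: another process populates the key between the two calls.
store.write(key, "real value")

# Step 3: exist? now returns true, so the gem's check
# (`cached.nil? && exist?`) wrongly concludes that nil was cached.
result = cached.nil? && store.exist?(key) ? NIL_IN_CACHE : cached
# result is NIL_IN_CACHE even though nil was never written to the cache.
```

The window between the read and the exist? call is exactly where a concurrent write flips the outcome.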
We've been doing some testing in production with the following
def value_from_cache
  return nil unless FragmentCache.cache_store.exist?(cache_key)

  FragmentCache.cache_store.read(cache_key).tap do |cached|
    return NIL_IN_CACHE if cached.nil? && FragmentCache.cache_store.exist?(cache_key)
  end
end
It seems to have addressed the race condition.
@RudiBoshoff @DmitryTsepelev I think we should reopen this
@gsdean same here. Can you explain how the workaround works? I understand the problem derives from fetching the cache multiple times, but is it safe to fetch the cache multiple times in that way?
The workaround isn't perfect. There is still a chance a race condition could cause a misread from the cache (specifically when an eviction occurs after the initial read and the cache is then repopulated). Basically, it reduces the risk of value_from_cache incorrectly returning nil from probable to less probable. I'll let you decide whether that is safe or not.
For what it's worth, we've tested both with and without this patch in a high-load, high-concurrency production environment; without this patch we see numerous intermittent errors where nil is returned incorrectly. With the patch we have yet to see an error.
Sorry guys, I don't know how, but GitHub does not send me events from some projects. I'll take a look tomorrow morning.
If I understand correctly, the problem is that the race condition makes the cache return nil even if the value is not nil anymore. This is only possible when it was nil before the cache was rewritten. I think this behavior is kind of acceptable, because the user does not get wrong data, just outdated data, and the tradeoff of a cache is fast data vs. fresh data.
I guess we could add it to the docs and call it a day, what do you think?
The cache returns the wrong value, not an outdated value. I do not believe documentation alone is sufficient. In a moderately concurrent deployment, this is a serious issue.
If I understand correctly, the problem is that the race condition makes the cache return nil even if the value is not nil anymore.
In my situation, the problem seems to be that the race condition makes the cache return nil even though the value has never been nil before.
In reality, there are only two possibilities: the cache key exists and holds a value, or the cache key does not exist. However, it behaves as if the key exists and the value is nil.
The only way to synchronize threads here is using some kind of lock: we could make it configurable and let users opt in. It would fix the issue completely but would slow things down a bit. WDYT?
Also, it sounds like we could change the behavior to store something other than nil in the cache. In that case we would have no problem differentiating the scenarios "cache contains nil" and "cache does not have this key".
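One way to sketch that idea: wrap nil in a sentinel at write time, so a single read can distinguish "key absent" from "nil was cached" with no exist? call and hence no race between the two. The names below are made up for illustration, not the gem's API:

```ruby
# Hypothetical sentinel-based sketch; not the gem's actual implementation.
NIL_SENTINEL = "__fragment_cache_nil__".freeze

def write_fragment(store, key, value)
  # Store the sentinel instead of nil so a later read can tell nil apart
  # from a missing key.
  store[key] = value.nil? ? NIL_SENTINEL : value
end

def read_fragment(store, key)
  cached = store[key]
  return :miss if cached.nil?          # key absent: caller runs the resolver
  return nil if cached == NIL_SENTINEL # nil was genuinely cached
  cached
end

store = {} # plain Hash standing in for the cache store
write_fragment(store, "a", nil)
write_fragment(store, "b", 42)

read_fragment(store, "a") # => nil (cached nil)
read_fragment(store, "b") # => 42
read_fragment(store, "c") # => :miss (never cached)
```

Because only one store call is made per read, there is no window for a concurrent write to change the answer between two calls.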
My vote would be to either
- Just take the hit on the extra read (as we've done in our workaround). This is a fairly pragmatic solution. Maybe start here and open an issue to revisit the extra read.
OR
- Avoid caching nil by default (as you've suggested). Perhaps we could also add a parameter cache_nil: true to preserve compatibility and use the extra read from option 1.
I would actively discourage locks. As you've said, they would be slow, but also very hard to manage in a multi-server deployment such as the one we have in production (a load balancer in front of multiple web servers).
I'm not familiar with the cache API or the instrumentation API, but it seems that the read method records whether or not there is a cache hit, so by using that, couldn't we call read only once without calling exist? at all?
https://github.com/rails/rails/blob/v7.0.4.2/activesupport/lib/active_support/cache.rb#L379
@tsugitta I agree, this is the most correct fix, but it seemed like a big change 🤷
A big change is not an issue; we can make a major release. The problem I'm trying to solve is that read returns nil both when there's no such key at all (so we need to run the resolver method) and when there is a key with nil (so we want to return nil without running the resolution again). We could also add an option to force re-resolution when there's nil for any reason (e.g., resolve_when_nil: true).
Do you want to create the PR for the workaround?
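A rough sketch of how such an option might behave (resolve_when_nil and the resolver plumbing are assumptions, not the gem's API; a plain Hash stands in for the cache store):

```ruby
NIL_IN_CACHE = :nil_in_cache

# Hypothetical sketch of a `resolve_when_nil:` option; the block plays the
# role of the field resolver.
def fetch_fragment(store, key, resolve_when_nil: false)
  cached =
    if store.key?(key)
      store[key].nil? ? NIL_IN_CACHE : store[key]
    end

  if cached.nil? || (resolve_when_nil && cached == NIL_IN_CACHE)
    yield.tap { |value| store[key] = value } # (re)run the resolver and cache it
  elsif cached == NIL_IN_CACHE
    nil # nil was cached and re-resolution was not forced
  else
    cached
  end
end

store = { "a" => nil } # "a" has nil cached

fetch_fragment(store, "a") { 1 }                         # => nil (resolver skipped)
fetch_fragment(store, "a", resolve_when_nil: true) { 1 } # => 1 (resolver forced)
fetch_fragment(store, "missing") { 99 }                  # => 99 (normal cache miss)
```

With the option off, a cached nil short-circuits the resolver as today; with it on, any nil (cached or missing) triggers re-resolution.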
The problem I'm trying to solve is that read returns nil both when there's no such key at all (so we need to run the resolver method) and when there is a key with nil (so we want to return nil without running the resolution again).
Yes, I'm also trying to solve the same problem. As I commented above, read seems to notify us via instrumentation whether there is a cache hit or not, so by looking at that, we can determine whether the cache doesn't exist or the cache exists and the value is nil.
In addition, for the case where nil is regarded as a valid cache value, there is the advantage that the two requests (read and exist?) can be reduced to one.
Aside from that, the workaround seems great. I think it would be nice if I could submit a PR, but I haven't figured out the structure of the library yet, so I don't know where to start and it may take some time. So I don't think I am the right person, at least for now.
I had overlooked that the version I was using was outdated. After upgrading this library from 1.9.1 to 1.18.1 (and Rails from 6.1.4.4 to 7.0.4), this problem, which used to happen dozens of times a day in our application that accesses cache frequently, never happened again (has never been reported to Sentry).
Interesting 🤔 I don't remember us addressing it here in the gem; maybe something in Rails changed for the better.
As far as I can tell this is still an issue. Any thoughts on the PR? It would be really great to either merge that or leave this open until it’s resolved.