graphql-ruby-fragment_cache
Fragment cache sometimes returns null despite cached data loading initially
Hi there,
ENV: We are currently using your gem in our Rails API, which passes data through to an Apollo Angular app. We originally had a pure Rails solution but got better performance results with your gem.
Issue: The problem is that sometimes the cache returns null. This doesn't happen locally, but it does happen when we use Redis as the cache store on staging and production. The cached data is returned most of the time but sometimes comes back null. We have investigated as much as we could: we first thought it might be the front end, but we have narrowed it down to the cache returning null.
How to reproduce (this only happens some of the time):
- load the site the first time to populate the cache (works)
- hard refresh and reload the site (network results show that the cache is working)
- wait a while (5-10 minutes) and load the page again (returns null); the cache is set to expire after 2 hours, and checking Redis shows that the key is still present when this failure occurs
Question: Do you have any ideas on what might be causing this or what we might be able to do? Thank you in advance.
Hi @RudiBoshoff!
That sounds really strange, because the gem does the read in a dead simple way:
FragmentCache.cache_store.read(cache_key).tap do |cached|
  return NIL_IN_CACHE if cached.nil? && FragmentCache.cache_store.exist?(cache_key)
end
In this case FragmentCache.cache_store is a Redis adapter, and if you can see that the key is still there, the value should load without any issues. Since you know the key, could you please call FragmentCache.cache_store.read(cache_key) from the console and make sure the value is not nil?
Reading happens here; it looks like it does nothing more than a nil check.
Closing this for now, feel free to reopen 🙂
FWIW: we're seeing similar behavior.
Isn't there a race condition?
1. FragmentCache.cache_store.read(cache_key) returns nil for an uncached value.
2. Someone else populates the cache key.
3. FragmentCache.cache_store.exist?(cache_key) returns true (even though the value wasn't cached at step 1).
Thoughts? @DmitryTsepelev
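The interleaving above can be simulated with a minimal in-memory store standing in for the Redis adapter (ToyStore and the key below are made up for illustration; only the read/exist? check mirrors the gem's logic quoted earlier):

```ruby
# Minimal in-memory stand-in for FragmentCache.cache_store (hypothetical,
# for illustration only).
class ToyStore
  def initialize
    @data = {}
  end

  def read(key)
    @data[key]
  end

  def write(key, value)
    @data[key] = value
  end

  def exist?(key)
    @data.key?(key)
  end
end

NIL_IN_CACHE = :nil_in_cache

store = ToyStore.new
key = "fragment:user:1"

# Step 1: the key is not cached yet, so read returns nil.
cached = store.read(key)

# Step 2: another process populates the key between the two calls.
store.write(key, "real value")

# Step 3: exist? now returns true, so the gem's check
# (`cached.nil? && exist?`) wrongly concludes that nil was cached.
result = cached.nil? && store.exist?(key) ? NIL_IN_CACHE : cached
# result is NIL_IN_CACHE even though nil was never written to the cache.
```

The window between the read and the exist? call is exactly where a concurrent write flips the outcome.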
We've been doing some testing in production with the following
def value_from_cache
  return nil unless FragmentCache.cache_store.exist?(cache_key)

  FragmentCache.cache_store.read(cache_key).tap do |cached|
    return NIL_IN_CACHE if cached.nil? && FragmentCache.cache_store.exist?(cache_key)
  end
end
It seems to have addressed the race condition.
@RudiBoshoff @DmitryTsepelev I think we should reopen this
@gsdean same here. Can you explain how the workaround works? I understand the problem derives from fetching the cache multiple times, but is it safe to fetch the cache multiple times in that way?
The workaround isn't perfect. There is still a chance a race condition could cause a misread from the cache (specifically when an eviction occurs after the initial read and the cache is then repopulated). Basically, it reduces the risk of value_from_cache incorrectly returning nil from probable to less probable. I'll let you decide whether that is safe or not.
For what it's worth, we've tested both with and without this patch in a high-load, high-concurrency production environment; without this patch we see numerous intermittent errors where nil is returned incorrectly. With the patch we have yet to see an error.
Sorry guys, I don't know how, but GitHub does not send me events from some projects. I'll take a look tomorrow morning.
If I understand correctly, the problem is that the race condition makes the cache return nil even if the value is not nil anymore. This is only possible when it was nil before the cache was rewritten. I think this behavior is kind of acceptable, because the user does not get wrong data, just outdated data, and the tradeoff of a cache is fast data vs. fresh data.
I guess we could add it to the docs and call it a day, what do you think?
The cache returns the wrong value, not an outdated value. I do not believe documentation alone is sufficient. In a moderately concurrent deployment, this is a serious issue.
If I understand correctly, the problem is that the race condition makes the cache return nil even if the value is not nil anymore.
In my situation, the problem seems to be that the race condition makes the cache return nil even though the value has never been nil before.
In reality, there are only two possibilities: the cache key exists and holds a value, or the cache key does not exist. However, it behaves as if the key exists and the value is nil.
The only way to synchronize threads here is using some kind of lock: we could make it configurable and let users opt in. It would fix the issue completely but would slow things down a bit. WDYT?
Also, it sounds like we could change the behavior to store something other than nil in the cache. In that case we would have no problem differentiating the scenarios "cache contains nil" and "cache does not have this key".
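One way to sketch that idea: wrap nil in a sentinel at write time, so a single read can distinguish "key absent" from "nil was cached" with no exist? call and hence no race between the two. The names below are made up for illustration, not the gem's API:

```ruby
# Hypothetical sentinel-based sketch; not the gem's actual implementation.
NIL_SENTINEL = "__fragment_cache_nil__".freeze

def write_fragment(store, key, value)
  # Store the sentinel instead of nil so a later read can tell nil apart
  # from a missing key.
  store[key] = value.nil? ? NIL_SENTINEL : value
end

def read_fragment(store, key)
  cached = store[key]
  return :miss if cached.nil?          # key absent: caller runs the resolver
  return nil if cached == NIL_SENTINEL # nil was genuinely cached
  cached
end

store = {} # plain Hash standing in for the cache store
write_fragment(store, "a", nil)
write_fragment(store, "b", 42)

read_fragment(store, "a") # => nil (cached nil)
read_fragment(store, "b") # => 42
read_fragment(store, "c") # => :miss (never cached)
```

Because only one store call is made per read, there is no window for a concurrent write to change the answer between two calls.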
My vote would be to either
- Just take the hit on the extra read (as we've done in our workaround). This is a fairly pragmatic solution. Maybe start here and open an issue to revisit the extra read.
OR
- Avoid caching nil by default (as you've suggested). Perhaps we could also add a parameter cache_nil: true to preserve compatibility and use the extra read from option 1.
I would actively discourage locks. As you've said, they would be slow, but also very hard to manage in a multi-server deployment such as the one we have in production (a load balancer in front of multiple web servers).
I'm not familiar with the cache API or the instrumentation API, but it seems that the read method records whether or not there is a cache hit, so by using that, couldn't we call read only once without calling exist? at all?
https://github.com/rails/rails/blob/v7.0.4.2/activesupport/lib/active_support/cache.rb#L379
@tsugitta I agree, this is the most correct fix, but it seemed like a big change 🤷
A big change is not an issue; we can make a major release. The problem I'm trying to solve is that read returns nil both when there's no such key at all (so we need to run the resolver method) and when there is a key with nil (so we want to return nil without running the resolution again). We could also add an option to force re-resolution when there's nil for any reason (e.g., resolve_when_nil: true).
Do you want to create the PR for the workaround?
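A rough sketch of how such an option might behave (resolve_when_nil and the resolver plumbing are assumptions, not the gem's API; a plain Hash stands in for the cache store):

```ruby
NIL_IN_CACHE = :nil_in_cache

# Hypothetical sketch of a `resolve_when_nil:` option; the block plays the
# role of the field resolver.
def fetch_fragment(store, key, resolve_when_nil: false)
  cached =
    if store.key?(key)
      store[key].nil? ? NIL_IN_CACHE : store[key]
    end

  if cached.nil? || (resolve_when_nil && cached == NIL_IN_CACHE)
    yield.tap { |value| store[key] = value } # (re)run the resolver and cache it
  elsif cached == NIL_IN_CACHE
    nil # nil was cached and re-resolution was not forced
  else
    cached
  end
end

store = { "a" => nil } # "a" has nil cached

fetch_fragment(store, "a") { 1 }                         # => nil (resolver skipped)
fetch_fragment(store, "a", resolve_when_nil: true) { 1 } # => 1 (resolver forced)
fetch_fragment(store, "missing") { 99 }                  # => 99 (normal cache miss)
```

With the option off, a cached nil short-circuits the resolver as today; with it on, any nil (cached or missing) triggers re-resolution.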
The problem I'm trying to solve is that read returns nil both when there's no such key at all (so we need to run the resolver method) and when there is a key with nil (so we want to return nil without running the resolution again).
Yes, I'm also trying to solve the same problem. As I commented above, read seems to notify us via instrumentation whether there is a cache hit or not, so by looking at that, we can determine whether the cache doesn't exist or the cache exists and the value is nil.
In addition, for the case where nil is regarded as a valid cache value, there is the advantage that the two requests (read and exist?) can be reduced to one.
Aside from that, the workaround seems great. I think it would be nice if I could submit a PR, but I haven't figured out the structure of the library yet, so I don't know where to start and it may take some time. So I don't think I am the right person, at least for now.
I had overlooked that the version I was using was outdated. After upgrading this library from 1.9.1 to 1.18.1 (and Rails from 6.1.4.4 to 7.0.4), this problem, which used to happen dozens of times a day in our application that accesses cache frequently, never happened again (has never been reported to Sentry).
Interesting 🤔 I don't remember us addressing it here in the gem; maybe something in Rails changed for the better.
As far as I can tell this is still an issue. Any thoughts on the PR? It would be really great to either merge that or leave this open until it’s resolved.