
Teams are cached for too long

Open lgarron opened this issue 6 years ago • 8 comments

While debugging tests for #4534, we found that Team.wrc was cached across tests. In theory, this could have repercussions in production (although the impact is limited in practice, because team members don't change often).
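
For the test-suite side of this, a minimal sketch of a workaround could be to drop Team's class-level cache between examples (assuming RSpec, and assuming the cache lives in the @@teams_by_friendly_id class variable identified in the snippet below; the hook itself is hypothetical, not part of the actual suite):

# Hypothetical RSpec hook: reset Team's class-level cache before each
# example so no cached teams leak across tests.
RSpec.configure do |config|
  config.before(:each) do
    if Team.class_variable_defined?(:@@teams_by_friendly_id)
      Team.remove_class_variable(:@@teams_by_friendly_id)
    end
  end
end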

lgarron avatar Aug 26 '19 04:08 lgarron

I believe this is the relevant code:

https://github.com/thewca/worldcubeassociation.org/blob/eaac15363afd997b80a21e62cb24a9ee1f8736bd/WcaOnRails/app/models/team.rb#L70-L73

Jambrose777 avatar Aug 26 '19 22:08 Jambrose777

I would assume that clearing the class variable whenever a team member change is made should be sufficient; however, we have two workers running in separate processes in production, so only one of them would pick up the update. Another approach would be to take an updated_at field for the team into account when building the @@teams_by_friendly_id variable, and update that field whenever a team member is added/removed/changed.
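
For illustration, here is a minimal sketch of that updated_at approach. It assumes membership changes bump the team's updated_at (e.g. via touch: true on the membership association); the cache-building line is taken from the snippet linked above, but the method itself is hypothetical:

def self.all_teams_by_friendly_id
  @@teams_by_friendly_id ||= nil
  @@teams_cached_as_of ||= nil

  # One extra SELECT MAX(updated_at) per lookup; this is the query that
  # partly defeats the caching.
  latest_change = unscoped.maximum(:updated_at)

  if @@teams_by_friendly_id.nil? || @@teams_cached_as_of != latest_change
    @@teams_by_friendly_id = all.with_hidden.index_by(&:friendly_id)
    @@teams_cached_as_of = latest_change
  end

  @@teams_by_friendly_id
end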

viroulep avatar Sep 05 '19 14:09 viroulep

Coming back to this issue: this actually causes problems when syncing mailing lists, which may not pick up member changes until the cache is cleared (i.e. the unicorn restart we do when we deploy).

Checking the updated_at field partly defeats the point of caching, as we'd have to emit a SELECT on every lookup. I don't have a great suggestion for this; another alternative would be to send a restart signal to unicorn before running the daily jobs.

viroulep avatar Sep 02 '20 07:09 viroulep

Turns out the Delayed::Job worker runs in a different process from the unicorn one, so we'd probably need to do both restart_app and restart_dj.

viroulep avatar Sep 02 '20 07:09 viroulep

Another alternative: add an optional force_reload argument when fetching a team, which we could pass when calling Team.board in the rake task.
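
A minimal sketch of what that could look like, assuming the accessors are plain class methods on Team (in reality they may be generated differently) and reusing the cache-building line from the snippet linked earlier:

# Hypothetical sketch, not the actual codebase.
def self.all_teams_by_friendly_id(force_reload: false)
  @@teams_by_friendly_id = nil if force_reload
  @@teams_by_friendly_id ||= all.with_hidden.index_by(&:friendly_id)
end

def self.board(force_reload: false)
  # Assumes the Board team's friendly_id is "board".
  all_teams_by_friendly_id(force_reload: force_reload)["board"]
end

The rake task that syncs the mailing lists could then call Team.board(force_reload: true) to bypass any stale cache in its own process.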

viroulep avatar Sep 02 '20 07:09 viroulep

This error occurred again recently and is critical to some operational WCA procedures (Delegates falsely detected as being on probation, confidential mail still being forwarded to people who have been removed from the team). Unfortunately, I don't have sufficient insight into our Rails caching to figure this out.

Summoning our resident sage @jonatanklosko for your Rails and infrastructure expertise 🙏

As the teams page on the WCA website is public, I think it's fine to share the following bit of diagnosis: 2015MATT05 is no longer a WQAC member. SSHing into the production machine and querying the WQAC team members through a bin/rails c console prompt indeed shows that Enzo is not listed as a member. But the background jobs keep adding him to the Google Workspace mailing list, even when we manually remove him through the web admin interface. So clearly he (i.e. the Team.wqac team with all its members) is cached somewhere. But I don't know how to proceed from here 😕
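
For reference, the console check was roughly along these lines (the association name here is a guess; adjust to whatever Team actually exposes):

# Run inside bin/rails c on the production box.
Team.wqac.current_members
# The returned members did not include 2015MATT05.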

gregorbg avatar Apr 19 '22 09:04 gregorbg

To summarize the problem: we query teams once and store them in the @@teams_by_friendly_id class variable (that's the cache in question). The variable is never invalidated until we restart the given app process.

Because there are multiple separate processes, I don't think there's a straightforward way of invalidating the cache in all of them on demand (when a member is added/removed).

One option I see is invalidating this cache periodically, so in

https://github.com/thewca/worldcubeassociation.org/blob/eaac15363afd997b80a21e62cb24a9ee1f8736bd/WcaOnRails/app/models/team.rb#L70-L73

we could do something like this

# Make sure the class variable is defined, so the first call doesn't raise.
@@teams_by_friendly_id ||= nil

# Rebuild the cache when it has never been populated or is older than 15 minutes.
# (The timestamp is only read after the cache has been built at least once,
# so it doesn't need the same guard.)
if @@teams_by_friendly_id.nil? || @@teams_by_friendly_id_timestamp < 15.minutes.ago
  @@teams_by_friendly_id = all.with_hidden.index_by(&:friendly_id)
  @@teams_by_friendly_id_timestamp = DateTime.now
end

@@teams_by_friendly_id

This way we never use cached teams older than 15 minutes. Given we sync the mailing lists every hour, the worst case is a membership change made right after a cache refresh: the next hourly sync may still see the cache while it is under 15 minutes old and miss the change, so it only propagates on the sync after that, roughly 1 hour and 15 minutes later.

@viroulep let me know if you have any thoughts on this :)

jonatanklosko avatar Apr 19 '22 10:04 jonatanklosko

Just for the record: I don't have more to add to Jonatan's analysis. I do remember this issue coming up once in the past, but I guess we let it go, maybe because at the time we deployed frequently enough (or had fewer team changes) that the issue wasn't very relevant.

I definitely support Jonatan's workaround! Unfortunately there is no way to properly invalidate the cache on a team change because we have multiple instances running in different processes :s

viroulep avatar Apr 29 '22 13:04 viroulep