
Cache PUB internal calls

Open iszulcdeepsense opened this issue 2 years ago • 6 comments

On each Job call, the PUB component makes a request to Lifecycle to check whether the user is allowed to call the job and to get the job's internal address to forward the request to. Lifecycle, in turn, makes calls to a database, which can perform badly under higher load.

To improve PUB's performance, these internal calls could be cached with a short time-to-live (e.g. 1 minute). The composite cache key would consist of: user auth token, job name, job version, job endpoint.

The cache could live in either PUB or Lifecycle, though it makes more sense for PUB to hold it.
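
For illustration, a minimal sketch of what such a TTL cache could look like (Python; `call_lifecycle` and the other names are made up for the example, not the real PUB code):

```python
import time


class TTLCache:
    """Minimal in-memory cache where each entry expires after a short TTL."""

    def __init__(self, ttl_seconds: float = 60.0):
        self.ttl = ttl_seconds
        self._entries: dict[tuple, tuple[float, object]] = {}

    def get(self, key: tuple):
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() > expires_at:
            del self._entries[key]  # entry expired, drop it
            return None
        return value

    def put(self, key: tuple, value: object) -> None:
        self._entries[key] = (time.monotonic() + self.ttl, value)


def call_lifecycle(auth_token, job_name, job_version, endpoint):
    """Placeholder for the real permission check + address lookup in Lifecycle."""
    raise NotImplementedError


cache = TTLCache(ttl_seconds=60)


def resolve_job(auth_token: str, job_name: str, job_version: str, endpoint: str):
    # Composite key as described above: auth token, job name, version, endpoint.
    key = (auth_token, job_name, job_version, endpoint)
    result = cache.get(key)
    if result is None:
        result = call_lifecycle(auth_token, job_name, job_version, endpoint)
        cache.put(key, result)
    return result
```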

iszulcdeepsense avatar Mar 17 '23 10:03 iszulcdeepsense

which can perform badly under higher load

Has this actually been the case?

I'm not against caching at all, but I typically prefer to try as many other things as possible first. Off the top of my head:

  • Consolidating DB calls (e.g. if we always make the same two calls in succession, combine them into one)
  • Checking that the connection pooler is doing the right thing keeping connections "warm", so we reduce connection overhead
  • Measuring how much of the performance budget of a DB call is in fact unavoidable infrastructure latency (VPN, network latency, other encryption such as TLS); see the timing sketch after this list
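
As a rough illustration of that last measurement, one could split connection setup from query execution (a minimal sketch; psycopg2 and the connection string are assumptions, substitute whatever driver and credentials the deployment actually uses):

```python
import time

import psycopg2  # assuming a PostgreSQL backend; the connection string is made up


def timed(label: str, fn):
    start = time.perf_counter()
    result = fn()
    print(f"{label}: {(time.perf_counter() - start) * 1000:.1f} ms")
    return result


# Time connection setup (TCP, TLS, auth) separately from query execution
# to see how much of each DB call is pure infrastructure overhead.
conn = timed("connect", lambda: psycopg2.connect("host=db dbname=racetrack user=racetrack"))
cur = conn.cursor()
timed("query (cold)", lambda: cur.execute("SELECT 1"))
timed("query (warm)", lambda: cur.execute("SELECT 1"))
cur.close()
conn.close()
```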

When we start caching, we will pay a price in more difficult troubleshooting later, and it becomes easier to make mistakes that cause bugs or security issues (e.g. anything touching user accounts will need to remember to update the cache as well).

JosefAssadERST avatar Mar 20 '23 06:03 JosefAssadERST

Has this actually been the case?

To be honest, it's just my conjecture. It has been shown that a single request to Racetrack v2 can be slower than in v1 due to the external database. This idea is just a trick to mitigate that, but I don't think it's a problem right now.

At this point, let's leave it as it is. This issue could even be classified as a "premature optimization". Plus, I agree, this could make troubleshooting and testing more difficult.

iszulcdeepsense avatar Mar 20 '23 07:03 iszulcdeepsense

By the way, if this really is such a big performance problem, another way to do it without running the risk of race conditions and fiddling with TTLs is for PUB to keep a copy of the auth table in memory, and when there's a change to the auth table, Lifecycle pushes this change to PUB.

Better than this would be not doing it at all, but if there really is a demonstrable performance problem (and not just us being annoyed that there are more calls when there's no real price to pay for those resources), then this model is more robust.
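
A minimal sketch of what the PUB side of that push model could look like (hypothetical names; the lock is there because the proxy handles concurrent requests):

```python
import threading


class AuthTableReplica:
    """PUB-side in-memory copy of the auth table (hypothetical sketch).

    Lifecycle pushes every change here instead of PUB polling the database,
    so there is no TTL to tune and no window for a stale read to sneak in.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._allowed: dict[str, set[str]] = {}  # auth token -> permitted job names

    def full_sync(self, snapshot: dict[str, set[str]]) -> None:
        # On PUB startup or after a reconnect, Lifecycle sends the whole table once.
        with self._lock:
            self._allowed = {token: set(jobs) for token, jobs in snapshot.items()}

    def apply_change(self, token: str, jobs: set[str] | None) -> None:
        # Handler for the endpoint that Lifecycle pushes individual changes to.
        with self._lock:
            if jobs is None:
                self._allowed.pop(token, None)  # token revoked
            else:
                self._allowed[token] = set(jobs)

    def is_allowed(self, token: str, job_name: str) -> bool:
        with self._lock:
            return job_name in self._allowed.get(token, set())
```

The `full_sync` path is the tricky part: after a PUB restart or a missed push, the whole replica has to be rebuilt before it can be trusted again.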

JosefAssadERST avatar Aug 22 '24 10:08 JosefAssadERST

I like this idea as an effective mitigation of unstable database connections. In case of a connection loss, PUB could withstand short outages (if that were the problem to solve) by serving from the in-memory cache. I think that improving availability, more than performance, would be the driving factor to make this issue happen one day.
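
For that availability angle, a minimal serve-stale-on-error sketch (reusing the hypothetical `call_lifecycle` placeholder from the first sketch):

```python
last_known: dict[tuple, object] = {}  # fallback copies, deliberately never expired


def call_lifecycle(auth_token, job_name, job_version, endpoint):
    """Placeholder for the real request to Lifecycle (hypothetical)."""
    raise NotImplementedError


def resolve_job_with_fallback(auth_token, job_name, job_version, endpoint):
    key = (auth_token, job_name, job_version, endpoint)
    try:
        result = call_lifecycle(auth_token, job_name, job_version, endpoint)
        last_known[key] = result    # refresh the fallback copy on every success
        return result
    except ConnectionError:
        if key in last_known:
            return last_known[key]  # Lifecycle unreachable: serve last known value
        raise                       # nothing cached for this caller, fail as before
```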

iszulcdeepsense avatar Aug 22 '24 11:08 iszulcdeepsense

It also sounds like a fun thing to implement.

anders314159 avatar Aug 22 '24 12:08 anders314159

Bear in mind, I don't think you should do this unless the DB calls objectively are a problem.

I'm proposing a cache here, and that has a high price in debuggability; getting a cache to work reliably is one of those things that seems simple enough but can actually trip you up in a lot of subtle ways. There are a lot of failure modes.

But if you HAVE to, this seems more reliable than caching DB calls.

JosefAssadERST avatar Aug 26 '24 03:08 JosefAssadERST