[BUG] FlytePropeller 1.13.0 is crashing FlyteAdmin and leaving the overall service in a bad state
Describe the bug
We've tried upgrading to Flyte v1.13.0 several times and have found some serious issues blocking the upgrade. We don't know the cause yet, but overall it seems that FlytePropeller's calls to FlyteAdmin cause FlyteAdmin to handle a very large number of calls to two gRPC endpoints: GetOauth2Metadata and GetPublicClientConfig.
What we see in FlyteAdmin is a lot of pressure on the gRPC side and an increase in memory until it gets killed by Kubernetes with an OOM. This in turn causes the FlytePropeller calls to fail with gRPC errors and eventually crash as well. We also noticed that this version added a retry mechanism, which in our case is too aggressive, but we've tuned that down.
It's hard to say from looking at the code, but I suspect this PR introduced some kind of bug, or a missing config we're unaware of, which makes FlyteAdmin constantly try to fetch a token (and fail).
https://github.com/flyteorg/flyte/pull/5407
When we roll back to FlytePropeller v1.11.0, things are fine and FlyteAdmin behaves nicely (still running v1.13.0).
Expected behavior
FlyteAdmin doesn't get overwhelmed by gRPC calls.
Additional context to reproduce
No response
Screenshots
No response
Are you sure this issue hasn't been raised already?
- [X] Yes
Have you read the Code of Conduct?
- [X] Yes
The issue you're describing seems to be related to the interaction between FlytePropeller and FlyteAdmin, particularly with the gRPC calls to GetOauth2Metadata and GetPublicClientConfig. This is causing FlyteAdmin to experience high memory usage and eventually get killed by Kubernetes due to OOM (Out of Memory) errors.
Based on the provided data sources, here are a few steps and considerations to address this issue:
- Check FlyteAdmin Configuration: Ensure that the FlyteAdmin configuration is optimized for your environment. According to this GitHub issue, you might need to adjust the resource limits and liveness probe settings for FlyteAdmin. For example, increasing the CPU and memory limits can help handle the increased load (see the sketch after this list).
- Adjust Retry Mechanism: You mentioned that the retry mechanism introduced in the newer version is too aggressive. Make sure you have tuned it down appropriately. This can help reduce the pressure on FlyteAdmin.
- Review Authentication Setup: Ensure that the authentication setup between FlytePropeller and FlyteAdmin is correctly configured. According to this discussion, issues with token retrieval can cause repeated authentication attempts, leading to high load.
- Monitor Resource Usage: Continuously monitor the resource usage of FlyteAdmin and FlytePropeller. Use Kubernetes metrics and logs to identify any spikes in memory or CPU usage.
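For the first two points, here is a hedged sketch of what such overrides might look like in a Helm values file. Key paths vary between the flyte-core and flyte-binary charts, and the retry field names are taken from the flyteidl admin client config, so treat these as assumptions to verify against your deployment:

```yaml
# Illustrative values.yaml overrides; exact key paths depend on your chart.
flyteadmin:
  resources:
    requests:
      cpu: "1"
      memory: 2Gi
    limits:
      cpu: "2"
      memory: 4Gi

configmap:
  admin:
    # Admin client settings used by FlytePropeller; lowering retries reduces
    # the pressure on FlyteAdmin when its calls start failing.
    admin:
      maxRetries: 3
      perRetryTimeout: 15s
```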
Would you like more detailed steps on how to implement these changes or further troubleshooting steps?
This is an AI-generated response.
I think what is happening here is that the auth interceptor, provided by the `NewAuthInterceptor` function, calls the `MaterializeCredentials` function on every invocation, even if there is already a token.
Inside `MaterializeCredentials`, the code does a couple of things that could probably be optimized:
- it always creates a new `authMetadataClient`, and
- it always calls `NewTokenSourceProvider`. Inside `NewTokenSourceProvider`, in the `AuthTypeClientSecret` case, if `cfg.TokenURL` is not defined, the code will always make a call to Admin's `GetOAuth2Metadata` endpoint. A similar story applies to the `GetPublicClientConfig` endpoint: it just gets called every time.
@pmahindrakar-oss can you think of any quick fixes here? I think it should be okay to cache it on first read, but I don't know how often these get refreshed, etc. Maybe we refresh it whenever we do a token refresh?
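To make that proposal concrete, here is a minimal sketch of the caching idea. It assumes the `AuthMetadataServiceClient` from flyteidl's generated service package; the wrapper type (`cachedMetadataFetcher`) and the package name are hypothetical, not the actual flyteidl code, and the refresh-on-token-refresh question is left open:

```go
package adminclient

import (
	"context"
	"sync"

	"github.com/flyteorg/flyte/flyteidl/gen/pb-go/flyteidl/service"
)

// cachedMetadataFetcher is a hypothetical wrapper that fetches Admin's OAuth2
// metadata and public client config once and reuses the results, instead of
// hitting both endpoints on every MaterializeCredentials call.
type cachedMetadataFetcher struct {
	client service.AuthMetadataServiceClient

	once      sync.Once
	metadata  *service.OAuth2MetadataResponse
	clientCfg *service.PublicClientAuthConfigResponse
	err       error
}

// Fetch performs the two RPCs only on the first call and returns the cached
// responses afterwards. A refresh strategy (e.g. resetting the cache whenever
// the token is refreshed) could be layered on top.
func (c *cachedMetadataFetcher) Fetch(ctx context.Context) (*service.OAuth2MetadataResponse, *service.PublicClientAuthConfigResponse, error) {
	c.once.Do(func() {
		c.metadata, c.err = c.client.GetOAuth2Metadata(ctx, &service.OAuth2MetadataRequest{})
		if c.err != nil {
			return
		}
		c.clientCfg, c.err = c.client.GetPublicClientConfig(ctx, &service.PublicClientAuthConfigRequest{})
	})
	return c.metadata, c.clientCfg, c.err
}
```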
On the Admin side, it does seem likely that the OOM is caused by the repeated calls to these endpoints. That's a side effect, though; calling these two endpoints thousands of times shouldn't be a problem. We're still looking into this.
This is what a pprof looks like right now, so it feels like something Prometheus-related (which would make sense given how simple those two endpoints are), but @eapolinario and I dug for a while and didn't see a culprit just yet.
@wild-endeavor
I noticed this PR wasn't upstreamed. This PR fixed "propeller panic when talking to admin , due to client interceptor sending nil token cache".
I figured this could be a potential cause of the hammering of `NewAuthInterceptor` -> `MaterializeCredentials` -> `GetOauth2Metadata` + `GetPublicClientConfig`.
I'm unsure if this is the issue, however, as it seems like it would cause issues for most users on 1.13. Also, I don't get the worker panics that this PR fixed when running without it.
@RRap0so
Would you have any propeller logs from while it was crashing during your upgrade to 1.13? Also, would you be able to share logs from any workflows that failed during the upgrade?
@RRap0so I think there are two issues happening here. Admin shouldn't be oom'ing, and the client that propeller uses shouldn't be making that many calls to Admin.
Wrt the second one, I asked internally and confirmed: unless you instantiate a new auth server, the details shouldn't change. To that end, until we add a caching layer or change the logic, you can short-circuit it by setting the config fields here and here. This is not ideal, as you'll have these fields twice in your config: once where Admin currently grabs the values (to then serve them via these two endpoints) and once here. Make sure you set scopes as well.
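For reference, a hedged sketch of what that short-circuit could look like in the propeller-side admin client config. The field names (`tokenUrl`, `scopes`, etc.) come from the flyteidl admin client config, but the endpoint, URL, and scope values below are placeholders; copy the real values from wherever Admin currently serves them:

```yaml
admin:
  endpoint: dns:///flyteadmin.flyte.svc.cluster.local:81
  authType: ClientSecret
  clientId: flytepropeller
  clientSecretLocation: /etc/secrets/client_secret
  # Setting tokenUrl (and scopes) directly skips the per-call lookups of
  # GetOauth2Metadata / GetPublicClientConfig against FlyteAdmin.
  tokenUrl: https://your-idp.example.com/oauth2/token
  scopes:
    - all
```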
The memory leak we're still investigating. It's not clear to me how they are related. Will post back here when we have more.
Hey @pvditt and @wild-endeavor, thank you for all the help! I'll try to find some logs, but honestly I don't think there were any problems with the execution of the workflows. Things were still running; what I'm not sure about is whether it was able to publish events to Admin.
I feel that it is fairly normal for clients to cache OAuth metadata. That is something I'd be happy to work on, as I've done it before.
Does anyone have an MVP flyte-binary values.yaml that can reproduce this issue?
@RRap0so , can you give 1.13.1 a try? https://github.com/flyteorg/flyte/pull/5686 implemented this idea of caching oauth data.
@eapolinario we just gave it another go and see the same behaviour: GetOauth2Metadata call spikes in FlyteAdmin, with FlyteAdmin OOMing and restarting.
@RRap0so if the change we talked about works, I think we should close out this issue. There may be a memory leak when Admin serves up 404s in perpetuity, but I think that's not the right thing to be optimizing for either. The memory issue identified in this issue with propeller was isolated to metrics, and is related to this issue here that @eapolinario found. I think we should cut a separate ticket to discuss metric life-cycle with @EngHabu, as that discussion is pretty orthogonal. What do you guys think?
@wild-endeavor so far so good. We've updated our code with the latest flyteidl libs! I'm going to go ahead and close this one.