[Core feature] Delete terminated workflows in chunks during garbage collection

Open jeevb opened this issue 3 years ago • 5 comments

Motivation: Why do you think this is important?

A large number of FlyteWorkflow objects can overwhelm Flyte's garbage collection routine. The garbage collector first lists all objects in the respective namespaces, and this list operation times out when a namespace contains a large number of objects.

FlytePropeller watches for new or updated FlyteWorkflow objects in the namespaces assigned to it, so when any of these namespaces holds a large number of objects, the ListAndWatch operation times out as well. As a result, the whole workflow engine grinds to a halt once the number of FlyteWorkflow objects grows beyond what the garbage collector can handle. See the log below:

I0124 16:52:23.295697       1 trace.go:205] Trace[1562460260]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167 (24-Jan-2022 16:51:53.294) (total time: 30000ms):
Trace[1562460260]: [30.000707142s] [30.000707142s] END
E0124 16:52:23.295723       1 reflector.go:138] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1alpha1.FlyteWorkflow: failed to list *v1alpha1.FlyteWorkflow: Get "https://192.168.3.1:443/apis/flyte.lyft.com/v1alpha1/flyteworkflows?labelSelector=termination-status+notin+%28terminated%29&limit=500&resourceVersion=0": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
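
As a reading aid for the log above, the query string of the failing request can be decoded with Python's standard library. This only unpacks the URL already shown in the log; the variable names are illustrative.

```python
from urllib.parse import urlsplit, parse_qs

# URL copied verbatim from the error line in the log above.
url = (
    "https://192.168.3.1:443/apis/flyte.lyft.com/v1alpha1/flyteworkflows"
    "?labelSelector=termination-status+notin+%28terminated%29"
    "&limit=500&resourceVersion=0"
)

# parse_qs decodes '+' as a space and percent-escapes such as %28 / %29.
params = parse_qs(urlsplit(url).query)

for key, values in params.items():
    print(f"{key} = {values[0]}")
```

Decoded, the label selector reads `termination-status notin (terminated)` with `limit=500`: the request that timed out is a paginated list of FlyteWorkflow objects that are not yet labeled as terminated.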

Goal: What should the final outcome look like, ideally?

The garbage collector should limit the number of terminated workflows it lists and deletes on every tick. This avoids timeouts when a namespace contains a large number of objects, so garbage collection can complete successfully even if it takes longer.
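
A minimal sketch of what this could look like, written in Python against an in-memory stand-in rather than the real Kubernetes API. Everything here is hypothetical and for illustration only: `FakeWorkflowStore`, `gc_tick`, `chunk_size`, and `max_per_tick` are not FlytePropeller names, and the continue token is simplified to "last name seen". Only the `termination-status: terminated` label is taken from the log above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class FakeWorkflowStore:
    """In-memory stand-in for the API server (hypothetical, illustration only)."""
    objects: Dict[str, Dict[str, str]] = field(default_factory=dict)  # name -> labels

    def list_terminated(
        self, limit: int, cont: Optional[str] = None
    ) -> Tuple[List[str], Optional[str]]:
        """Return up to `limit` terminated names after `cont`, plus a continue token."""
        names = sorted(
            name
            for name, labels in self.objects.items()
            if labels.get("termination-status") == "terminated"
            and (cont is None or name > cont)
        )
        page = names[:limit]
        # A token is returned only when more matching objects remain.
        next_token = page[-1] if len(names) > limit else None
        return page, next_token

    def delete(self, name: str) -> None:
        self.objects.pop(name, None)


def gc_tick(store: FakeWorkflowStore, chunk_size: int, max_per_tick: int) -> int:
    """Delete at most `max_per_tick` terminated workflows, `chunk_size` per list call."""
    deleted = 0
    cont: Optional[str] = None
    while deleted < max_per_tick:
        # Never ask for more than the remaining per-tick budget.
        limit = min(chunk_size, max_per_tick - deleted)
        page, cont = store.list_terminated(limit, cont)
        for name in page:
            store.delete(name)
        deleted += len(page)
        if cont is None:  # nothing left to list
            break
    return deleted


if __name__ == "__main__":
    store = FakeWorkflowStore()
    for i in range(25):
        store.objects[f"wf-{i:03d}"] = {"termination-status": "terminated"}
    for i in range(5):
        store.objects[f"running-{i}"] = {}
    # Each tick removes a bounded number of objects; later ticks finish the backlog.
    print([gc_tick(store, chunk_size=10, max_per_tick=20) for _ in range(3)])
```

The point of the sketch is that each list call is bounded by `chunk_size` and each tick by `max_per_tick`, so a large backlog is worked off over several ticks instead of in one unbounded list call.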

Describe alternatives you've considered

We considered a cronjob that deletes terminated workflows in chunks to work around this issue, but we believe this is better fixed in FlytePropeller itself.

Propose: Link/Inline OR Additional context

No response

Are you sure this issue hasn't been raised already?

  • [X] Yes

Have you read the Code of Conduct?

  • [X] Yes

jeevb, Feb 12 '22 16:02