
[backend] Performance issue: ScheduledWorkflow is taking significant amount of etcd storage

Open deepk2u opened this issue 2 years ago • 15 comments

Environment

  • How did you deploy Kubeflow Pipelines (KFP)? Full Kubeflow deployment using manifests
  • KFP version: 2.0.0b6
  • KFP SDK version: 2.0.0b6

Steps to reproduce

We have around 125 recurring runs within a single namespace. After a few months of accumulated historical runs, we started seeing performance issues in the k8s cluster.

Digging deeper, we found timeouts in the calls to etcd. When we inspected the etcd database, we found that one particular namespace, which has 125 recurring runs, is taking up 996 MB of etcd space.

Some data to look at:

Entries by 'KEY GROUP' (total 1.6 GB):
+--------------------------------------------------------+-------------------+--------+
|                        KEY GROUP                       |        KIND       |  SIZE  |
+--------------------------------------------------------+-------------------+--------+
| /registry/kubeflow.org/scheduledworkflows/<namespace1> | ScheduledWorkflow | 996 MB |
| /registry/kubeflow.org/scheduledworkflows/<namespace2> | ScheduledWorkflow | 211 MB |
| /registry/kubeflow.org/scheduledworkflows/<namespace3> | ScheduledWorkflow | 118 MB |

.....

namespace1 has 123 recurring runs, namespace2 has 40, and namespace3 has 63.
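For reference, this kind of per-prefix accounting can be reproduced by pulling the keys under `/registry/kubeflow.org/scheduledworkflows` out of etcd (e.g. via `etcdctl get <prefix> --prefix -w json`, whose output base64-encodes keys and values) and summing decoded value sizes per namespace. A minimal sketch of the aggregation step, assuming the raw key/value entries are already in hand (the function name and `depth` parameter are illustrative, not part of any tool):

```python
import base64
from collections import defaultdict

def size_by_key_group(kvs, depth=4):
    """Aggregate decoded value sizes by key prefix.

    `kvs` mirrors the `kvs` array in `etcdctl get <prefix> --prefix -w json`
    output: each entry has base64-encoded "key" and "value" fields.
    `depth` controls how many '/'-separated components form the group,
    e.g. depth=4 groups by /registry/kubeflow.org/scheduledworkflows/<ns>.
    """
    sizes = defaultdict(int)
    for kv in kvs:
        key = base64.b64decode(kv["key"]).decode()
        value = base64.b64decode(kv["value"])
        group = "/".join(key.split("/")[: depth + 1])
        sizes[group] += len(value)
    return dict(sizes)

# Synthetic example (real input would come from etcdctl's JSON output):
sample = [
    {"key": base64.b64encode(b"/registry/kubeflow.org/scheduledworkflows/ns1/swf-a").decode(),
     "value": base64.b64encode(b"x" * 1000).decode()},
    {"key": base64.b64encode(b"/registry/kubeflow.org/scheduledworkflows/ns1/swf-b").decode(),
     "value": base64.b64encode(b"x" * 500).decode()},
    {"key": base64.b64encode(b"/registry/kubeflow.org/scheduledworkflows/ns2/swf-c").decode(),
     "value": base64.b64encode(b"x" * 200).decode()},
]
print(size_by_key_group(sample))
# {'/registry/kubeflow.org/scheduledworkflows/ns1': 1500,
#  '/registry/kubeflow.org/scheduledworkflows/ns2': 200}
```

Note this counts only the latest revision of each key; older revisions retained before compaction take additional space on top of these numbers.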

Expected result

It looks like we are storing a lot of unnecessary information in the ScheduledWorkflow objects, which takes up space in the etcd database and ultimately causes these performance issues.

Materials and Reference


Impacted by this bug? Give it a 👍.

deepk2u avatar Jan 25 '23 04:01 deepk2u

/assign @gkcalat

connor-mccarthy avatar Jan 26 '23 23:01 connor-mccarthy

Hi @deepk2u! It may be due to insufficient resource provisioning or the lack of etcd maintenance (see here). How long did it take for you to reach these numbers?

gkcalat avatar Jan 27 '23 00:01 gkcalat

It's an EKS cluster. We have been working with AWS support; they maintain the cluster, run defragmentation, and perform other kinds of etcd maintenance.

The oldest object I have on the list is from 14th July 2022.

deepk2u avatar Jan 31 '23 21:01 deepk2u

Can you check how large the pipeline manifests used in these recurring runs are?

gkcalat avatar Feb 01 '23 00:02 gkcalat

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Aug 26 '23 07:08 github-actions[bot]

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

github-actions[bot] avatar Nov 25 '23 07:11 github-actions[bot]

/reopen

kuldeepjain avatar Jan 05 '24 02:01 kuldeepjain

@kuldeepjain: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

google-oss-prow[bot] avatar Jan 05 '24 02:01 google-oss-prow[bot]

/reopen

deepk2u avatar Jan 05 '24 02:01 deepk2u

@deepk2u: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

google-oss-prow[bot] avatar Jan 05 '24 02:01 google-oss-prow[bot]

Closing this issue. No activity for more than a year.

/close

rimolive avatar Apr 03 '24 16:04 rimolive

@rimolive: Closing this issue.

In response to this:

Closing this issue. No activity for more than a year.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

google-oss-prow[bot] avatar Apr 03 '24 16:04 google-oss-prow[bot]

/reopen

We found this issue in KFP 2.0.5. We'll work on a pruning mechanism for pipeline run k8s objects.

rimolive avatar Jun 17 '24 21:06 rimolive

@rimolive: Reopened this issue.

In response to this:

/reopen

We found this issue in KFP 2.0.5. We'll work on a pruning mechanism for pipeline run k8s objects.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

google-oss-prow[bot] avatar Jun 17 '24 21:06 google-oss-prow[bot]

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Aug 17 '24 07:08 github-actions[bot]

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

github-actions[bot] avatar Sep 09 '24 07:09 github-actions[bot]

/reopen

I am seeing this issue in KFP 2.2.0. Our cluster has ~50 ScheduledWorkflows, and we are seeing ~400 MB written to etcd every 10 minutes under the /registry/kubeflow.org/scheduledworkflows prefix.

The culprit seems to be the heartbeat status updates, e.g.:

Status:
  Conditions:
    Last Heartbeat Time:   2024-09-19T11:16:33Z
    Last Transition Time:  2024-09-19T11:16:33Z
    Message:               The schedule is disabled.
    Reason:                Disabled
    Status:                True
    Type:                  Disabled

These status updates mean the SWF objects never finish reconciling, resulting in the following reconciliation loop:

  1. The SWF is added to the controller work queue.
  2. The controller processes the SWF and updates the status heartbeat and transition time to the current time.
  3. The object is rewritten to etcd and its resourceVersion is updated.
  4. The controller event handler re-adds the SWF to the work queue.

This reconciliation loop occurs every 10 seconds for every SWF on the cluster (note: it's 10s rather than 1s because of the controller's default queue backoff, so events are always queued for a minimum of 10s).
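The reported write volume is consistent with this loop. A quick back-of-the-envelope check (the per-write size at the end is an inference from the observed totals, not a measurement):

```python
swf_count = 50            # ScheduledWorkflows on the cluster
reconcile_period_s = 10   # effective loop period (default queue backoff)
window_s = 10 * 60        # observation window: 10 minutes

# One status write per SWF per reconcile period:
writes_per_window = swf_count * window_s // reconcile_period_s
bytes_per_window = 400 * 1024 * 1024   # ~400 MB observed per window

print(writes_per_window)                       # 3000 status writes per 10 min
print(bytes_per_window // writes_per_window)   # 139810 bytes (~137 KB) per write
```

~137 KB per object write would suggest each SWF carries a sizable embedded pipeline spec, which also lines up with the earlier question about manifest sizes.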

How are the heartbeat time and transition time used in Kubeflow? If they are not used, then one possible fix here would be to remove them from the status block.
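Alternatively, the usual controller convention is to bump `lastTransitionTime` only when the condition actually changes, and to skip the status write entirely when nothing but timestamps would differ: no write means no resourceVersion bump, so the watch handler has nothing to re-enqueue. A minimal, KFP-agnostic sketch of that guard (the helper names and condition dicts are illustrative, not actual KFP controller code):

```python
def conditions_equal_ignoring_timestamps(a, b):
    """Compare condition lists on semantic fields only."""
    def strip(c):
        return {k: v for k, v in c.items()
                if k not in ("lastHeartbeatTime", "lastTransitionTime")}
    return [strip(c) for c in a] == [strip(c) for c in b]

def maybe_update_status(stored_conditions, desired_conditions, write_fn):
    """Write status only when a semantic field changed.

    Skipping the write keeps resourceVersion stable, so the event
    handler does not re-enqueue the object and the loop settles.
    """
    if conditions_equal_ignoring_timestamps(stored_conditions, desired_conditions):
        return False  # no etcd write, no requeue
    write_fn(desired_conditions)
    return True

# The disabled-SWF case from above: only the heartbeat timestamp moved.
old = [{"type": "Disabled", "status": "True", "reason": "Disabled",
        "message": "The schedule is disabled.",
        "lastHeartbeatTime": "2024-09-19T11:16:33Z",
        "lastTransitionTime": "2024-09-19T11:16:33Z"}]
new = [dict(old[0], lastHeartbeatTime="2024-09-19T11:16:43Z")]
print(maybe_update_status(old, new, write_fn=lambda c: None))  # False: write skipped
```

With a guard like this, a disabled SWF would be written once and then stay quiet instead of generating a ~137 KB write every backoff period.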

cc @droctothorpe

demarna1 avatar Sep 24 '24 16:09 demarna1

@demarna1: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

google-oss-prow[bot] avatar Sep 24 '24 16:09 google-oss-prow[bot]

/reopen

droctothorpe avatar Sep 24 '24 20:09 droctothorpe

@droctothorpe: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

google-oss-prow[bot] avatar Sep 24 '24 20:09 google-oss-prow[bot]

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Nov 24 '24 07:11 github-actions[bot]

/reopen

droctothorpe avatar Nov 24 '24 12:11 droctothorpe