feat(backend): add gRPC metrics to api-server (RPS/latency), optimize execution spec reporting
Description of your changes:
- Add standard gRPC RPS and latency metrics to each API server endpoint. There are a lot of manually added metrics in [run_server.go](https://github.com/kubeflow/pipelines/blob/master/backend/src/apiserver/server/run_server.go#L38); this PR replaces most of them with standardized gRPC metrics. Additionally, many services in the api-server previously had no metrics at all, so this PR adds basic observability for them as well. I kept the old metrics to maintain backward compatibility with existing dashboards. (A minimal interceptor-wiring sketch is included at the end of this description.)
- Add metrics to measure the delay between Argo Workflow creation for recurring runs and its reporting to the API server by the persistence agent.
- Optimize the patchExistingTasks MySQL query by filtering on the runId field to leverage the existing index (podName is not covered by an index); see the query sketch after the performance note below.
ReportWorkflowV1 Performance After Optimization (1.5M Tasks in MySQL):
Metrics will be available after the PR is merged.
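A minimal sketch of the query change described in the last bullet, assuming an illustrative tasks table with an indexed RunUUID column and a non-indexed PodName column (names are placeholders, not the actual KFP schema):

```go
package taskstore

import (
	"context"
	"database/sql"
)

// patchTaskStateForRun illustrates the optimization: filter by the indexed
// run ID first so MySQL can use the existing index, and only then narrow by
// pod name, which is not covered by an index. Table and column names are
// illustrative assumptions, not the real KFP schema.
func patchTaskStateForRun(ctx context.Context, db *sql.DB, runID, podName, newState string) error {
	const q = `
		UPDATE tasks
		SET State = ?
		WHERE RunUUID = ?  -- indexed: drives the row lookup
		  AND PodName = ?  -- not indexed: only filters rows already matched by RunUUID`
	_, err := db.ExecContext(ctx, q, newState, runID, podName)
	return err
}
```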
This PR was motivated by an issue that emerged as the number of tasks in our system increased. Recurring runs were created successfully, but their status was not updated for an extended period of time. The same issue also affected regular (one-off) runs, which also experienced delays in status reporting.
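For reviewers, a minimal sketch of how the standard interceptor-based gRPC metrics are typically wired with go-grpc-prometheus; the exact wiring in this PR may differ, and the ports and service registration below are placeholders:

```go
package main

import (
	"log"
	"net"
	"net/http"

	grpc_prometheus "github.com/grpc-ecosystem/go-grpc-prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
	"google.golang.org/grpc"
)

func main() {
	// Attach the standard Prometheus server interceptors so every endpoint
	// gets request counters (RPS) and, once enabled, handling-time histograms.
	s := grpc.NewServer(
		grpc.UnaryInterceptor(grpc_prometheus.UnaryServerInterceptor),
		grpc.StreamInterceptor(grpc_prometheus.StreamServerInterceptor),
	)

	// ... register API services on s here ...

	// Initialize per-method metrics for all registered services and enable
	// the latency histogram (off by default in go-grpc-prometheus v1).
	grpc_prometheus.Register(s)
	grpc_prometheus.EnableHandlingTimeHistogram()

	// Expose /metrics for Prometheus scraping on a separate port.
	go func() {
		http.Handle("/metrics", promhttp.Handler())
		log.Fatal(http.ListenAndServe(":9090", nil))
	}()

	lis, err := net.Listen("tcp", ":8887")
	if err != nil {
		log.Fatal(err)
	}
	log.Fatal(s.Serve(lis))
}
```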
Hi @ntny. Thanks for your PR.
I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.
Once the patch is verified, the new status will be reflected by the ok-to-test label.
I understand the commands that are listed here.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
🚫 This command cannot be processed. Only organization members or owners can use the commands.
/ok-to-test
lgtm, just curious if there is a reason behind sticking with this go-grpc-prometheus package when its github page has it listed as deprecated.
This project is deprecated and archived as the functionality moved to the go-grpc-middleware repo since the provider/[email protected] release. You can pull it using go get github.com/grpc-ecosystem/go-grpc-middleware/providers/prometheus. The API is simplified and modernized, yet the functionality is similar to what v1.2.0 offered. You can submit all questions and issues here.
Is it possible to leverage the modules/structs from the middleware package instead?
- I was a bit concerned about that too, but decided to stick with the v1 version since the project already uses github.com/grpc-ecosystem/go-grpc-middleware v1.3.0. Using go-grpc-prometheus/v2 would require upgrading to the v2 middleware stack, which I figured would be better handled in a separate merge request with proper testing.
- Theoretically, yes, but it would require implementing all the metrics manually with matching naming and building a custom interceptor. That adds some maintenance overhead. It might be an option to explore if there are concerns about using the package.
Makes sense. I'm okay with using the go-grpc-prometheus package if we can open an issue to track upgrading it / the middleware stack.
Any thoughts from your end @HumairAK?
@ntny I made a PR to update go-grpc-middleware, so once that's merged, you can rebase your PR to use v2: https://github.com/kubeflow/pipelines/pull/12043
I prefer this approach since it seems straightforward to update it rather than add another unsupported library to KFP.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: (no approvers yet). Once this PR has been reviewed and has the lgtm label, please ask for approval from humairak. For more information see the Kubernetes Code Review Process.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
/hold
Need to configure metric registration with the new go-grpc-middleware/providers/prometheus API.
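Roughly, the registration with the v2 provider looks like the sketch below (based on the go-grpc-middleware/providers/prometheus package documentation, not the exact code in this PR; function and port names are placeholders):

```go
package server

import (
	grpcprom "github.com/grpc-ecosystem/go-grpc-middleware/providers/prometheus"
	"github.com/prometheus/client_golang/prometheus"
	"google.golang.org/grpc"
)

// newInstrumentedServer sketches the v2-style setup: metrics live on an
// explicit ServerMetrics value instead of package-level state, and the
// handling-time histogram is opt-in via an option.
func newInstrumentedServer() *grpc.Server {
	srvMetrics := grpcprom.NewServerMetrics(
		grpcprom.WithServerHandlingTimeHistogram(),
	)
	prometheus.MustRegister(srvMetrics)

	s := grpc.NewServer(
		grpc.ChainUnaryInterceptor(srvMetrics.UnaryServerInterceptor()),
		grpc.ChainStreamInterceptor(srvMetrics.StreamServerInterceptor()),
	)

	// After all services are registered on s, per-method metrics can be
	// pre-populated so they appear with zero values before the first request:
	// srvMetrics.InitializeMetrics(s)
	return s
}
```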
/unhold
@HumairAK for approval
Rebased. Thanks for the contribution, @ntny! You've been on fire! 🔥
/lgtm
Wonderful optimization @ntny , do you also want to take a look at the scheduled workflow controller in general? With over 1000 scheduled workflows it becomes really slow even if all are disabled.
Hi! Could you please clarify where exactly you're seeing the performance issue: during scheduling, during reporting, somewhere in the controller's reconciliation loop, or elsewhere? Or just describe the symptoms.
If the slowdown happens while reporting run status to the API server, then this PR should address it. Slow reporting to the API server can break runs for scheduled workflows, because their pipeline runs are created during the reporting phase and may not be stored in the API server in time before execution starts.
If you describe the problem in more detail, I’d be happy to take a look!
/lgtm
@CarterFendley FYI
During scheduling. If you have 1500 scheduled workflows, the scheduled workflow controller takes over 10 minutes to schedule newly created ones. I think the queries-per-second limit to the apiserver might be too low at 5/10.
Got it. I haven’t run into this myself, but it should be easy to reproduce. Just schedule 1K+ runs and keep adding new ones. I’ll take a look next week and let you know.
Thank you. You can also use 2k disabled scheduled workflows and then create a single enabled scheduled workflow.
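Side note for readers following this thread: the 5/10 figures match client-go's default rate limit (QPS 5, burst 10), and the sketch below shows roughly where such a limit would be raised. Whether the scheduled workflow controller actually exposes this as a setting is not confirmed here; treat it as an illustrative sketch only.

```go
package main

import (
	"log"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// client-go falls back to QPS=5 and Burst=10 when these are left unset.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}

	// Raising the limits lets a controller that manages thousands of objects
	// talk to the apiserver faster; suitable values depend on cluster capacity.
	cfg.QPS = 50
	cfg.Burst = 100

	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	_ = client // use the clientset in informers/controllers as usual
}
```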
