[wip]feat(backend): postgres integration
Summary
This PR adds full PostgreSQL support (via the pgx driver) to the Kubeflow Pipelines backend, enabling users to choose between MySQL and PostgreSQL as the metadata database. The implementation introduces a clean dialect abstraction layer and includes a major query optimization that benefits both database backends.
Key achievements
- ✅ Complete PostgreSQL integration for the API Server and Cache Server, addressing #7512 and #9813
- ✅ All CI tests passing (MySQL + PostgreSQL)
- ✅ Significant performance improvement for ListRuns queries; this PR is expected to address the root causes behind #10778, #10230, #9780, and #9701
- ✅ Zero breaking changes: backward compatible with existing MySQL deployments
What Changed
- Storage Layer Refactoring - Dialect Abstraction (`backend/src/apiserver/common/sql/dialect`)
  - Problem: SQL syntax was tightly coupled to MySQL.
  - Solution: Introduced a `DBDialect` interface that encapsulates database-specific behavior:
    - Identifier quoting (MySQL backticks vs. PostgreSQL double quotes)
    - Placeholder styles (`?` vs. `$1, $2, ...`)
    - Aggregation functions (`GROUP_CONCAT` vs. `string_agg`)
    - Concatenation syntax (`CONCAT()` vs. `||`)
  - Files:
    - Core dialect implementation → `backend/src/apiserver/common/sql/dialect/dialect.go`
    - Dialect-aware utility functions → `backend/src/apiserver/storage/sql_dialect_util.go`
    - Reusable filter builders with proper quoting → `backend/src/apiserver/storage/list_filters.go`

  All storage layer code now uses:

  ```go
  q := s.dbDialect.QuoteIdentifier
  qb := s.dbDialect.QueryBuilder()
  ```

  This ensures queries work correctly across MySQL, PostgreSQL, and SQLite (for tests).
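As a rough illustration of the dialect abstraction described above, here is a minimal, self-contained sketch. The method set and type names are illustrative, not the actual interface in `dialect.go`:

```go
package main

import (
	"fmt"
	"strings"
)

// DBDialect sketches the kind of interface described above: each method
// captures one axis of MySQL/PostgreSQL divergence.
type DBDialect interface {
	QuoteIdentifier(name string) string // backticks vs. double quotes
	Placeholders(n int) []string        // "?" vs. "$1", "$2", ...
	GroupConcat(expr, sep string) string
}

type mysqlDialect struct{}

func (mysqlDialect) QuoteIdentifier(name string) string { return "`" + name + "`" }
func (mysqlDialect) Placeholders(n int) []string {
	ps := make([]string, n)
	for i := range ps {
		ps[i] = "?"
	}
	return ps
}
func (mysqlDialect) GroupConcat(expr, sep string) string {
	return fmt.Sprintf("GROUP_CONCAT(%s SEPARATOR '%s')", expr, sep)
}

type postgresDialect struct{}

func (postgresDialect) QuoteIdentifier(name string) string { return `"` + name + `"` }
func (postgresDialect) Placeholders(n int) []string {
	ps := make([]string, n)
	for i := range ps {
		ps[i] = fmt.Sprintf("$%d", i+1)
	}
	return ps
}
func (postgresDialect) GroupConcat(expr, sep string) string {
	return fmt.Sprintf("string_agg(%s, '%s')", expr, sep)
}

func main() {
	// The same query template renders correctly for either backend.
	for _, d := range []DBDialect{mysqlDialect{}, postgresDialect{}} {
		fmt.Printf("SELECT %s FROM %s WHERE uuid IN (%s)\n",
			d.GroupConcat("m.name", ","),
			d.QuoteIdentifier("run_details"),
			strings.Join(d.Placeholders(2), ", "))
	}
}
```

Callers depend only on the interface, so adding another backend (as SQLite is used for tests) means adding one more implementation rather than touching query sites.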
- ListRuns Query Performance Optimization
  - Problem: The original ListRuns query called `addMetricsResourceReferencesAndTasks`, which performed a 3-layer LEFT JOIN with GROUP BY on all columns, including LONGTEXT fields like `PipelineSpecManifest`, `WorkflowSpecManifest`, etc. This caused slow response times for large datasets.
  - Solution:
    - Layers 1-3: LEFT JOIN only on the primary key (`UUID`) plus aggregated columns (refs, tasks, metrics)
    - Final layer: INNER JOIN back to `run_details` to fetch the LONGTEXT columns
  - Performance impact: Eliminates GROUP BY on LONGTEXT columns entirely. Expected substantial performance improvements for deployments with large pipeline specifications, though formal load testing has not yet been conducted.
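The join restructuring can be sketched roughly as follows. This is a simplified illustration, not the generated query; table and column names other than `run_details` and the manifest columns are hypothetical:

```sql
-- Before (sketch): GROUP BY ran over every selected column, including
-- LONGTEXT manifests, forcing the database to compare huge values per group.
-- After (sketch): aggregate over the narrow primary key only, then join the
-- wide LONGTEXT columns back in a final step.
SELECT rd.UUID,
       rd.PipelineSpecManifest,
       rd.WorkflowSpecManifest,
       agg.refs
FROM run_details rd
INNER JOIN (
    SELECT r.UUID,
           GROUP_CONCAT(rr.Payload) AS refs  -- string_agg(...) on PostgreSQL
    FROM run_details r
    LEFT JOIN resource_references rr ON rr.ResourceUUID = r.UUID
    GROUP BY r.UUID                          -- narrow key only, no LONGTEXT
) agg ON agg.UUID = rd.UUID;
```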
- Deployment Configurations
  - Production-ready PostgreSQL kustomization → `manifests/kustomize/env/platform-agnostic-postgresql/`
  - Local development setup → `manifests/kustomize/env/dev-kind-postgresql/`
  - PostgreSQL StatefulSet → `manifests/kustomize/third-party/postgresql/`

  Configuration is symmetric to the existing MySQL manifests for consistency.
- CI Manifest Overlays

  Created CI-specific Kustomize overlays to ensure tests use locally built images from the Kind registry instead of pulling official images from ghcr.io:
  - Added PostgreSQL CI overlay → `.github/resources/manifests/standalone/postgresql/`
  - Added `kfp-cache-server` image override to `.github/resources/manifests/standalone/base/kustomization.yaml`
- Added 2 PostgreSQL-specific CI workflows
  - V2 API and integration tests (cache enabled/disabled matrix) → `api-server-test-Postgres.yml`
  - V1 integration tests → `integration-tests-v1-postgres.yml`

  PostgreSQL tests cover the core cache enabled/disabled matrix.
- Local development support
  - `make dev-kind-cluster-pg` - Provision a Kind cluster with PostgreSQL
  - Updated the README for PostgreSQL setup and debugging, achieving parity with the MySQL documentation.
Testing
Unit Tests
- 23 test files modified/added
- New test coverage: `dialect_test.go`, `list_filters_test.go`, `sql_dialect_util_test.go`
- All existing tests updated to use the dialect abstraction
Integration Tests
- ✅ V1 API integration tests (PostgreSQL)
- ✅ V2 API integration tests (PostgreSQL, cache enabled/disabled)
- ✅ Existing MySQL tests remain green
Migration Guide
- For new deployments:

  ```shell
  kubectl apply -k manifests/kustomize/env/platform-agnostic-postgresql
  ```
- For existing MySQL deployments: no action required. This PR is fully backward compatible.
- For local development, to set up a Kind cluster with PostgreSQL:

  ```shell
  make -C backend dev-kind-cluster-pg
  ```
This PR continues from #12063.
Currently, both the MySQL and pgx (PostgreSQL) setups use the DB superuser for all KFP operations, which is why client_manager.go contains a "create database if not exists" step here.
From a security standpoint, would it be preferable to:
- Move DB creation out of the client manager and into the deployment/init phase (i.e. add a manifests/kustomize/third-party/postgresql/base/pg-init-configmap.yaml) and
- Introduce a dedicated restricted user for KFP components, limited to the mlpipeline database?
If the team agrees, I can propose a follow-up PR to refactor accordingly.
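For illustration, the proposed init ConfigMap might look roughly like this. All names, the password handling, and the exact grants are hypothetical; in practice the password should come from a Secret:

```yaml
# Hypothetical sketch of manifests/kustomize/third-party/postgresql/base/pg-init-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: postgres-init-scripts
data:
  init.sql: |
    CREATE DATABASE mlpipeline;
    CREATE USER kfp WITH PASSWORD 'change-me';
    GRANT ALL PRIVILEGES ON DATABASE mlpipeline TO kfp;
    \c mlpipeline
    GRANT ALL ON SCHEMA public TO kfp;
```

Scripts mounted into the official PostgreSQL image's init directory run once on first startup, so KFP components could then connect as the restricted `kfp` user without any create-database logic in the client manager.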
I'm fine with this, I don't think it's great that KFP tries to create a database (or a bucket frankly)
fyi @mprahl / @droctothorpe
Thanks, @HumairAK — totally agree on the security point. Since this PR is already getting quite heavy, would you be okay if I leave the user permission changes for a separate follow-up PR?
yes that is fine
Question about the PostgreSQL test workflow organization
Current situation
The V2 integration tests for PostgreSQL logically belong in a "PostgreSQL counterpart" to legacy-v2-api-integration-tests.yml. However, I didn't want to create a new workflow with "legacy" in the name from day one. As a temporary solution, I merged them into api-server-test-Postgres.yml. This causes asymmetry with api-server-tests.yml, and the workflow has mixed responsibilities.
Question: What's the recommended workflow organization for PostgreSQL tests?
Should I:
- a. Create legacy-v2-api-integration-tests-postgres.yml for consistency (even though it's new)?
- b. Keep current structure and accept the asymmetry?
- c. Refactor both MySQL and PostgreSQL to a unified structure?
Would love guidance on the long-term vision for test workflow organization, especially from @nsingla
Just a note: I have been working on similar functionality for Postgres support. I have modified the code so it no longer uses Squirrel and the SQL generator but uses native GORM throughout. This allows for CamelCase support in table names and in many cases has resulted in significantly simpler queries on the DB. By maintaining CamelCase, the scenarios where reflection is used to generate predicates work as is. All test cases are passing, and it would be great if we could look at potentially merging this with the work that has been done here. Using native GORM should also make it easier to add support for other databases if required. My branch is currently in line with master.
Hi @nickalexander-disco,
Thanks for your comment.
There are two main reasons why I decided not to use GORM for the store package:
-
GORM excels at simple CRUD operations but not at complex queries. The store layer in KFP is dominated by complex query patterns — dynamic SQL generation, multi-table JOINs, subqueries, aggregations, and conditional filters. These cannot be expressed cleanly in GORM without dropping back to raw SQL, which defeats the purpose of using an ORM.
-
GORM behaves as a black box, making debugging difficult. In the store layer, we often need full visibility and control over the generated SQL to debug query behavior across PostgreSQL, MySQL, and SQLite. GORM’s abstraction layer hides too much detail, and we can’t reliably reproduce or inspect SQL behavior in CI or integration tests.
Because complex queries are the dominant pattern here, I prioritized consistency of implementation — instead of mixing approaches, all queries in the store package are written using Squirrel, a transparent SQL builder that allows precise control, easy unit testing, and clear debugging.
That said, I'm open to evaluating where GORM provides advantages in practice. You mentioned that you have been working on similar functionality using native GORM which passes all tests. Could you please share:
- a concrete branch or PR with your code changes, and
- CI test results showing that all tests pass (feel free to reuse the CI workflow YAMLs in this PR)?
I’d be happy to review them and consider merging our efforts.
Question about the PostgreSQL test workflow organization
Current situation
The V2 integration tests for PostgreSQL logically belong in a "PostgreSQL counterpart" to legacy-v2-api-integration-tests.yml. However, I didn't want to create a new workflow with "legacy" in the name from day one. As a temporary solution, I merged them into api-server-test-Postgres.yml. This causes asymmetry with api-server-tests.yml, and the workflow has mixed responsibilities.
Question: What's the recommended workflow organization for PostgreSQL tests?
Should I:
- a. Create legacy-v2-api-integration-tests-postgres.yml for consistency (even though it's new)?
- b. Keep current structure and accept the asymmetry?
- c. Refactor both MySQL and PostgreSQL to a unified structure?

Would love guidance on the long-term vision for test workflow organization, especially from @nsingla
I actually would like to get rid of these legacy tests ASAP, but there are still a few tests that need to be migrated first, so my suggestion is to not add more work to the legacy workflows. Rather, can we add "database" as a workflow parameter, similar to "pipeline_store", and run tests against MySQL as well as Postgres in the same workflow?
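For illustration, such a workflow matrix could look roughly like this. The input names and values here are hypothetical, not the actual workflow definitions:

```yaml
# Hypothetical sketch: a "database" matrix axis alongside the existing
# "pipeline_store" parameter, so one workflow covers both backends.
jobs:
  api-server-tests:
    strategy:
      matrix:
        database: [mysql, postgres]
        pipeline_store: [database, kubernetes]
    steps:
      - name: Deploy KFP with ${{ matrix.database }}
        # Illustrative target; the real deploy step would select the
        # matching kustomize overlay for the chosen database.
        run: make -C backend deploy DB=${{ matrix.database }}
```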
Hi @nsingla,
Thanks for the context. I'd like to clarify the reasoning behind the current structure of api-server-test-Postgres.yml (which includes both API Tests and Integration Tests), as it addresses the constraints we discussed:
- Integration Tests: You mentioned we should avoid adding more workload to the existing legacy-v2-api-integration-tests.yml. Since the main api-server-tests.yml only runs Ginkgo tests and misses these legacy integration tests (which are critical for verifying Postgres support), I have placed them here. This ensures we have Postgres coverage without modifying the legacy workflow.
- API Tests: While I have added Postgres to the api-server-tests.yml matrix, I am keeping this separate workflow as a control group for now. It allows us to prove that the Postgres business logic is correct in isolation, ensuring we have a green signal while we work on stabilizing the main workflow matrix.

My plan is to treat api-server-test-Postgres.yml as a temporary bridge. Once the legacy integration tests are migrated to Ginkgo and the main workflow is stable, we can consolidate everything there and remove this file.
Does this approach sound reasonable to you as an interim solution?