
Integrate the catalog with the Calcite planner

Open paul-rogers opened this issue 2 years ago • 4 comments

This PR is currently a draft. Resolving merge conflicts after splitting out some of the code to other PRs.

Prior PRs added the catalog (table metadata) foundations, and an improved set of table functions. This PR brings it all together:

  • Validates the MSQ INSERT and REPLACE statements against the catalog
    • Clustering, partitioning and other table details can be set in the catalog instead of the SQL statement
    • Catalog types are loosely enforced for MSQ. (More work is needed to precisely enforce types.)
    • The catalog can create a "sealed" table: only columns defined in the catalog can be used in MSQ.
  • Allows defining external tables and partial external tables (AKA "connections") in the catalog, then filling in the remaining details at runtime via a table function.
  • Allows parameters (including array parameters) to work with MSQ queries
  • Extends the PARTITION BY clause to accept string literals for the time partitioning
  • Extends MSQ to give the planner control over the type of the emitted segment columns
  • MSQ ITs to validate the new "ad-hoc" table functions
  • Documentation
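
The "sealed" table rule above can be illustrated with a minimal sketch. This is not the validator code from this PR: the class name `SealedTableCheck`, its methods, and the column names are hypothetical, and the sketch assumes the check amounts to comparing the statement's target columns against the set of columns the catalog defines.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not the DruidSqlValidator code from this PR.
class SealedTableCheck {

  // Returns the statement columns that the catalog does not define,
  // in the order they appear in the statement.
  static List<String> undefinedColumns(Set<String> catalogColumns, List<String> insertColumns) {
    List<String> unknown = new ArrayList<>();
    for (String column : insertColumns) {
      if (!catalogColumns.contains(column)) {
        unknown.add(column);
      }
    }
    return unknown;
  }

  // Rejects the statement when the table is sealed and any column is undefined.
  // Non-sealed tables accept extra columns without complaint.
  static void validate(boolean sealed, Set<String> catalogColumns, List<String> insertColumns) {
    if (!sealed) {
      return;
    }
    List<String> unknown = undefinedColumns(catalogColumns, insertColumns);
    if (!unknown.isEmpty()) {
      throw new IllegalArgumentException(
          "Sealed table: columns not defined in the catalog: " + unknown);
    }
  }

  public static void main(String[] args) {
    // Assumed catalog definition for an example table.
    Set<String> catalogColumns = Set.of("__time", "page", "added");
    validate(true, catalogColumns, List.of("__time", "page"));
    System.out.println(undefinedColumns(catalogColumns, List.of("__time", "page", "extra")));
  }
}
```

Running the check at validation time, rather than in a statement handler, means a bad column name is reported before any ingest work is planned.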

To allow all the above to work:

  • Validation for MSQ statements moves out of the handlers into a Druid-specific version of the SQL validator.
  • Druid-specific Calcite operator to represent a Druid ingest.
  • The catalog API is passed into the Druid planner (which required changes in the many tests that set up the planner).
  • The catalog can now be enabled in the Broker to allow the planner to interact with the Druid table metadata extension.
  • Many new tests to verify the catalog integration and improved MSQ statement validation.
  • Improved catalog type parsing in anticipation of supporting complex types.
  • Factored out the "per run" items from the planner into a planner toolbox, leaving just the "per session" items in the planner.
  • Resource shuttle now handles "partial table functions" for items defined in the catalog.
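
As a rough sketch of the "partial table function" idea (some properties defined in the catalog, the rest supplied in the SQL call), the resolution step might look like the following. Everything here is an assumption for illustration: the class `PartialTableSketch`, the property names, and in particular the rule that a call may only fill in properties the catalog leaves undefined. The actual conflict rules are not described in this PR text.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not the catalog or planner code from this PR.
class PartialTableSketch {

  // Combines catalog-defined properties with arguments supplied at query time.
  // Assumed rule: call arguments may add properties but may not override
  // a property that the catalog already defines.
  static Map<String, Object> resolve(Map<String, Object> catalogProps, Map<String, Object> callArgs) {
    Map<String, Object> resolved = new LinkedHashMap<>(catalogProps);
    for (Map.Entry<String, Object> arg : callArgs.entrySet()) {
      if (resolved.containsKey(arg.getKey())) {
        throw new IllegalArgumentException(
            "Property already defined in the catalog: " + arg.getKey());
      }
      resolved.put(arg.getKey(), arg.getValue());
    }
    return resolved;
  }

  public static void main(String[] args) {
    // Assumed "connection" definition: where the data lives, but not which files.
    Map<String, Object> catalogProps = new LinkedHashMap<>();
    catalogProps.put("source", "http");
    catalogProps.put("uriPrefix", "https://example.com/data/");

    // Assumed runtime arguments from the table function call.
    Map<String, Object> callArgs = new LinkedHashMap<>();
    callArgs.put("files", "2023-01-18.csv");
    callArgs.put("format", "csv");

    System.out.println(resolve(catalogProps, callArgs));
  }
}
```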

Release note

This PR introduces the full catalog functionality. See the documentation files for the details. In this version, the catalog is an extension: you must enable the catalog extension to use the catalog. Enabling the extension creates an additional table in your metadata database. We consider the catalog to be experimental, and the metadata table schema is subject to change.

Table functions, introduced in a prior PR, are production ready and independent of the catalog. "Partial table functions" (define some of the properties in the catalog, some in SQL) are new in this PR and are experimental, along with the catalog itself.

Hints to reviewers

Much of this PR is doc files, test code and minor cleanup. The core changes (those that could break a running system if done wrong) are:

  • extensions-core/multi-stage-query/src/main/java/org/apache/druid/msq/*
  • sql/src/main/*

The real core of this PR is sql/src/main/java/org/apache/druid/sql/calcite/planner/DruidSqlValidator.java: the former ad-hoc INSERT and REPLACE validation moved here and now runs within the SQL validator.

No runtime code was changed: all the non-trivial changes are in the SQL planner.


This PR has:

  • [X] been self-reviewed.
  • [X] added documentation for new or modified features or behaviors.
  • [X] a release note entry in the PR description.
  • [X] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • [X] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • [X] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • [ ] added integration tests.
  • [ ] been tested in a test Druid cluster.

paul-rogers, Jan 18 '23 00:01

Split out many changes into separate PRs, and merged the result. Reduced the file count from 217 to 113. Still big, but not quite so big.

paul-rogers, Feb 27 '23 23:02

@paul-rogers, this PR https://github.com/apache/druid/pull/14023 moved some things around, including docs/multi-stage-query/reference.md, so that may be the cause of some merge conflicts.

techdocsmith, May 19 '23 18:05

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If you think that's incorrect or this pull request should instead be reviewed, please simply write any comment. Even if closed, you can still revive the PR at any time or discuss it on the [email protected] list. Thank you for your contributions.

github-actions[bot], Feb 02 '24 00:02

This pull request/issue has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

github-actions[bot], Mar 01 '24 00:03

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If you think that's incorrect or this pull request should instead be reviewed, please simply write any comment. Even if closed, you can still revive the PR at any time or discuss it on the [email protected] list. Thank you for your contributions.

github-actions[bot], May 12 '24 00:05

This pull request/issue has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

github-actions[bot], Jun 09 '24 00:06