sqlmesh icon indicating copy to clipboard operation
sqlmesh copied to clipboard

Feature Request: Support for ClickHouse Refreshable Materialized Views (RMVs)

Open ei-grad opened this issue 9 months ago • 3 comments

1. Introduction & Problem Statement

ClickHouse offers Refreshable Materialized Views (RMVs), a powerful feature declared production-ready in its 24.10 release. Unlike traditional ClickHouse materialized views that update incrementally on new data inserts, RMVs periodically recompute their defining query over the entire dataset. The results are stored in a dedicated target table, which is updated based on the RMV's refresh policy. This makes RMVs ideal for complex transformations, aggregations, and joins not suited for incremental updates. The refresh cadence and inter-RMV dependencies are defined directly in the DDL using REFRESH EVERY... [DEPENDS ON ...] syntax, with ClickHouse managing the periodic refresh execution. RMVs also support an APPEND mode, allowing new data to be added to the target table with each refresh, as an alternative to the default REPLACE behavior.

Currently, SQLMesh lacks native support for defining, deploying, and managing the lifecycle of ClickHouse RMVs, including their refresh mode. Users wishing to leverage RMVs must manage their DDL and refresh logic outside SQLMesh, forfeiting the benefits of SQLMesh's comprehensive data transformation governance, including DAG management, environment handling, schema evolution, and automated deployments.

1.1. Illustrative User Stories

  • As a Data Analyst, I want to define a complex daily aggregated sales report as a ClickHouse RMV within SQLMesh, specifying its refresh schedule directly in the model. This would allow me to rely on SQLMesh for DDL management and seamlessly integrate the RMV into my existing SQLMesh project.
  • As a Data Engineer, I want to create an RMV that joins multiple large tables and refreshes hourly, managed by SQLMesh. This would enable our BI tools to query fresh, pre-aggregated data quickly without heavily loading source tables, while I manage its schema and dependencies through SQLMesh's version control and deployment processes.

1.2. Current SQLMesh Support for ClickHouse Materialized Views & Scope of this Request

SQLMesh's current ClickHouse integration does not offer dedicated, built-in support for any type of ClickHouse Materialized View (neither traditional trigger-based MVs nor RMVs) as first-class model kinds.

  • Challenges with Traditional ClickHouse MVs & SQLMesh Paradigm:

    • Traditional ClickHouse MVs act as AFTER INSERT triggers on a source table, populating a separate target table. Users query this target table, not the MV object itself. This two-object nature (MV definition as a trigger + a separate target table) and the trigger-based population mechanism differ significantly from SQLMesh's typical model paradigm, where SQLMesh builds and directly populates the defined dataset (table or view).
    • Consequently, managing the DDL for the MV "trigger" and its associated target table as a cohesive, fully managed SQLMesh entity is not currently supported.
  • Scope of this Feature Request: This request focuses specifically on adding support for ClickHouse Refreshable Materialized Views (RMVs). RMVs, with their engine-managed refresh schedules defined in DDL and their results stored in a queryable target table, align much better with SQLMesh's MANAGED model concept (similar to Snowflake's auto-refreshing MVs). This makes RMV support a logical extension of SQLMesh's capabilities. Support for traditional, trigger-based ClickHouse MVs would require a distinct design and is outside the scope of this proposal.

2. Proposed Solution

We propose adding native support for ClickHouse Refreshable Materialized Views in SQLMesh. The core architectural approach is to treat ClickHouse RMVs as a type of "Managed Model". In this paradigm, SQLMesh is responsible for the DDL definition and schema lifecycle management (including the refresh mode), while the ClickHouse engine is responsible for executing the data refreshes based on the DDL-defined policy.

3. Detailed Design & Requirements

3.1. Model Kind and Architectural Approach

  • Leveraging/Extending MANAGED Model Mechanics (Preferred): Given that ClickHouse RMVs have their refresh lifecycle managed by the ClickHouse engine (based on the REFRESH EVERY ... clause in their DDL), their management by SQLMesh aligns closely with the principles of SQLMesh's existing MANAGED models (e.g., for Snowflake auto-refreshing objects). This also mirrors how other tools like dbt approach engine-managed materialized views, where the tool focuses on DDL deployment and the database engine handles the refresh execution.
    • A model for a ClickHouse RMV could use kind: MANAGED (or a more specific variant like MANAGED_MATERIALIZED_VIEW) with dialect: 'clickhouse'.
    • The SQLMesh ClickHouse adapter would interpret model properties to generate the correct RMV DDL, handle RMV-specific ALTER statements, and orchestrate ad-hoc refreshes via SYSTEM REFRESH VIEW.
    • This approach uses SQLMesh's understanding that it does not need to schedule or issue periodic refresh commands for such models, as the engine handles this.

3.2. SQLMesh Model Definition

Users will define RMVs in SQLMesh model files. Key properties include:

  • query: The RMV's SELECT query.
  • dialect: Must be 'clickhouse'.
  • materialized_properties (or a similar dedicated block like managed_properties or clickhouse_rmv_properties):
    • refresh_policy (string, required): Specifies the engine-managed refresh schedule and dependencies (e.g., 'EVERY 1 HOUR', 'EVERY 1 DAY DEPENDS ON my_project.another_rmv_model_name'). SQLMesh resolves model names to database object names.
    • refresh_mode (string, optional, default: 'REPLACE'): Defines how data is written to the target table on refresh.
      • 'REPLACE': (Default) The target table's content is atomically replaced.
      • 'APPEND': New query results are appended to the target table.
    • target_table_name (string, optional): Name for the TO [db.]table_name clause. SQLMesh can derive if omitted (e.g., view_name__target).
    • target_table_engine (string, optional): ClickHouse engine for the target table (e.g., 'MergeTree()'). Crucial for APPEND mode.
    • target_table_order_by / partition_by / primary_key (string or list, optional): Configuration for the target table's structure.
    • target_table_settings (dict, optional): Additional settings for the target table (e.g., {'index_granularity': 8192}).

3.3. DDL Generation

SQLMesh's ClickHouse adapter will generate the CREATE MATERIALIZED VIEW DDL, incorporating REFRESH <policy>, the conditional APPEND keyword, and the TO <target_table> clause. SQLMesh will manage the creation of the target_table_name with its specified properties, typically before creating the RMV:

CREATE MATERIALIZED VIEW [IF NOT EXISTS] [db.]<view_name>
REFRESH <refresh_policy_from_model>
[APPEND] TO [db.]<target_table_name>
AS
<model_query>;

3.4. Lifecycle Management

  • Creation: sqlmesh plan/apply creates the target table (with its engine, order by, etc.) and then the RMV pointing to it. ClickHouse manages scheduled refreshes.
  • Atomic Replacement (for REPLACE mode): When an RMV in REPLACE mode refreshes, ClickHouse computes the new result set and atomically swaps it with the previous content of the target table. This ensures queries against the RMV always see a consistent state and experience minimal disruption. SQLMesh relies on this inherent ClickHouse behavior and is responsible for defining the RMV correctly to enable it.
  • Alteration:
    • Changes to refresh_policy: SQLMesh generates ALTER TABLE [db.]<view_name> MODIFY REFRESH <new_refresh_policy>;.
    • Changes to refresh_mode, the query, or fundamental target_table_name properties: These likely require a DROP and CREATE sequence for the RMV (and potentially the target table if its schema changes). SQLMesh will manage this, clearly indicating implications (e.g., data handling for the target table).
  • Deletion: sqlmesh drop <model_name> executes DROP MATERIALIZED VIEW [IF EXISTS] [db.]<view_name>; and optionally DROP TABLE [IF EXISTS] [db.]<target_table_name>; if the target table is exclusively managed by this RMV model.

3.5. Manual (Ad-hoc) Refresh Trigger (sqlmesh refresh <model_name>)

  • This command translates to SYSTEM REFRESH VIEW [db.]<view_name>; in ClickHouse.
  • Corner Case Explanations:
    • Concurrent Refreshes: ClickHouse typically allows only one refresh operation per RMV at a time. If sqlmesh refresh is issued while an engine-managed refresh for the same RMV is in progress, ClickHouse will likely queue or reject the SYSTEM REFRESH VIEW command. SQLMesh could potentially check system.view_refreshes before issuing the command or rely on ClickHouse's handling, logging the outcome.
    • Dependencies (DEPENDS ON): If sqlmesh refresh is called on an RMV with DEPENDS ON clauses, the SYSTEM REFRESH VIEW command will trigger its refresh. ClickHouse's internal logic will then ensure this refresh waits for its specified dependencies if they are also due for refresh or are actively refreshing. sqlmesh refresh for a single model does not automatically cascade to its dependencies unless a future SQLMesh enhancement (e.g., --cascade) provides such functionality.
    • Interaction with APPEND mode: sqlmesh refresh on an APPEND mode RMV triggers another append of the current query's results to the target table.

3.6. Integration with SQLMesh Features

  • Standard Integrations: RMVs will integrate with SQLMesh's DAG, lineage, audits, tests, cron property (for SQLMesh-side tasks like audits, not for driving the RMV's DDL-based refresh), and environments.
  • ClickHouse Version Handling:
    • The adapter will target ClickHouse versions with production-ready RMVs (e.g., 24.10+).
    • For older versions requiring allow_experimental_refreshable_materialized_view = 1, SQLMesh will provide a connection-level configuration option in the project's SQLMesh config (e.g., gateways.<gateway_name>.connection_settings: { 'allow_experimental_refreshable_materialized_view': '1' } or a boolean flag like enable_experimental_rmv: true). If set, SQLMesh ensures this setting is applied before RMV DDL execution.
  • Documentation: SQLMesh documentation must clearly explain refresh_mode: 'APPEND' implications, especially regarding target table engine choice and query design for data consistency.

4. Example SQLMesh Model Definition

MODEL (
  name my_project.daily_sales_snapshots_rmv,
  dialect 'clickhouse',
  kind MANAGED,
  owner 'analytics_engineering',
  tags ['reporting', 'sales', 'clickhouse_rmv', 'snapshot'],
  start '2023-01-01',
  managed_properties ( -- Or a ClickHouse-specific block
    refresh_policy 'EVERY 1 DAY OFFSET \'1 HOUR\'',
    refresh_mode 'APPEND',
    target_table_name 'daily_sales_snapshots_target',
    target_table_engine 'MergeTree()',
    target_table_order_by '(snapshot_date, product_category)',
    target_table_partition_by 'toYYYYMM(snapshot_date)'
  )
);

SELECT
    today() AS snapshot_date,
    p.category AS product_category,
    COUNT(DISTINCT s.order_id) AS total_orders_for_yesterday,
    SUM(s.sale_amount) AS total_revenue_for_yesterday
FROM my_project.stg_sales AS s
JOIN my_project.stg_products AS p ON s.product_id = p.product_id
WHERE toDate(s.sale_timestamp) = yesterday() -- Query for appending daily snapshot
GROUP BY snapshot_date, product_category;

5. Key Benefits

  • Unified Management: Manage ClickHouse RMVs alongside all other data models within SQLMesh.
  • Declarative Definition: Clearly define RMVs, their refresh policies, and modes in version-controlled SQLMesh models.
  • CI/CD & Version Control: Apply Git-based workflows to RMV schema and DDL deployments.
  • Enhanced Lineage: Visualize RMV dependencies within the SQLMesh DAG.
  • Safe Environments: Utilize SQLMesh environments for developing and testing RMV changes.
  • Operational Efficiency: Automate RMV DDL management, reducing manual errors.
  • Consistency: Ensure uniform RMV definitions across development, staging, and production.
  • Advanced Use Cases: Native support for APPEND mode enables historical tracking and snapshotting.

6. Conclusion

Native SQLMesh support for ClickHouse Refreshable Materialized Views, leveraging the "Managed Model" paradigm and addressing specifics like APPEND mode and ad-hoc refreshes, will be a valuable addition. It empowers users to harness advanced ClickHouse features within a robust, declarative, and version-controlled data transformation framework, promoting reliable and maintainable data pipelines.

ei-grad avatar May 19 '25 11:05 ei-grad

Hey @ei-grad! First of all, thank you for such an excellent proposal. It contains just the right amount of detail and motivation. I really appreciate it.

I have a few questions / suggestions.

materialized_properties (or a similar dedicated block like managed_properties or clickhouse_rmv_properties):

I'm pretty sure we can reuse the existing physical_properties attribute for this. I believe the purposes of this existing attribute aligns with what you're trying to achieve here.

Deletion: sqlmesh drop <model_name>

There's currently no such command in SQLMesh. Instead SQLMesh relies on a janitor process which runs periodically and cleans up the old environments and / or versioned (physical) tables. Is there any reason we can't continue relying on the existing process?

Manual (Ad-hoc) Refresh Trigger (sqlmesh refresh <model_name>)

Again, there's currently no such command in SQLMesh, nor is there an overall process for manually refreshing specific models beyond the sqlmesh run command. I'm still not sure I fully understand why this is needed provided the "managed" nature of RMVs. I thought no manual intervention was necessary once the table was created.

Additionally, I'm a little concerned about adding an entirely new command and an additional flow to support a specific model kind for just one engine.

Otherwise, this proposal makes a lot of sense to me, and I believe it will be a great addition to the product and greatly benefit ClickHouse users.

PS: Sorry about the delayed reply. I took my time to digest this.

izeigerman avatar May 26 '25 16:05 izeigerman

Looking forward for this feature... any timelines?

pkotikalapudi avatar Aug 15 '25 15:08 pkotikalapudi

Looking forward for this feature... any timelines?

+1

finnoio avatar Sep 08 '25 10:09 finnoio