
Automatic Multi-Table Union via queryContext

599166320 opened this issue · 2 comments

Description

In many real-world scenarios, data with a common schema starts out in a single table. Over time, as data volume and storage costs grow, it's common to write new data into additional tables (same schema, different retention periods) while keeping the original table unchanged.

These tables are often used across a wide range of systems: some are referenced in Grafana dashboards, others are accessed by backend applications. Updating all of these consumers to accommodate new tables is costly and often impractical.

You might suggest reindexing the data into one table with Spark or Hadoop, but that approach consumes significant compute resources. You could also suggest manually querying across multiple tables, but updating hundreds or thousands of existing queries is not realistic.

Proposal

To address this, I’ve implemented support for automatically unioning multiple tables with the same schema at query time, using a custom parameter in queryContext.

Example configuration:

{
  "mergeWithTables": {
    "t_a": "t_a1,t_a2,t_a3"
  }
}

t_a is the primary table being queried.

t_a1, t_a2, and t_a3 are additional tables containing data with the same schema but different time ranges or retention policies.

With this configuration, a query on t_a automatically unions its data with that of t_a1, t_a2, and t_a3.
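
To make the semantics concrete, here is a minimal sketch in plain Java (not actual Druid code; the mergeWithTables key is the one proposed here, and the resolve helper is hypothetical) of how the Broker could expand the context entry into the full list of tables to scan:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical helper: given a query's context map and the table being
// queried, return every table that should participate in the union.
public final class MergeWithTables
{
  @SuppressWarnings("unchecked")
  public static List<String> resolve(Map<String, Object> queryContext, String table)
  {
    List<String> tables = new ArrayList<>();
    tables.add(table); // the primary table is always scanned

    Object raw = queryContext == null ? null : queryContext.get("mergeWithTables");
    if (raw instanceof Map) {
      Object extra = ((Map<String, Object>) raw).get(table);
      if (extra instanceof String) {
        // "t_a1,t_a2,t_a3" -> t_a1, t_a2, t_a3
        Arrays.stream(((String) extra).split(","))
              .map(String::trim)
              .filter(s -> !s.isEmpty())
              .forEach(tables::add);
      }
    }
    return tables;
  }
}

With the configuration above, resolve(context, "t_a") yields [t_a, t_a1, t_a2, t_a3].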

How It Works

When the query reaches the Broker:

1. The Broker reads the mergeWithTables entry from the query context.
2. It retrieves the timeline and segment information for all involved datasources (t_a, t_a1, t_a2, t_a3) from TimelineServerView.
3. It constructs the appropriate segment Sequences from the Historical nodes.
4. Results from all sources are merged transparently and returned as a single result set.
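
Conceptually, the rewrite replaces the query's table datasource with a union datasource before the timeline lookup. The sketch below illustrates the idea; TableDataSource and UnionDataSource are real Druid classes, but the rewrite hook and the exact constructor signature shown here are assumptions and may differ across Druid versions.

import java.util.List;
import java.util.stream.Collectors;

import org.apache.druid.query.DataSource;
import org.apache.druid.query.TableDataSource;
import org.apache.druid.query.UnionDataSource;

// Illustrative Broker-side rewrite step (hypothetical hook): a query on t_a
// becomes a query on union(t_a, t_a1, t_a2, t_a3), so segments of all four
// datasources are looked up and their result sequences merged as usual.
public final class MergeWithTablesRewriter
{
  public static DataSource rewrite(List<String> mergedTables)
  {
    List<TableDataSource> parts = mergedTables.stream()
        .map(TableDataSource::new) // TableDataSource(String name)
        .collect(Collectors.toList());
    // Note: UnionDataSource has taken List<TableDataSource> in past releases;
    // the constructor may differ in newer Druid versions.
    return new UnionDataSource(parts);
  }
}

Reusing the union-datasource machinery is attractive because the native query stack already knows how to fan a union out into per-table sub-queries and merge the resulting Sequences, which keeps the merge transparent to callers.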

This approach provides a powerful, low-friction way to partition data across tables by time or cost, without requiring any changes in consuming systems.

599166320 · May 30 '25

The context is bound to a single query - I think the problem you are describing has a bigger scope; shouldn't this work in a more global scope?

I think another approach could be to use table(append()) to do the concatenation - and possibly a view to declare the tables to be used
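
For concreteness, a query rewritten to this style could be issued over Druid's Avatica JDBC endpoint roughly as follows (a sketch: the broker host and port are placeholders, the table names reuse the example above, and the Avatica JDBC driver must be on the classpath):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public final class AppendQueryExample
{
  public static void main(String[] args) throws Exception
  {
    // Druid's SQL-over-JDBC (Avatica) endpoint; host and port are placeholders.
    String url = "jdbc:avatica:remote:url=http://broker-host:8082/druid/v2/sql/avatica/";
    try (Connection conn = DriverManager.getConnection(url);
         Statement stmt = conn.createStatement();
         // TABLE(APPEND(...)) concatenates same-schema tables at query time.
         ResultSet rs = stmt.executeQuery(
             "SELECT COUNT(*) FROM TABLE(APPEND('t_a', 't_a1', 't_a2', 't_a3'))")) {
      while (rs.next()) {
        System.out.println(rs.getLong(1));
      }
    }
  }
}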

kgyrtkirk · Jun 03 '25

@kgyrtkirk I just looked into the "Dynamic table append" you mentioned; it's quite new and very inspiring to me. It's great.

From my perspective, leveraging context offers greater flexibility and is more lightweight. It can be scoped either to a single query or extended to the client-database session level. Most importantly, it requires no modifications to existing SQL statements, resulting in near-zero code intrusion.

In contrast, adopting "Dynamic table append" necessitates altering all relevant SQL statements. Each query must explicitly include TABLE(APPEND('table1', 'table2', 'table3')), which introduces significant intrusion into the existing codebase.

Furthermore, as you pointed out, views provide a broader application scope but lack the desired flexibility. Consider a scenario where User A has consistently queried table1. Over time, table1 is partitioned into table2 and table3, reducing the volume of data held in the original table1; continuing to query the physical table1 alone would therefore be faster. However, if we redefine table1 as a view such as table1 = TABLE(APPEND(table1, table2, table3)), User A's queries would unnecessarily scan the larger combined dataset.

Moreover, reusing the same name for both the physical table and the view (i.e., table1) could lead to naming conflicts and operational confusion.

599166320 · Jun 07 '25