
Enhancing Query Context Handling: Introducing SQL-Level SETTINGS, Raw SQL Support, System Table and Context Validation

Open FrankChen021 opened this issue 9 months ago • 11 comments

Related PRs

  • Support of SET statement: #17894
  • Raw SQL on HTTP endpoints: #17937
  • A sys.settings system table and validation of query parameters: #18087
  • [ ] TODO: SETTINGS subclause in SQL

Motivation

The query context is part of the query request. Under the current implementation, the query context and the SQL are separated. That makes sense for native queries, where the query and the query context are kept in separate fields. For SQL, however, this design complicates the request -- we have to wrap SQL in JSON.

{
  "query":"SELECT page, COUNT(*) AS Edits FROM wikipedia WHERE \"__time\" BETWEEN TIMESTAMP '2015-09-12 00:00:00' AND TIMESTAMP '2015-09-13 00:00:00' GROUP BY page ORDER BY Edits DESC LIMIT 10",
  "context":  {
    "enableParallelMerge": false
  }
}

This is NOT a straightforward way to use SQL.

The web console helps by encapsulating the query into JSON for us; however, there are still problems:

  1. After writing the SQL, we have to use the 'Edit Context' feature on the web console to customize settings.
  2. The query context is supposed to be query level, not client-application level, yet the web console remembers the edited query context for future queries, which might not be what we expect. Sometimes we forget to reset the query context, or we have to delete it manually.

Another problem is that there's no validation of query context items. We can put ANYTHING in the query context; if there's a typo in a query context attribute name, Druid DOES NOT tell us about it.

Last but not least, we DON'T know which query context properties Druid supports; we have to read through several documentation pages to find out, such as:

  • https://druid.apache.org/docs/latest/querying/searchquery#query-context
  • https://druid.apache.org/docs/latest/querying/query-context
  • https://druid.apache.org/docs/latest/querying/groupbyquery#advanced-configurations

Proposal

Let's solve these problems together.

Firstly, let's introduce a SETTINGS subclause in the SQL statement.

This subclause accepts a list of key-value pairs, where each key is a supported query context property and the value is the value to set for that property. For example:

SELECT * FROM wikipedia
SETTINGS enableParallelMerge = false, sqlOuterLimit = 10

Since the query context is now part of the SQL, it's natural for users to add query context properties per query as they want.

Other databases solve this problem in different ways.

  • OLTP databases like MySQL provide a SET statement that allows users to change session-level variables. Since Druid has no 'session' concept -- queries are executed over HTTP connections -- this alternative is NOT applicable to Druid.
  • Some databases, like StarRocks, allow users to customize variables in a SQL hint, like:
SELECT /*+ SET_VAR(query_mem_limit = 8589934592) */ name FROM people ORDER BY name;

This does not require changes to the SQL parser, but the biggest disadvantage is that it's not user friendly.

  • SQL Server provides an OPTION subclause as a query hint, which is similar to this proposal:
SELECT * FROM FactResellerSales
OPTION (LABEL = 'q17');

The proposed change is not easy in Druid because it requires us to customize Calcite by editing the file sql/src/main/codegen/config.fmpp. Internally, the parser would convert the SETTINGS clause into a QueryContext object.

Secondly, let's improve the /druid/v2/sql endpoint by allowing Druid to accept raw SQL text instead of only JSON.

If the Content-Type is application/json, which is the current behaviour, Druid treats the input as JSON; otherwise it treats the entire input as raw SQL text.

In this mode, we can send SQL to Druid in a much simpler way:

curl -X 'POST' -d 'SELECT * FROM wikipedia SETTINGS enableParallelMerge = false, sqlOuterLimit = 10'  http://localhost:8888/druid/v2/sql 
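
For comparison, the existing JSON form keeps working when the request explicitly sets Content-Type: application/json:

curl -X 'POST' -H 'Content-Type: application/json' -d '{"query": "SELECT * FROM wikipedia", "context": {"sqlOuterLimit": 10}}' http://localhost:8888/druid/v2/sql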

Thirdly, inside Druid, let's define a sys.settings system table to hold all query context properties.

We should put all query context properties together and register them in this table so that they can be managed in a single place.

The schema of this table should be something like the following:

Column Name     Type     Description
name            String   Query context property name
type            String   Type of this property
default_value   String   The default value of this property if it's not given in the user's query
description     String   The description of this property
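
For example, users could then discover supported properties with an ordinary query against this (proposed) table -- illustrative only, since the table does not exist yet:

SELECT name, type, default_value, description
FROM sys.settings
WHERE name LIKE '%Limit%'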

With this table:

  • It's very easy for users to see which and how many properties are supported in the query context. There's no need to check the documentation, which by default shows the latest version and may differ from the version users are running; querying the sys.settings table always tells them which properties are supported.
  • web-console can also use this system table for better code completion and user experience

Fourthly, Druid MUST verify that the query context properties given in user queries are valid.

When a query comes into Druid, it should verify that the given query context properties are pre-defined and valid. It MUST reject any query with bad query context settings.
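
A minimal sketch of the idea (the class and registry names below are illustrative only, not an existing Druid API): validation would check each incoming key against the same registry that backs sys.settings.

import java.util.Map;
import java.util.Set;

// Illustrative only: reject queries whose context contains unknown property names.
// The "knownProperties" registry would be backed by the same data that feeds sys.settings.
class QueryContextValidator {
  private final Set<String> knownProperties;

  QueryContextValidator(Set<String> knownProperties) {
    this.knownProperties = knownProperties;
  }

  void validate(Map<String, Object> queryContext) {
    for (String key : queryContext.keySet()) {
      if (!knownProperties.contains(key)) {
        throw new IllegalArgumentException("Unknown query context property: " + key);
      }
    }
  }
}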

Changes 1, 2, and 3 above are independent (so they can be done separately), while the validation of query context attributes might share the same internal data structure as the sys.settings table.

FrankChen021 avatar Mar 02 '25 09:03 FrankChen021

I like all of these ideas. They would help greatly with usability. About the first idea,

Firstly, let's introduce a SETTINGS subclause in the SQL statement.

Actually, at one point last year I did look into doing this, and I wrote a proof of concept that enabled syntax like:

REPLACE INTO wikipedia2 OVERWRITE ALL
SELECT * FROM wikipedia
PARAMETER maxNumTasks => '8'

I did the PoC for INSERT and REPLACE only, not SELECT, because we don't currently have a custom SELECT statement. Although I think we could add one similar to our custom EXPLAIN statement, by adding entries to config.fmpp and adding a select.ftl similar to explain.ftl.

I didn't continue working on this PR at the time because I ran into a problem: the approach I used didn't work for all context parameters. I applied them to the context used for execution, which works for runtime properties, but doesn't work for properties that affect SQL planning (such as useApproximateCountDistinct). Some more work would be needed to adjust it to apply the parameters earlier in the process, so they could affect SQL planning too.

Anyway, it would be great to pick this back up. My branch no longer merges cleanly against master, but I can see if I can fix that. Or if anyone's interested in looking into that, I can put up the old branch as-is.

gianm avatar Apr 01 '25 23:04 gianm

Any reason not to just allow requests to contain SET statements prior to an insert/replace/select statement? That seems to solve the problem of needing things prior to planning the insert/replace/select, because we could parse them individually and build up the context before planning.

Like allow the request to contain something like:

SET maxNumTasks 8;
REPLACE INTO wikipedia2 OVERWRITE ALL
SELECT * FROM wikipedia

or:

SET enableParallelMerge false;
SET sqlOuterLimit 10;
SELECT * FROM wikipedia

This syntax is used by lots of other databases too:

  • https://learn.microsoft.com/en-us/sql/t-sql/statements/set-statements-transact-sql?view=sql-server-ver16
  • https://www.postgresql.org/docs/current/sql-set.html
  • https://dev.mysql.com/doc/refman/8.4/en/set-statement.html
  • https://docs.pinot.apache.org/users/user-guide-query/query-options#set-statement
  • https://clickhouse.com/docs/sql-reference/statements/set
  • https://docs.vertica.com/23.3.x/en/sql-reference/statements/set-statements/

In this model we would consider the HTTP request a 'session' and restrict the allowed set of statements to zero or more SET statements followed by a single insert/replace/select statement.

clintropolis avatar Apr 01 '25 23:04 clintropolis

Any reason not to just allow requests to contain SET statements prior to an insert/replace/select statement?

That would work too, and would solve the "need to feed parameters into the validator" issue that I ran into. Do you know if Calcite supports multi-statement SQLs? I suppose if not, we could try to extend the parser to support that.

gianm avatar Apr 02 '25 16:04 gianm

Why wasn't SET proposed here? SET is mainly found in databases that have sessions, where a single SET statement can be run to update settings at the session level.

However, Druid has NO session, which means executing a single SET statement on its own does not make sense. To make the SET statement work, we need to send the SET statement and the subsequent SQL together as multiple statements.

From an end user's view, I think this introduces some complexity.

FrankChen021 avatar Apr 03 '25 15:04 FrankChen021

I think SET would make sense if our stance is that each HTTP call is its own session. I am not sure if other databases do this, so maybe it's too weird, but it makes sense to me at least. TBH, I am not too picky about how we do this. Any way of setting parameters in the SQL text would be an improvement.

gianm avatar Apr 03 '25 17:04 gianm

Based on my knowledge, Calcite does not support multiple statements. See: https://github.com/apache/druid/issues/17758

FrankChen021 avatar Apr 07 '25 06:04 FrankChen021

We might not have our Calcite stuff wired up to support multiple statements, but poking around in the Calcite code there is a SqlNodeList for parsing statements separated by a semicolon (https://github.com/apache/calcite/blob/main/core/src/main/codegen/templates/Parser.jj#L1111, https://github.com/apache/calcite/blob/main/core/src/main/java/org/apache/calcite/sql/SqlNodeList.java, etc.). I'll look into it to see if I can figure out how to allow us to use it.

clintropolis avatar Apr 07 '25 20:04 clintropolis

It looks like if we change this to call parser.parseStmtList() instead of parser.parseStmt(), we can parse lists of statements correctly, then check whether the root is a SqlNodeList in DruidPlanner (roughly here: https://github.com/apache/druid/blob/master/sql/src/main/java/org/apache/druid/sql/calcite/planner/DruidPlanner.java#L145) and do something based on whether or not there is a list of statements.
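
Roughly, the shape could look like the sketch below. This is only an illustration of the idea, not the actual DruidPlanner change; in particular, whether a bare SET statement parses to Calcite's SqlSetOption or needs a Druid-specific parse node is an assumption here.

import java.util.Map;
import org.apache.calcite.sql.SqlNode;
import org.apache.calcite.sql.SqlNodeList;
import org.apache.calcite.sql.SqlSetOption;
import org.apache.calcite.sql.parser.SqlParseException;
import org.apache.calcite.sql.parser.SqlParser;

// Illustrative only: parse a statement list, fold SET statements into a context map,
// and keep the single remaining INSERT/REPLACE/SELECT for planning.
class MultiStatementSketch {
  static SqlNode extractSets(SqlParser parser, Map<String, Object> contextOut) throws SqlParseException {
    SqlNodeList statements = parser.parseStmtList();
    SqlNode query = null;
    for (SqlNode node : statements) {
      if (node instanceof SqlSetOption) {
        // Assumes SET parses to Calcite's SqlSetOption; a custom parse node would work the same way.
        SqlSetOption set = (SqlSetOption) node;
        contextOut.put(set.getName().toString(), String.valueOf(set.getValue()));
      } else {
        query = node; // expect exactly one non-SET statement
      }
    }
    return query;
  }
}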

clintropolis avatar Apr 07 '25 20:04 clintropolis

It looks like if we change this to call parser.parseStmtList() instead of parser.parseStmt(), we can parse lists of statements correctly, then check whether the root is a SqlNodeList in DruidPlanner and do something based on whether or not there is a list of statements.

I opened #17894, which allows HTTP-based SQL requests to do this. JDBC continues to work as it presently does, since it doesn't really make sense to allow a JDBC statement to include more than one SQL statement, and JDBC already has a mechanism to define the query context via the properties set on the connection when it is opened.

If the syntax is acceptable and people don't think it is too weird to think of an SQL HTTP request as a single/complete session, I think it satisfies the first ask here - all of the other stuff sounds pretty nice to have too, especially the string requests (after the SET statement change) and the sys table with all the available context options.

Validation is perhaps a bit trickier, but I agree, as you mentioned, that we would need some sort of context registry to support a system table, so it probably wouldn't be too much effort to add validation facilities to that once it exists. I do think we need to be careful here: unknown context parameters should probably still be allowed (at least optionally), as I've seen the free-form nature of the query context used to attach additional custom tracking information to Druid queries from applications so that it ends up available to the metrics system.

clintropolis avatar Apr 09 '25 21:04 clintropolis

@clintropolis Thank you for moving to tackle the problem.

unknown context parameters should probably still be allowed (at least optionally), as I've seen the free-form nature of the query context used to attach additional custom tracking information to Druid queries from applications so that it ends up available to the metrics system.

I don't know about that; could you show me an example? But I agree that we can define a configuration to enable/disable the validation behaviour.

FrankChen021 avatar Apr 10 '25 01:04 FrankChen021

I don't know about that; could you show me an example? But I agree that we can define a configuration to enable/disable the validation behaviour.

In our deployment, we have custom implementations of QueryMetrics, GroupByQueryMetrics, etc., which recognize certain context parameters and include them as dimensions when emitting metrics. Users can use these context parameters to add tags that show up in the metrics. For example, one of the context parameters could be applicationName, which allows users to set "applicationName": "foo" on their queries and then later check the average query time for application foo.

If extensions could add to the list of valid parameters, that would make it possible to have validation active and still do what I just described. (Although it should still be possible to disable validation anyway.)

gianm avatar Apr 15 '25 06:04 gianm