duckdb-extension-clickhouse-sql
URL Flock function
url_flock
The url_flock function was inspired by an idea of @carlopi from DuckDB Labs 👋
When used in combination with the httpserver extension (or compatible endpoints such as the ClickHouse HTTP interface with the JSONEachRow format), the url_flock helper builds a multi-backend query and aggregates the results with UNION ALL.
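As a rough illustration, a two-backend call behaves like fetching JSONEachRow from each endpoint and concatenating the results. This is a hedged sketch rather than the literal implementation: the server URLs, the query-string format and the use of read_json_auto are placeholders for whatever url_flock does internally.
-- Hypothetical expansion of url_flock('SELECT 1 AS x', ['http://server1', 'http://server2'])
-- Each backend answers in JSONEachRow/NDJSON and the per-backend results are concatenated
-- (the query string would be URL-encoded in practice)
SELECT * FROM read_json_auto('http://server1/?default_format=JSONEachRow&query=SELECT+1+AS+x')
UNION ALL
SELECT * FROM read_json_auto('http://server2/?default_format=JSONEachRow&query=SELECT+1+AS+x');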
Limitations
Currently the function expects all backends to return the same columns (or no data).
Examples
D SELECT * FROM url_flock('SELECT ''hello'', version()', ['http://server1', 'http://server2']);
┌─────────┬─────────────┐
│ 'hello' │ "version"() │
│ varchar │   varchar   │
├─────────┼─────────────┤
│ hello   │ v1.1.1      │
│ hello   │ v1.1.1      │
└─────────┴─────────────┘
The construct also supports token-based authentication against the httpserver extension:
D CREATE SECRET extra_http_headers (
    TYPE HTTP,
    EXTRA_HTTP_HEADERS MAP {
        'X-API-Key': 'secret'
    }
);
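With the secret in place, the extra headers should accompany the HTTP requests url_flock issues, so a token-protected backend can be queried as usual. A minimal sketch, assuming server1 and server2 are placeholder endpoints that expect the X-API-Key header defined above:
-- The X-API-Key header from the secret is sent along with each backend request
SELECT * FROM url_flock('SELECT version()', ['http://server1', 'http://server2']);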
Known Issues
HTTP Auth
HEAD requests fail when Basic Auth credentials (user:password@host) are included in the URL. X-Token headers work as expected (see the sketch below).
- [ ] Debug Basic Authentication handler
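For reference, the failing and working forms look roughly like this (hosts and credentials are placeholders, and the second call assumes an EXTRA_HTTP_HEADERS secret like the one above):
-- Fails today: Basic Auth credentials embedded in the URL break the HEAD request
SELECT * FROM url_flock('SELECT 1', ['http://user:password@server1']);
-- Works: token sent via an EXTRA_HTTP_HEADERS secret instead of URL credentials
SELECT * FROM url_flock('SELECT 1', ['http://server1']);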
Online example mixing DuckDB httpserver and ClickHouse API requests using JSONEachRow/NDJSON
-- Create a backends table
CREATE TABLE IF NOT EXISTS backends (who VARCHAR, url VARCHAR);
-- Insert some backends
INSERT INTO backends VALUES
('httpserver', 'https://quackpy.fly.dev'),
('httpserver2', 'https://cowsdb.fly.dev');
-- Run queries across all backends
SET VARIABLE __backends = (SELECT ARRAY_AGG(url) AS urls_array FROM backends);
SELECT * FROM url_flock('SELECT ''hello'', version()', getvariable('__backends'));
@carlopi we're almost there! Do you perhaps know of a smarter way to achieve the following without passing via a SET?
-- Run queries across all backends
SET VARIABLE __backends = (SELECT ARRAY_AGG(url) AS urls_array FROM backends);
SELECT * FROM url_flock('SELECT ''hello'', version()', getvariable('__backends'));
TableFunction f(
    "url_flock",
    {LogicalType::VARCHAR, LogicalType::LIST(LogicalType::VARCHAR)},
    DuckFlockImplementation,
    DuckFlockBind,
    nullptr,
    nullptr
);
If we try to pass a ::VARCHAR[] as the second parameter directly, it complains that the field can't be cast to the destination type LIST.
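Concretely, the binding accepts a list literal but rejects the aggregated array when it is passed directly. A sketch of the behaviour described above (the exact error text may vary):
-- Binds fine: a list literal matches LIST(VARCHAR)
SELECT * FROM url_flock('SELECT 1', ['http://server1', 'http://server2']);
-- Complains about casting to the destination type LIST when passed directly
SELECT * FROM url_flock('SELECT 1', (SELECT ARRAY_AGG(url) FROM backends)::VARCHAR[]);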
I will take a look at the code, it's either hard (due to read_json not being table-in table-out) or simple.
One thing that I think could make sense, possibly optionally via a parameter, is adding a column duck that for each row gives the identifier of the endpoint that generated the row.
And possibly making it so it's possible to query the local DuckDB via the same interface (or an easy way to add it to the flock).
The other question I have is which extension this needs to reside in; I am not completely sure, possibly http_server could also make sense. It's possible to have cross-extension dependencies (somewhat), say by doing LoadExtension('http_server') on the Load of chsql.
> I will take a look at the code, it's either hard (due to read_json not being table-in table-out) or simple.
Thanks in advance! We originally tried making this with a regular MACRO, but condition 1 above was preventing it...
> One thing that I think could make sense, possibly optionally via a parameter, is adding a column duck that for each row gives the identifier of the endpoint that generated the row. And possibly making it so it's possible to query the local DuckDB via the same interface (or an easy way to add it to the flock).
That would be amazing, but I cannot imagine how :) I also thought about passing a DB/schema as a setting to have the function do the host lookups directly, but that brings a different set of challenges as well. Curious what you think.
> The other question I have is which extension this needs to reside in; I am not completely sure, possibly http_server could also make sense. It's possible to have cross-extension dependencies (somewhat), say by doing LoadExtension('http_server') on the Load of chsql.
This function was originally added to httpserver, but I felt it would bloat an already complex role with an even more complex dependency, so we decided to stash it in the chsql extension as a superpowered url function just to get it on the field, since by design this function should be compatible with the ClickHouse HTTP API using JSONEachRow as the format. However, we're happy to revert direction with anything you would suggest to make sure it's as logically viable as it can be.
Once again thanks in advance for helping us square this out and for the original inspiration :)
I was thinking something like:
CREATE TABLE IF NOT EXISTS backends (who VARCHAR, url VARCHAR);
-- Insert some backends
INSERT INTO backends VALUES
('httpserver', 'https://quackpy.fly.dev'),
('httpserver2', 'https://cowsdb.fly.dev');
SELECT * FROM backends, LATERAL (FROM url_flock('SELECT version()', backends.url));
This could possibly work, making url_flock take a single backend, BUT it becomes table-in, table-out. I'm unsure of the interface, but I think it makes sense to register backends in (or delete them from) a table and use that as the reference.
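For what it's worth, the LATERAL shape would also give the per-row endpoint identifier almost for free, since the backends columns stay in scope. A sketch, assuming the single-backend, table-in table-out variant exists:
-- Tag every returned row with the backend it came from
SELECT backends.who, r.*
FROM backends, LATERAL (FROM url_flock('SELECT version()', backends.url)) AS r;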
I will try to make some progress later; I thought this was still interesting to share.
Purely to provide further context, the following ClickHouse functions are in scope for this function's emulation:
SELECT * FROM remote(127.0.0.1, view(SELECT a, b, c FROM table_name));
SELECT * FROM cluster(cluster_name, view(SELECT a, b, c FROM table_name));
Essentially this is querying a remote node or a cluster of nodes, just like ours, with the parameters inverted - the above variations can be implemented as simple macro aliases on top of the final command we'll come up with. Also for context, the 'cluster_name' approach seems a good one to follow, allowing users to group servers by cluster_name when selecting from a backends table.
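If it helps, once the final command settles, those ClickHouse-style entry points could be thin wrappers along these lines. This is a sketch only: the macro signatures, the use of the who column as a cluster name, and whether a subquery list binds here are all assumptions.
-- remote(): query a single endpoint
CREATE OR REPLACE MACRO remote(host, q) AS TABLE
    SELECT * FROM url_flock(q, [host]);
-- cluster(): query every endpoint registered under a cluster name in the backends table
CREATE OR REPLACE MACRO cluster(cluster_name, q) AS TABLE
    SELECT * FROM url_flock(q, (SELECT ARRAY_AGG(url) FROM backends WHERE who = cluster_name));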
It's only tangentially related and I don't think the ergonomics in DuckDB afford this yet, but Arrow IPC over HTTP could be really interesting in conjunction
> It's only tangentially related and I don't think the ergonomics in DuckDB afford this yet, but Arrow IPC over HTTP could be really interesting in conjunction
@nicosuave I think this might be possible using to_arrow_ipc and BLOB as the type with the url() macro
EDIT: some other interesting example here
> I will take a look at the code, it's either hard (due to read_json not being table-in table-out) or simple.
I'm super curious about table-in, table-out, as I do not believe I understand it yet, nor how to use it. Thanks in advance!
> The other question I have is which extension this needs to reside in; I am not completely sure, possibly http_server could also make sense. It's possible to have cross-extension dependencies (somewhat), say by doing LoadExtension('http_server') on the Load of chsql.
Ultimately you were right. We're going to have a flock function in http_server, alongside the discovery API to fill it up!