
URL Flock function

Open lmangani opened this issue 1 year ago • 9 comments


url_flock

The url_flock function was inspired by an idea of @carlopi from DuckDB Labs 👋

When used in combination with the httpserver extension (or compatible endpoints such as ClickHouse HTTP using format JSONEachRow), the url_flock helper builds a multi-backend query with UNION ALL aggregation of the results.
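As a rough illustration (a sketch, not the extension's actual implementation), the aggregation step can be modeled in Python: each backend returns a JSONEachRow/NDJSON body, and the rows are simply concatenated, as UNION ALL would do:

```python
import json

def flock_union_all(payloads):
    """Concatenate JSONEachRow/NDJSON payloads from several backends,
    emulating the UNION ALL aggregation url_flock performs.
    `payloads` is a list of raw NDJSON response bodies."""
    rows = []
    for text in payloads:
        for line in text.splitlines():
            if line.strip():
                rows.append(json.loads(line))
    return rows
```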

Limitations

Currently the function expects all backends to return the same columns (or no data).

Examples

D SELECT * FROM url_flock('SELECT ''hello'', version()', ['http://server1', 'http://server2']);
┌─────────┬─────────────┐
│ 'hello' │ "version"() │
│ varchar │   varchar   │
├─────────┼─────────────┤
│ hello   │ v1.1.1      │
│ hello   │ v1.1.1      │
└─────────┴─────────────┘

The function supports token-based authentication against the httpserver extension:

D CREATE SECRET extra_http_headers (
      TYPE HTTP,
      EXTRA_HTTP_HEADERS MAP{
          'X-API-Key': 'secret'
      }
  );

Known Issues

HTTP Auth

HEAD requests fail when Basic Auth credentials (user:password@host) are included in the URL; X-Token headers work.

  • [ ] Debug Basic Authentication handler
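Until the handler is debugged, one possible client-side workaround (a hedged sketch; the function name is illustrative and not part of the extension) is to strip the user:password@ userinfo from the URL and send it as an Authorization header instead:

```python
import base64
from urllib.parse import urlsplit, urlunsplit

def split_basic_auth(url):
    """Move inline user:password@ credentials out of a URL into an
    Authorization header, avoiding the failing inline Basic Auth path."""
    parts = urlsplit(url)
    headers = {}
    if parts.username:
        token = base64.b64encode(
            f"{parts.username}:{parts.password or ''}".encode()).decode()
        headers["Authorization"] = f"Basic {token}"
    # Rebuild the netloc without the userinfo component
    netloc = parts.hostname or ""
    if parts.port:
        netloc += f":{parts.port}"
    clean = urlunsplit((parts.scheme, netloc, parts.path,
                        parts.query, parts.fragment))
    return clean, headers
```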

lmangani avatar Oct 18 '24 18:10 lmangani

Online example mixing DuckDB httpserver and ClickHouse API backends using JSONEachRow/NDJSON:

--- Create a Backends table
CREATE TABLE IF NOT EXISTS backends (who VARCHAR, url VARCHAR);

--- Insert some backends
INSERT INTO backends VALUES 
('httpserver', 'https://quackpy.fly.dev'),
('httpserver2', 'https://cowsdb.fly.dev');

--- Run queries across all backends
SET variable __backends = (SELECT ARRAY_AGG(url) AS urls_array FROM backends);
SELECT * FROM url_flock('SELECT ''hello'', version()', getvariable('__backends') );

lmangani avatar Oct 18 '24 23:10 lmangani

@carlopi we're almost there! Do you perhaps know of a smarter way to achieve the following without passing via a SET?

--- Run queries across all backends
SET variable __backends = (SELECT ARRAY_AGG(url) AS urls_array FROM backends);
SELECT * FROM url_flock('SELECT ''hello'', version()', getvariable('__backends') );
TableFunction f(
    "url_flock",
    {LogicalType::VARCHAR, LogicalType::LIST(LogicalType::VARCHAR)},
    DuckFlockImplementation,
    DuckFlockBind,
    nullptr,
    nullptr
);

If we try to pass a ::VARCHAR[] directly as the second parameter, it complains that the field can't be cast to the destination type LIST.

lmangani avatar Oct 22 '24 21:10 lmangani

I will take a look at the code, it's either hard (due to read_json not being table-in table-out) or simple.

One thing that I think could make sense, possibly optional via a parameter, is adding a column duck that for each row gives the identifier of the endpoint that generated it. And possibly making it possible to query the local DuckDB via the same interface (or providing an easy way to add it to the flock).
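For illustration only (a sketch of the proposed behavior, not the extension's code), tagging each merged row with its originating endpoint could look like:

```python
import json

def flock_with_origin(results):
    """Merge NDJSON payloads and tag each row with the endpoint that
    produced it, as the proposed `duck` column would.
    `results` is a list of (endpoint_url, ndjson_text) pairs."""
    rows = []
    for url, text in results:
        for line in text.splitlines():
            if line.strip():
                row = json.loads(line)
                row["duck"] = url  # provenance column (proposed name)
                rows.append(row)
    return rows
```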

The other question I have is which extension this needs to reside in. I am not completely sure; possibly http_server could also make sense. It's possible to have cross-extension dependencies (somewhat), say by doing LoadExtension('http_server') on the Load of chsql.

carlopi avatar Oct 23 '24 10:10 carlopi

I will take a look at the code, it's either hard (due to read_json not being table-in table-out) or simple.

Thanks in advance! We originally tried building this as a regular MACRO, but the limitation above prevented it...

One thing that I think could make sense, possibly optional via a parameter, is adding a column duck that for each row gives the identifier of the endpoint that generated it. And possibly making it possible to query the local DuckDB via the same interface (or providing an easy way to add it to the flock).

That would be amazing, but I cannot imagine how :) I also thought about passing a DB/schema as a setting to have the function do the host lookups directly, but that brings a different set of challenges as well. Curious what you think.

The other question I have is which extension this needs to reside in. I am not completely sure; possibly http_server could also make sense. It's possible to have cross-extension dependencies (somewhat), say by doing LoadExtension('http_server') on the Load of chsql.

This function was originally added to httpserver, but I felt it would bloat an already complex role with an even more complex dependency, so we decided to stash it in the chsql extension as a superpowered url function just to get it into the field, since by design this function should be compatible with the ClickHouse HTTP API using JSONEachRow as its format. However, we're happy to change direction with anything you'd suggest to make it as logically viable as it can be.

Once again, thanks in advance for helping us sort this out and for the original inspiration :)

lmangani avatar Oct 23 '24 10:10 lmangani

I was thinking something like:

CREATE TABLE IF NOT EXISTS backends (who VARCHAR, url VARCHAR);

--- Insert some backends
INSERT INTO backends VALUES 
('httpserver', 'https://quackpy.fly.dev'),
('httpserver2', 'https://cowsdb.fly.dev');
SELECT * FROM backends, LATERAL (FROM url_flock('SELECT version()', backends.url));

could possibly work, making url_flock a table-in, table-out function that takes a single backend. Unsure of the interface, but I think it makes sense to register backends in a table (and delete them from it) and use that as the reference.

I will try to make some progress later, but I thought this was still interesting to share.

carlopi avatar Oct 23 '24 11:10 carlopi

Purely to provide further context, the following ClickHouse functions are in scope for this function's emulation:

SELECT * FROM remote(127.0.0.1, view(SELECT a, b, c FROM table_name));
SELECT * FROM cluster(cluster_name, view(SELECT a, b, c FROM table_name));

Essentially querying a remote node or cluster of nodes as ours does, with the parameters inverted; the above variations can be implemented as simple macro aliases on top of the final command we come up with. Also for context, the 'cluster_name' approach seems a good one to follow, allowing users to group servers by cluster_name when selecting from a backends table.

lmangani avatar Oct 23 '24 21:10 lmangani

It's only tangentially related and I don't think the ergonomics in DuckDB afford this yet, but Arrow IPC over HTTP could be really interesting in conjunction

nicosuave avatar Oct 31 '24 16:10 nicosuave

It's only tangentially related and I don't think the ergonomics in DuckDB afford this yet, but Arrow IPC over HTTP could be really interesting in conjunction

@nicosuave I think this might be possible using to_arrow_ipc with BLOB as the type and the url() macro. EDIT: some other interesting example here

lmangani avatar Nov 07 '24 23:11 lmangani

I will take a look at the code, it's either hard (due to read_json not being table-in table-out) or simple.

I'm super curious about table-in, table-out, as I don't believe I understand it or how to use it yet. Thanks in advance!

The other question I have is which extension this needs to reside in. I am not completely sure; possibly http_server could also make sense. It's possible to have cross-extension dependencies (somewhat), say by doing LoadExtension('http_server') on the Load of chsql.

Ultimately you were right: we're going to have a flock function in http_server, alongside the discovery API to fill it up!

lmangani avatar Nov 17 '24 20:11 lmangani