metabase-clickhouse-driver icon indicating copy to clipboard operation
metabase-clickhouse-driver copied to clipboard

CSV uploads

Open calherries opened this issue 3 months ago • 1 comments

Draft PR: this currently only supports on-premise single node deployments, but the goal is to support all deployment types.

Summary

Resolves https://github.com/ClickHouse/metabase-clickhouse-driver/issues/230

This PR adds support for the uploads driver feature for ClickHouse.

Limitations:

  • OffsetDateTime values in the CSV e.g. "2022-01-01T00:00:00+01" are inserted as strings, while in other officially supported drivers they would be inserted as a type with a time zone. This is blocked by #200
  • There is no automatically generated primary key column. This is because ClickHouse doesn't support an auto-incrementing type like SERIAL in PostgreSQL.

Checklist

Delete items not relevant to your PR:

  • [x] Unit and integration tests covering the common scenarios were added
  • [ ] A human-readable description of the changes was provided to include in CHANGELOG
  • [ ] For significant changes, documentation in https://github.com/ClickHouse/clickhouse-docs was updated with further explanations or tutorials

calherries avatar May 10 '24 19:05 calherries

CLA assistant check
All committers have signed the CLA.

CLAassistant avatar May 10 '24 19:05 CLAassistant

@slvrtrn I have a few questions: (1) Is there a way to run tests with ClickHouse Cloud deployments? (2) Is there an alternative way to determine the number of replicas without using system.macros, as you suggested here? I can create a ClickHouse cloud user without sufficient privileges to select from the system.macros table.

Executing this query

SELECT count(*) AS nodes_count FROM system.clusters c JOIN system.macros m ON c.cluster = m.substitution

gives me this error

> Not enough privileges. To execute this query, it's necessary to have the grant SELECT(substitution) ON system.macros

(3) If there's no solution to (2), would you be willing to accept this contribution if it only worked for ClickHouse Cloud in the meantime? That's the deployment type we care about most. With this commit I've updated the PR so we only support ClickHouse Cloud.

calherries avatar May 15 '24 02:05 calherries

(1) — You can provide the hostname and port via the env variables; IIRC, it won't work correctly unless you add wait_end_of_query=1 to the JDBC connection string, cause it will not wait until the tables are created on all replicas, and the driver might immediately call the wrong one.

(2) - I am only aware of the mentioned method

I can create a ClickHouse cloud user without sufficient privileges to select from the system.macros table.

You don't need this in CH Cloud. It's necessary only for on-premise. But yes, these grants are one of the main pain points.

(3) It will work for single-node and ClickHouse Cloud, then. But we need a way to disable it for the on-premise cluster if it's detected. Can we do that dynamically? It's the feature that is enabled or disabled on the driver level, not on the connection level.

slvrtrn avatar May 15 '24 09:05 slvrtrn

(1) Thanks, I can run the tests locally with ClickHouse Cloud. I should have been more specific with my question: Is there a way to run the tests that run with Github Actions with ClickHouse Cloud deployments?

(2) Ok thanks for clarifying

(3) How can we know whether it's an on-premise cluster vs on-premise single node without getting the node_count? You only suggested an approach using node_count in your comment here.

is_cloud = 0, nodes_count = 1 -> on-premise single node

calherries avatar May 15 '24 13:05 calherries

@slvrtrn tagging you because you might have missed my last comment above, where I didn't tag you

calherries avatar May 15 '24 18:05 calherries

@calherries (1) - currently, no. We need to add this; I will do that as a follow-up (after both this and #232 are merged). (3) - this is the best option I am aware of. I will clarify if there is anything else available, but chances are that that's it. The grants issue is annoying, but it's necessary only for on-premise deployments (and not CH Cloud, as you don't need this query there); we can add a doc entry about the required grants (after we add on-prem clusters support as a follow-up PR).

I have another question, though. Is there a way to disable a particular feature for a specific connection? These are defined in the driver/database-supports? override, but a cleaner solution would've been to enable it based on the deployment type (i.e. after we have the connection details). This is useful in this PR atm and also in #232 cause the impersonation (another feature defined via an override) should be enabled only on the most recent CH version.

slvrtrn avatar May 15 '24 19:05 slvrtrn

(3) Okay, for now I'll support CH Cloud only.

Is there a way to disable a particular feature for a specific connection?

Yes, we generally use driver/database-supports? for this, you can have the result depend on the version or the connection details. See this example with MongoDB. I've also done a similar thing for this PR in the last commit 6b8935b.

I think this PR is ready to be reviewed now, but we'll need to wait for the changes in this PR to be released in the 49 branch before there is a released Metabase version that is compatible with this.

calherries avatar May 16 '24 03:05 calherries

For this:

A human-readable description of the changes was provided to include in CHANGELOG

What version should these changes under in the changelog?

calherries avatar May 16 '24 04:05 calherries

What version should these changes under in the changelog?

I will add this, don't worry. Otherwise, there will be conflicts with #232

you can have the result depend on the version or the connection details. See this example with MongoDB. I've also done a similar thing for this PR in the last commit 6b8935b.

Thanks, this is great.

slvrtrn avatar May 16 '24 12:05 slvrtrn

btw

Resolves https://github.com/ClickHouse/metabase-clickhouse-driver/issues/230

It doesn't; IIUC, the user that requested this feature, uses on-prem.

slvrtrn avatar May 16 '24 13:05 slvrtrn

@calherries, some tests are failing on the CI.

There is one more thing. In my issue comment, I mentioned

To immediately get the data from any replica regardless of its sync status, it’s also better to add select_sequential_consistency=1. This will guarantee that immediately after the insertion, we can query any node in the cluster and get the data back.

This is an absolute necessity if we add the Cloud test run cause we need to immediately get the test data back after the table is created; otherwise, we will experience weird flakiness during the runs. This also can happen in real-world usage (albeit more rarely).

slvrtrn avatar May 16 '24 13:05 slvrtrn