ibis
ibis copied to clipboard
feat: Add implementations for Clickhouse `create_table`, `create_view`, and `create_database`
All of these are accomplishable through raw_sql method but it would be nice for these to be fully implemented. In addition, it would also be useful to have the inverse methods implemented: drop_table, drop_view, and drop_database.
Proposal
From my experience with Clickhouse, and some experience with implementing these for my own use cases, I've outlined a brief propsal for implementing these methods. Feedback would be greatly appreciated.
Create Table
This one would be the most complicated to implement. There would be two major use cases:
- Standard table creation (from
Schemaobject) - Create table from
Exprobject
Implementation for the first use case would also be used in the second.
For the first use case, a Schema object would be passed in. You would be able to grab the name and type for each field of the table from that object.
For the second use case, an Expr object would be passed in. You would be able to grab the name and type of each field from the Expr schema attribute using the same code as the previous use case.
One thing to note for creating a table from an expression: in my experience it is better to call the CREATE TABLE... DDL statement and then separately call the INSERT ... statement instead of CREATE TABLE ... AS ....
You would also need to check to see if the table already exists. If it exists, there are a few ways to replace it. You could create a temp table and once that is complete drop the target table and rename the temp table. This is how dbt-clickhouse handles it but I've found this can present challenges with the ReplicatedMergeTree table engine. Another method is to drop the target table and then run the DDL and insertion statements. The downside to this is that there is a longer period of time where not table with the identifier will be available for users to query.
Backing up a bit, there is also the question of whether a full replace is desirable to the user. Should the user have an option to abandon the process if the table already exists?
There are also a number of additional options that would need to be available to the user when creating a table:
| Option | Description | Argument Type |
|---|---|---|
| Database | The database to create the table in. If not provided, then the session database passed in to connect() should be used |
str |
| Table name | Since not all Expr child objects have a name attribute, this would need to be provided by the user |
str |
| Engine | Clickhouse table engine | str |
| On Cluster | Specify a cluster | str |
| Partition By | Values to partition the table by. Should be a tuple of strings. Could potentially use column objects but that might be more difficult for defining partition keys wrapped in a method (i.e. toYYYYMM(some_timestamp)) |
tuple[str] |
| Order By | Sorting keys | tuple[str] |
| Primary Key | Indexes | tuple[str] |
| Settings | Key-value pairs of additional table settings | dict[str, Any] |
Create View
This one would be much easier to implement and Clickhouse provides the standard CREATE OR REPLACE VIEW ... option. This would just be created from an Expr object.
Only the following options would need to be provided to the user:
| Option | Description | Argument Type |
|---|---|---|
| Database | The database to create the table in. If not provided, then the session database passed in to connect() should be used |
str |
| View name | Since not all Expr child objects have a name attribute, this would need to be provided by the user |
str |
| On Cluster | Specify a cluster | str |
Create Database
This one is likely the lowest priority out of the three. It would also likely be the easiest to implement. The user would need to be presented the following options:
| Option | Description | Argument Type |
|---|---|---|
| Database Name | Database name | str |
| On Cluster | Specify a cluster | str |
| Engine | Clickhouse database engine | str |
IF NOT EXISTS Behavior
It would likely be beneficial to also allow some way to only create the object if it doesn't already exist.
[TODO]
Notes
I'll keep working on this proposal and flesh out some ideas in a project I already have going. I'm pretty sure I've got the codegen down for the create table piece but I'll need to do some more digging to see how it is already handled in Ibis so I'm following the current approaches.
I'm happy to be the one to implement this over a weekend or two but I'd like some feedback before I dive into the code.
An alternative to dropping a table if it already exists is to truncate the table if the schema passed in and the table schema match
@mharrisb1 This is a really great proposal.
I'll go in reverse order to get the easier bits out of the way first.
Let's dive in.
Preliminaries
It'd be great if each of the CREATE TABLE/VIEW/DATABASE are done as three separate feat(clickhouse) PRs! If you want to include the DROP ... operations in each of those PRs that's fine, and a follow-up afterwards is fine too.
Backend-specific Keyword Arguments
We'll likely have to change the base class create_foo signatures to accept **kwargs so that child backend class create_foo implementations can accept backend-specific arguments. That's fine and should be a backwards compatible change.
APIs
create_database
Agree with your assessment here. Probably easy to implement and likely not a heavily used API so not a high priority. That said, it's your proposal, so feel free to prioritize as you like.
create_view
What you've proposed looks good to me. This in theory would also allow the .sql method to work with ClickHouse, which is pretty nice!
In terms of difficulty and priority, I think you're spot-on there as well: easier than create_table and less often used, but more often used than create_database.
create_table
This is where the fun begins 😅.
Implementation for the first use case would also be used in the second.
Sounds good. IIRC, the way this works in the other create_table implementations is to always create the table first. This happens inside a single method body, so it's not literally calling another implementation, but of course that is up to the implementer.
For the first use case, a Schema object would be passed in. You would be able to grab the name and type for each field of the table from that object.
Yup, sounds right! Incidentally, I've added API docs for schema in #4318. The way to iterate over the pairs of name, type is using the .items() method on Schema objects.
For the second use case, an Expr object would be passed in. You would be able to grab the name and type of each field from the Expr schema attribute using the same code as the previous use case.
Yep! You can get the schema of a TableExpr using its .schema() method.
One thing to note for creating a table from an expression: in my experience it is better to call the CREATE TABLE... DDL statement and then separately call the INSERT ... statement instead of CREATE TABLE ... AS ....
It sounds like you've got a lot of ClickHouse experience, so IMO this is entirely up to you.
You would also need to check to see if the table already exists. If it exists, there are a few ways to replace it. You could create a temp table and once that is complete drop the target table and rename the temp table. This is how dbt-clickhouse handles it but I've found this can present challenges with the ReplicatedMergeTree table engine. Another method is to drop the target table and then run the DDL and insertion statements. The downside to this is that there is a longer period of time where not table with the identifier will be available for users to query.
It might make implementation and review of the PR for create_table easier if we forego replacement since we don't yet have an API to do this using create_table. We're working on making this possible but it'll be a bit before we have something.
Backing up a bit, there is also the question of whether a full replace is desirable to the user. Should the user have an option to abandon the process if the table already exists?
Right now, this is the default behavior: force=False will compile to CREATE TABLE whereas force=True will compile to CREATE TABLE IF NOT EXISTS and the backend will fail if the table already exists. I don't believe we have any backends that both support SQL and have REPLACE behavior with CREATE TABLE by default.
Let me know if there's anything else on which you'd like me to give feedback!
Looking forward to your PRs.
@mharrisb1 Thoughts here?
I know @saulpw has been thinking about DDL-in-ibis recently and likely also has some thoughts here.
@cpcloud that all sounds good 👍 thanks for the point-by-point feedback
If @saulpw has a general approach to DDL in Ibis I would love to follow their lead. Either using another backend as a reference or guinea-pig some ideas with Clickhouse.
We've currently have an in-house solution for this so there is no way I can convince my team to use company time to implement this in Ibis so I wouldn't count on this being done (at least well) anytime soon.
I think a good first step here could be creating a topic in Discussions about Ibis + DDL to get some community feedback if people are interested