Roadmap 2022 (discussion)
This is ClickHouse open-source roadmap 2022. Descriptions and links to be filled.
This roadmap does not cover the tasks related to infrastructure, orchestration, documentation, marketing, integrations, SaaS, drivers, etc.
See also:
- Roadmap 2021: #17623
- Roadmap 2020: in Russian
Main Tasks
✔️ Make clickhouse-keeper Production Ready
- ✔️ It is already feature-complete and being used in production.
- ✔️ Update documentation to replace ZooKeeper with clickhouse-keeper everywhere.
✔️ Support for Backup and Restore
- ✔️ Backup of tables, databases, servers and clusters.
- ✔️ Incremental backups.
- Support for partial restore.
- ✔️ Support for pluggable backup storage options.
Semistructured Data
- ✔️ JSON data type with automatic type inference and dynamic subcolumns.
- ✔️ Sparse column format and optimization of functions for sparse columns. #22535
- Dynamic selection of column format: full, const, sparse, low cardinality.
- Hybrid wide/compact data part format for a huge number of columns.
✔️ Type Inference for Data Import
- ✔️ Allow to skip column names and types if the data format already contains the schema (e.g. Parquet, Avro).
- ✔️ Allow to infer types for text formats (e.g. CSV, TSV, JSONEachRow). #32455
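A toy sketch of what type inference over text samples could look like (hypothetical code, far simpler than ClickHouse's actual inference, which also handles dates, arrays, nested JSON, and more): try parsers from most to least specific and pick the narrowest type all samples agree on.

```python
# Hypothetical type inference for text formats: pick the narrowest type
# that every sampled value can be parsed as, falling back to String.

def infer_type(samples):
    def is_int(s):
        try:
            int(s)
            return True
        except ValueError:
            return False

    def is_float(s):
        try:
            float(s)
            return True
        except ValueError:
            return False

    if all(is_int(s) for s in samples):
        return 'Int64'
    if all(is_float(s) for s in samples):
        return 'Float64'
    return 'String'

assert infer_type(['1', '2', '42']) == 'Int64'
assert infer_type(['1', '2.5']) == 'Float64'
assert infer_type(['1', 'abc']) == 'String'
```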
Support for Transactions
- Atomic insert of more than one block, or into more than one partition, of MergeTree and ReplicatedMergeTree tables.
- Atomic insert into a table and its dependent materialized views.
- Atomic insert into multiple tables.
- Multiple SELECTs from one consistent snapshot.
- Atomic insert into a distributed table.
✔️ Lightweight DELETE
- ✔️ Make mutations more lightweight by using delete masks.
- ✔️ It won't enable frequent UPDATE/DELETE like in OLTP databases, but it brings ClickHouse closer.
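The delete-mask idea can be sketched as follows (a simplified model, not the actual MergeTree implementation): a DELETE only writes a per-row boolean mask, reads filter masked rows out, and the heavy rewrite is deferred to background merges.

```python
# Simplified model of lightweight DELETE via a per-part delete mask.

class Part:
    def __init__(self, rows):
        self.rows = rows
        self.deleted = [False] * len(rows)  # the lightweight delete mask

    def delete_where(self, pred):
        # A DELETE is just an O(n) mask update, no data rewrite.
        for i, row in enumerate(self.rows):
            if pred(row):
                self.deleted[i] = True

    def read(self):
        # Reads skip masked rows.
        return [r for r, d in zip(self.rows, self.deleted) if not d]

    def merge_compact(self):
        # A background merge physically drops the masked rows.
        self.rows = self.read()
        self.deleted = [False] * len(self.rows)

part = Part([{'id': 1}, {'id': 2}, {'id': 3}])
part.delete_where(lambda r: r['id'] == 2)
assert part.read() == [{'id': 1}, {'id': 3}]
```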
SQL Compatibility Improvements
- Untangle name resolution and query analysis.
- Initial support for correlated subqueries.
- ✔️ Allow using window functions inside expressions. Add compatibility aliases for some window functions, etc.
- ✔️ Support for GROUPING SETS.
JOIN Improvements
- Support for join reordering.
- Extend the cases where condition pushdown is applicable.
- Convert anti-join to NOT IN.
- Use table sorting for DISTINCT optimization.
- Use table sorting for merge JOIN.
- Grace hash join algorithm.
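The grace hash join item can be sketched like this (hypothetical, simplified code): both inputs are partitioned by join-key hash into buckets small enough to fit in memory, and each bucket pair is then joined independently with an in-memory hash table. In the real algorithm the buckets are spilled to disk; this sketch only shows the partitioning logic.

```python
# Simplified grace hash join: partition both sides by key hash, then
# hash-join each bucket pair independently.

NUM_BUCKETS = 4

def partition(rows, key):
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for row in rows:
        buckets[hash(row[key]) % NUM_BUCKETS].append(row)
    return buckets

def grace_hash_join(left, right, key):
    result = []
    for lb, rb in zip(partition(left, key), partition(right, key)):
        table = {}
        for row in lb:  # build side
            table.setdefault(row[key], []).append(row)
        for row in rb:  # probe side
            for match in table.get(row[key], []):
                result.append({**match, **row})
    return result

left = [{'id': 1, 'a': 'x'}, {'id': 2, 'a': 'y'}]
right = [{'id': 2, 'b': 'z'}]
assert grace_hash_join(left, right, 'id') == [{'id': 2, 'a': 'y', 'b': 'z'}]
```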
Resource Management
- ✔️ Memory overcommit (soft and hard memory limits).
- Enable external GROUP BY and ORDER BY by default.
- IO operations scheduler with priorities.
- ✔️ Make scalar subqueries accountable.
- CPU and network priorities.
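A sketch of how soft/hard limits with overcommit could interact (an assumed, simplified model, not the actual implementation): a query may exceed its soft limit while the server has headroom, and under memory pressure the query that has overcommitted the most beyond its soft limit is chosen as the victim.

```python
# Simplified victim selection for memory overcommit with soft/hard limits.

def pick_victim(queries, server_hard_limit):
    """queries: list of dicts with 'name', 'used', 'soft_limit' (bytes)."""
    total = sum(q['used'] for q in queries)
    if total <= server_hard_limit:
        return None  # no pressure: overcommit beyond soft limits is allowed
    overcommitted = [q for q in queries if q['used'] > q['soft_limit']]
    if not overcommitted:
        return None
    # Kill the query furthest beyond its soft limit.
    return max(overcommitted, key=lambda q: q['used'] - q['soft_limit'])['name']

queries = [
    {'name': 'q1', 'used': 10, 'soft_limit': 20},
    {'name': 'q2', 'used': 50, 'soft_limit': 15},
    {'name': 'q3', 'used': 30, 'soft_limit': 25},
]
assert pick_victim(queries, 100) is None   # total 90 fits under the hard limit
assert pick_victim(queries, 80) == 'q2'    # q2 is 35 over its soft limit
```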
Separation of Storage and Compute
- ✔️ Parallel reading from replicas.
- ✔️ Dynamic cluster configuration with service discovery.
- ✔️ Caching of data from object storage.
- Simplification of ReplicatedMergeTree.
- Shared metadata storage.
Experimental and Intern Tasks
Streaming Queries
- Fix POPULATE for materialized views.
- Unification of materialized views, live views and window views.
- Allow to set up subscriptions on top of all tables, including Merge and Distributed.
- Normalization of Kafka tables with storing offsets in ClickHouse.
- Support for exactly-once consumption from Kafka, non-consuming reads and multiple consumers.
- Streaming queries with GROUP BY and ORDER BY with windowing criteria.
- Persistent queues on top of ClickHouse tables.
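The streaming GROUP BY item can be sketched as a tumbling-window aggregation (a hypothetical model, not a ClickHouse API): events are bucketed by time window and key, and a window's result is emitted once the stream's watermark passes the window's end.

```python
# Hypothetical tumbling-window streaming GROUP BY: count events per key
# per window; emit a window once the watermark moves past its end.

from collections import defaultdict

WINDOW = 60  # window size in seconds

def tumbling_window_counts(events):
    """events: (timestamp, key) pairs in arrival order; yields closed windows."""
    state = defaultdict(int)  # (window_start, key) -> count
    watermark = 0
    for ts, key in events:
        window_start = ts - ts % WINDOW
        state[(window_start, key)] += 1
        watermark = max(watermark, ts)
        # Emit every window whose end is behind the watermark.
        closed = sorted(wk for wk in state if wk[0] + WINDOW <= watermark)
        for ws, grp_key in closed:
            yield ws, grp_key, state.pop((ws, grp_key))

events = [(5, 'a'), (20, 'a'), (30, 'b'), (65, 'a')]
assert list(tumbling_window_counts(events)) == [(0, 'a', 2), (0, 'b', 1)]
```

The still-open window starting at 60 is retained in state: a streaming engine keeps partial aggregates until the window can no longer receive late events.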
Integration with ML/AI
- 🗑️ Integration with TensorFlow
- 🗑️ Integration with MADlib
GPU Support
Compile expressions to GPU
Unique Key Constraint
User Defined Data Types
Incremental aggregation in memory
Key-value data marts
Text Classification
Graph Processing
Foreign SQL Dialects in ClickHouse
Support for MySQL dialect or Apache Calcite as an option.
Batch Jobs and Refreshable Materialized Views
Embedded ClickHouse Engine
Data Hub
Build And Testing Improvements
Testing
- ✔️ Add tests for AArch64 builds.
- ✔️ Automated tests for backward compatibility.
- Server-side query fuzzer for all kinds of tests.
- Fuzzing of query settings in functional tests.
- SQL-function-based fuzzer.
- Fuzzer of data formats.
- Integrate with SQLLogicTest.
- Import obfuscated queries from Yandex Metrica.
Builds
- ✔️ Docker images for AArch64.
- Enable missing libraries for AArch64 builds.
- Add and explore Musl builds.
- Build all libraries with our own CMake files.
- Embed root certificates into the binary.
- Embed a DNS resolver into the binary.
- Add ClickHouse to Snap, so people will not accidentally install obsolete versions.
Pls don't mind me here. I'm just reserving my spot for update notifications.
@ramazanpolat you can do that via the "Subscribe" button.
What would the embedded Clickhouse engine look like - would it involve self-contained DB files and instances like DuckDB? This would be pretty great, it would make Clickhouse a good choice for one-off, self-contained projects.
@alanpaulkwan
Yes, something like clickhouse-local, but embedded in a Python module and with some additional support for dataframes. Pretty similar to DuckDB :) It should also leverage "Type Inference for Data Import".

PS. clickhouse-local already does most of this. With recent ClickHouse versions, if I need to check some queries quickly, I just type clickhouse-local and create tables and run queries interactively.
@alexey-milovidov thanks! I'm aware of clickhouse-local. One advantage of DuckDB over ClickHouse is when I need to work with Parquet files: ad hoc queries are impractical when the schema must be specified by hand. So I'm also excited for that.
Will the embedded module also be designed to work well with R? Don't discriminate against us R users please :(
Yes, Python first and R next. This task is in the "experimental" category, so it will be implemented by a developer outside the main team (@LGrishin). For tasks in that category we usually have a prototype available by summer.
Hi @alexey-milovidov, I see there was a plan for Workload Management to deal with concurrency issues in 2021, but it has disappeared in 2022. Why not finish it?
@yiguolei It is named Resource Management in the roadmap.
Yes, it was also planned for 2021, but we have only started implementing it:
- (done) an interface for IO schedulers;
- (done) removing DataStreams in favor of Processors;
- (in progress) memory overcommit with soft/hard limits;
So, most of the work is expected in 2022.
@alexey-milovidov What about concurrency management? I think there are too many threads during highly concurrent query loads. Any progress on this?
It is going to be solved by one of the subtasks: a common data processing pipeline for the server. The task is being implemented by @KochetovNicolai.
How about user-defined aggregate functions? Or user-defined table functions like Snowflake's: https://docs.snowflake.com/en/developer-guide/udf/java/udf-java-tabular-functions.html These let users process blocks of data and output a single-row result.
Also this big task https://github.com/ClickHouse/ClickHouse/issues/23194
@Zhile
How about user defined aggregation function? Or user defined table function
We have had user-defined table functions since version 21.10. They allow custom data generation, transformation, aggregation and even joining with user-defined programs. See https://presentations.clickhouse.com/meetup56/new_features/
For user-defined aggregate functions it's more difficult; we'll see...
Also this big task #23194
This is №1 in:
SQL Compatibility Improvements
- Untangle name resolution and query analysis.
@alexey-milovidov Thanks for your explanation. I'm looking forward to those new changes in ClickHouse and hope it keeps getting better!
@alexey-milovidov
SQL Compatibility Improvements Untangle name resolution and query analysis
Is this the limitation that prevents using recursive UDFs now?
Streaming queries with GROUP BY, ORDER BY with windowing criteria.
Wondering if this is somehow related to how KSQL does things, e.g.:

```sql
SELECT ...
FROM orders o
INNER JOIN payments p WITHIN 1 HOURS ON p.id = o.id
```
I find it quite hard to work with streaming-like data, especially with streams that need to be joined with themselves. Materialized views are one way to do it, but they can't be self-joined. You can work around that with Null table hacks, but the lack of a defined materialized view insertion order is still a problem.
Not sure if there is a better way to do this you are thinking about.
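For reference, the semantics of a time-bounded stream-stream join like the KSQL query above can be sketched in batch form (hypothetical code; a real streaming engine would buffer each side and evict rows older than the bound instead of scanning everything):

```python
# Batch sketch of "join within 1 hour" semantics: rows match when their
# keys are equal and their timestamps differ by at most the bound.

BOUND = 3600  # seconds, i.e. WITHIN 1 HOURS

def interval_join(orders, payments):
    """Both inputs: (timestamp, id) tuples. Returns matched (id, ot, pt)."""
    matches = []
    for ot, oid in orders:
        for pt, pid in payments:
            if oid == pid and abs(ot - pt) <= BOUND:
                matches.append((oid, ot, pt))
    return matches

orders = [(100, 'o1'), (200, 'o2')]
payments = [(500, 'o1'), (99999, 'o2')]
# o1's payment arrives within the bound; o2's does not.
assert interval_join(orders, payments) == [('o1', 100, 500)]
```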
@javisantana
You can look into: https://github.com/ClickHouse/ClickHouse/pull/8331
It's already merged.
@cmsxbc Yes, recursive SQL UDFs are difficult; most likely we will not be able to support them in the coming months.
Would Unique Key Constraint allow us to no longer worry about preventing duplicates from being inserted?
MaterializedMySQL should support table filtering instead of replicating all tables in the database.
Maybe we can implement Batch Jobs and Refreshable Materialized Views using time window functions? I think the difference between a batch job in a materialized view and a streaming job in a window view is whether intermediate state is calculated and stored. We could implement batch jobs by removing the calculation of intermediate state in Window View and just using processing time with the time window function to trigger windows.
Could these be feasible?
- Array Join: https://github.com/ClickHouse/ClickHouse/issues/8687
- Full text search: https://github.com/ClickHouse/ClickHouse/issues/19970
Any work on Subpartition/Dynamic Partition planned?
- https://github.com/ClickHouse/ClickHouse/issues/8089
- https://github.com/ClickHouse/ClickHouse/issues/13826
- https://github.com/ClickHouse/ClickHouse/issues/16565
- https://github.com/ClickHouse/ClickHouse/issues/18695
@UnamedRus thanks, do you have a "real world" example? The PR lacks documentation right now, so it would be nice to see how you are using it.
@javisantana WindowView documentation has been added here
@Vxider thanks, referring to my original comment, I don't see how this window MV solves the problem of generating MV from stream data that need to join with itself (or other streams) to get the previous state of incoming entities. (more in this line -> https://calcite.apache.org/docs/stream.html#joining-streams-to-streams )
If pg were used as the SQL layer, it would be perfect. :)
@javisantana Yes, window view does not support joining streams to tables/streams now. I'll try to implement join-to-table first.
@bputt
Would Unique Key Constraint allow us to no longer worry about preventing duplicates from being inserted?
Yes. But keep in mind that this is in the list of experimental tasks, not in the main list. The basic idea is simple:
- make in-memory + on-disk data structure to keep unique keys (possibly hashed to 128 bit; possibly with retention options; possibly with approximate variants), rocksdb may suffice;
- put the updates to this data structure under RAFT (it is individual per replicated table and has custom data model, in contrast to clickhouse-keeper; but after we made clickhouse-keeper working, this task is easy).
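A minimal sketch of the data structure from the first bullet (hypothetical, in-memory only; the real design would persist to disk, e.g. via RocksDB, and replicate updates through RAFT as in the second bullet):

```python
# Hypothetical unique-key index: keys are hashed to 128 bits and kept in
# a set; an insert is rejected (or deduplicated) when the hash is present.

import hashlib

class UniqueKeyIndex:
    def __init__(self):
        # In-memory stand-in for the in-memory + on-disk structure.
        self.hashes = set()

    @staticmethod
    def _hash128(key):
        # 128-bit hash of the key, as suggested in the bullet above.
        return hashlib.blake2b(key.encode(), digest_size=16).digest()

    def try_insert(self, key):
        h = self._hash128(key)
        if h in self.hashes:
            return False  # duplicate: reject or skip the row
        self.hashes.add(h)
        return True

idx = UniqueKeyIndex()
assert idx.try_insert('user:42') is True
assert idx.try_insert('user:42') is False
assert idx.try_insert('user:43') is True
```

Storing 128-bit hashes instead of full keys bounds memory per key, at the cost of a vanishingly small false-duplicate probability.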
@kiwimg
MaterializedMySQL should support table filtering instead of replicating all tables in the database.
Improvements for MaterializedMySQL are not on the roadmap, and this is something of a "side feature", so most likely it's not for my team. This task can easily be implemented by the community. For example, @stigsb may be interested.