
Roadmap 2022 (discussion)

Open alexey-milovidov opened this issue 3 years ago • 140 comments

This is the ClickHouse open-source roadmap for 2022. Descriptions and links are to be filled in.

This roadmap does not cover the tasks related to infrastructure, orchestration, documentation, marketing, integrations, SaaS, drivers, etc.

See also:

  • Roadmap 2021: #17623
  • Roadmap 2020: in Russian

Main Tasks

✔️ Make clickhouse-keeper Production Ready

  • ✔️ It is already feature-complete and being used in production.
  • ✔️ Update documentation to replace ZooKeeper with clickhouse-keeper everywhere.

✔️ Support for Backup and Restore

  • ✔️ Backup of tables, databases, servers and clusters.
  • ✔️ Incremental backups. Support for partial restore.
  • ✔️ Support for pluggable backup storage options.

Semistructured Data

  • ✔️ JSON data type with automatic type inference and dynamic subcolumns.
  • ✔️ Sparse column format and optimization of functions for sparse columns. #22535
  • Dynamic selection of column format - full, const, sparse, low cardinality.
  • Hybrid wide/compact data part format for huge number of columns.
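As a rough illustration of the first item (the type was experimental at the time; table and field names here are made up):

```sql
-- Experimental: the object type had to be enabled explicitly.
SET allow_experimental_object_type = 1;

CREATE TABLE events (data JSON) ENGINE = MergeTree ORDER BY tuple();

INSERT INTO events FORMAT JSONEachRow {"data": {"user": "alice", "clicks": 3}}

-- Subcolumns are inferred automatically and can be queried directly.
SELECT data.user, data.clicks FROM events;
```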

✔️ Type Inference for Data Import

  • ✔️ Allow to skip column names and types if data format already contains schema (e.g. Parquet, Avro).
  • ✔️ Allow to infer types for text formats (e.g. CSV, TSV, JSONEachRow).

#32455
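For example (file names are illustrative), schema inference lets the file table function work without an explicit structure argument:

```sql
-- Self-describing format: the schema comes from the file itself.
SELECT * FROM file('data.parquet', Parquet) LIMIT 10;

-- Text format: column types are inferred from a sample of the data.
DESCRIBE file('hits.csv', CSVWithNames);
```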

Support for Transactions

  • Atomic insert of more than one block or to more than one partition into MergeTree and ReplicatedMergeTree tables.
  • Atomic insert into table and dependent materialized views.
  • Atomic insert into multiple tables.
  • Multiple SELECTs from one consistent snapshot.
  • Atomic insert into distributed table.

✔️ Lightweight DELETE

  • ✔️ Make mutations more lightweight by using delete-masks.
  • ✔️ It won't enable frequent UPDATE/DELETE like in OLTP databases, but it will bring ClickHouse closer to that.
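A sketch of the syntax as it eventually landed (experimental at first; the table name is made up):

```sql
SET allow_experimental_lightweight_delete = 1;

-- Rows are masked as deleted immediately and physically removed by later merges.
DELETE FROM visits WHERE user_id = 42;
```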

SQL Compatibility Improvements

  • Untangle name resolution and query analysis.
  • Initial support for correlated subqueries.
  • ✔️ Allow using window functions inside expressions. Add compatibility aliases for some window functions, etc.
  • ✔️ Support for GROUPING SETS.
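For reference, GROUPING SETS computes several groupings in one pass; a minimal sketch with a hypothetical table:

```sql
SELECT region, product, sum(amount) AS total
FROM sales
GROUP BY GROUPING SETS ((region), (product), ())  -- per-region, per-product, and grand total
```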

JOIN Improvements

  • Support for join reordering.
  • Extend the cases when condition pushdown is applicable.
  • Convert anti-join to NOT IN.
  • Use table sorting for DISTINCT optimization.
  • Use table sorting for merge JOIN.
  • Grace hash join algorithm.

Resource Management

  • ✔️ Memory overcommit (soft and hard memory limits).
  • Enable external GROUP BY and ORDER BY by default.
  • IO operations scheduler with priorities.
  • ✔️ Make scalar subqueries accountable.
  • CPU and network priorities.
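Until external aggregation is enabled by default, it can be turned on per query; the thresholds below are illustrative:

```sql
-- Spill aggregation state to disk once it exceeds ~10 GB,
-- instead of failing when the memory limit is reached.
SET max_bytes_before_external_group_by = 10000000000;
SET max_memory_usage = 20000000000;

SELECT key, count() FROM big_table GROUP BY key;
```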

Separation of Storage and Compute

  • ✔️ Parallel reading from replicas.
  • ✔️ Dynamic cluster configuration with service discovery.
  • ✔️ Caching of data from object storage.
  • Simplification of ReplicatedMergeTree.
  • Shared metadata storage.

Experimental and Intern Tasks

Streaming Queries

  • Fix POPULATE for materialized views.
  • Unification of materialized views, live views and window views.
  • Allow to set up subscriptions on top of all tables including Merge, Distributed.
  • Normalization of Kafka tables with storing offsets in ClickHouse.
  • Support for exactly once consumption from Kafka, non-consuming reads and multiple consumers.
  • Streaming queries with GROUP BY, ORDER BY with windowing criteria.
  • Persistent queues on top of ClickHouse tables.
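The experimental WINDOW VIEW feature already hints at this direction; a minimal sketch (table and column names are made up):

```sql
SET allow_experimental_window_view = 1;

-- Counts events per 10-second tumbling window.
CREATE WINDOW VIEW wv AS
SELECT count() AS cnt, tumbleStart(w) AS window_start
FROM events
GROUP BY tumble(ts, INTERVAL '10' SECOND) AS w;
```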

Integration with ML/AI

  • 🗑️ Integration with Tensorflow
  • 🗑️ Integration with MADLib

GPU Support

Compile expressions to GPU

Unique Key Constraint

User Defined Data Types

Incremental aggregation in memory

Key-value data marts

Text Classification

Graph Processing

Foreign SQL Dialects in ClickHouse

Support for MySQL dialect or Apache Calcite as an option.

Batch Jobs and Refreshable Materialized Views

Embedded ClickHouse Engine

Data Hub

Build And Testing Improvements

Testing

  • ✔️ Add tests for AArch64 builds.
  • ✔️ Automated tests for backward compatibility.
  • Server-side query fuzzer for all kinds of tests.
  • Fuzzing of query settings in functional tests.
  • SQL function based fuzzer.
  • Fuzzer of data formats.
  • Integrate with SQLogicTest.
  • Import obfuscated queries from Yandex Metrica.

Builds

  • ✔️ Docker images for AArch64.
  • Enable missing libraries for AArch64 builds.
  • Add and explore Musl builds.
  • Build all libraries with our own CMake files.
  • Embed root certificates to the binary.
  • Embed DNS resolver to the binary.
  • Add ClickHouse to Snap, so people will not install obsolete versions by accident.

alexey-milovidov avatar Dec 10 '21 15:12 alexey-milovidov

Pls don't mind me here. I'm just reserving my spot for update notifications.

ramazanpolat avatar Dec 10 '21 16:12 ramazanpolat

@ramazanpolat you can do it via the "Subscribe" button.

Slach avatar Dec 10 '21 17:12 Slach

What would the embedded ClickHouse engine look like? Would it involve self-contained DB files and instances like DuckDB? That would be pretty great; it would make ClickHouse a good choice for one-off, self-contained projects.

alanpaulkwan avatar Dec 10 '21 20:12 alanpaulkwan

@alanpaulkwan

Yes, something like clickhouse-local but embedded in Python module and with some additional support for dataframes. Pretty similar to DuckDB :) Also should leverage "Type Inference for Data Import".

PS. clickhouse-local already does most of this. With recent ClickHouse versions, if I need to check some queries quickly, I just type clickhouse-local and create tables and run queries interactively.
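A quick interactive session of the kind described above might look like this (the data and queries are purely illustrative):

```sql
-- Inside an interactive clickhouse-local session: no server, no setup.
CREATE TABLE t (x UInt64) ENGINE = Memory;
INSERT INTO t SELECT number FROM numbers(10);
SELECT sum(x) FROM t;
```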

alexey-milovidov avatar Dec 10 '21 20:12 alexey-milovidov

@alexey-milovidov thanks! I'm aware of clickhouse-local. One advantage of DuckDB over ClickHouse is that I need to work with Parquet files, and ad hoc queries are impractical when a schema must be declared up front. So I'm also excited for that.

Will the embedded module also be designed to work well with R? Don't discriminate against us R users please :(

alanpaulkwan avatar Dec 10 '21 20:12 alanpaulkwan

Yes, Python first and R next. This task is in the "experimental" category, so it will be implemented by a developer outside of the main team (by @LGrishin). For tasks in that category we usually have a prototype available by summer.

alexey-milovidov avatar Dec 10 '21 20:12 alexey-milovidov

Hi @alexey-milovidov, I see there was a plan for workload management to deal with concurrency issues in 2021, but it has disappeared in 2022. Why wasn't it finished?

yiguolei avatar Dec 13 '21 01:12 yiguolei

@yiguolei It is named Resource Management in the roadmap.

Yes, it was also on the 2021 roadmap, but we have only started implementing it:

  • (done) an interface for IO schedulers;
  • (done) removing DataStreams in favor of Processors;
  • (in progress) memory overcommit with soft/hard limits;

So, most of the work is expected in 2022.

alexey-milovidov avatar Dec 13 '21 02:12 alexey-milovidov

@alexey-milovidov What about concurrency management? I think there are too many threads during many concurrent queries. Any progress on this?

yiguolei avatar Dec 13 '21 02:12 yiguolei

It is going to be solved by one of the subtasks - a common data processing pipeline for the server; the task is being implemented by @KochetovNicolai.

alexey-milovidov avatar Dec 13 '21 03:12 alexey-milovidov

How about user defined aggregate functions? Or user defined table functions like Snowflake's: https://docs.snowflake.com/en/developer-guide/udf/java/udf-java-tabular-functions.html, which can help users process blocks of data and output a single-row result.

Zhile avatar Dec 13 '21 05:12 Zhile

Also this big task https://github.com/ClickHouse/ClickHouse/issues/23194

Zhile avatar Dec 13 '21 05:12 Zhile

@Zhile

How about user defined aggregation function? Or user defined table function

We already have user defined table functions since version 21.10. They allow custom data generation, transformation, aggregation and even joining with user-defined programs. See https://presentations.clickhouse.com/meetup56/new_features/
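As a side note, the simplest flavor added in 21.10, SQL-expression UDFs, looks like this (the function name is made up; executable UDFs that call external programs are configured separately):

```sql
-- A user defined function as a lambda over an expression.
CREATE FUNCTION plus_one AS (x) -> x + 1;

SELECT plus_one(41);
```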

As for user defined aggregate functions - that's more difficult, we'll see...

Also this big task #23194

This is №1 in:

SQL Compatibility Improvements

  • Untangle name resolution and query analysis.

alexey-milovidov avatar Dec 13 '21 06:12 alexey-milovidov

@alexey-milovidov Thanks for the explanation! I'm looking forward to these new changes in ClickHouse and hoping it keeps getting better!

Zhile avatar Dec 13 '21 06:12 Zhile

@alexey-milovidov

SQL Compatibility Improvements Untangle name resolution and query analysis

Is this the limitation that prevents using recursive UDFs now?

cmsxbc avatar Dec 13 '21 09:12 cmsxbc

Streaming queries with GROUP BY, ORDER BY with windowing criteria.

Wondering if this is somehow related to how KSQL does things, e.g.:

SELECT ...
FROM orders o
        INNER JOIN payments p WITHIN 1 HOURS ON p.id = o.id

I found it quite hard to work with streaming-like data, especially with streams that need to be joined with themselves. Materialized views are one way to do it, but they can't be self-joined. You can do it with Null-table hacks, but the materialized view insertion order (or rather, the lack of one) is still a problem.

Not sure if there is a better way to do this you are thinking about.

javisantana avatar Dec 13 '21 15:12 javisantana

@javisantana

You can look into: https://github.com/ClickHouse/ClickHouse/pull/8331

It's already merged.

UnamedRus avatar Dec 13 '21 15:12 UnamedRus

@cmsxbc Yes, recursive SQL UDFs are difficult; most likely we will not be able to support them in the coming months.

alexey-milovidov avatar Dec 13 '21 17:12 alexey-milovidov

Would Unique Key Constraint allow us to no longer worry about preventing duplicates from being inserted?

bputt avatar Dec 14 '21 04:12 bputt

It would be nice if MaterializedMySQL supported filtering specific tables instead of replicating all tables in the database.

kiwimg avatar Dec 14 '21 05:12 kiwimg

Maybe we can implement Batch Jobs and Refreshable Materialized Views using time window functions? I think the difference between a batch job in Materialized Views and a streaming job in Window Views is whether to calculate and store the intermediate states. We can implement the batch job by removing the calculation of the intermediate state in Window View, and just use the processing time with the time window function to trigger windows.

Vxider avatar Dec 15 '21 02:12 Vxider

Could these be feasible?

  • Array Join: https://github.com/ClickHouse/ClickHouse/issues/8687
  • Full text search: https://github.com/ClickHouse/ClickHouse/issues/19970

bkuschel avatar Dec 16 '21 19:12 bkuschel

Any work on Subpartition/Dynamic Partition planned?

  • https://github.com/ClickHouse/ClickHouse/issues/8089
  • https://github.com/ClickHouse/ClickHouse/issues/13826
  • https://github.com/ClickHouse/ClickHouse/issues/16565
  • https://github.com/ClickHouse/ClickHouse/issues/18695

simpl1g avatar Dec 21 '21 12:12 simpl1g

@UnamedRus thanks, do you have a "real world" example? The PR lacks documentation right now, so it'd be nice to see how you are using it.

javisantana avatar Dec 22 '21 09:12 javisantana

@javisantana The WindowView documentation has been added here

Vxider avatar Dec 22 '21 09:12 Vxider

@Vxider thanks, referring to my original comment, I don't see how this window MV solves the problem of generating MV from stream data that need to join with itself (or other streams) to get the previous state of incoming entities. (more in this line -> https://calcite.apache.org/docs/stream.html#joining-streams-to-streams )

javisantana avatar Dec 22 '21 09:12 javisantana

If PostgreSQL were used as the SQL layer, it would be perfect. :)

thomasdba avatar Dec 30 '21 02:12 thomasdba

@Vxider thanks, referring to my original comment, I don't see how this window MV solves the problem of generating MV from stream data that need to join with itself (or other streams) to get the previous state of incoming entities. (more in this line -> https://calcite.apache.org/docs/stream.html#joining-streams-to-streams )

@javisantana Yes, window view does not support joining streams to tables/streams yet; I'll try to implement join-to-table first.

Vxider avatar Dec 31 '21 07:12 Vxider

@bputt

Would Unique Key Constraint allow us to no longer worry about preventing duplicates from being inserted?

Yes. But keep in mind that this is in the list of experimental tasks, not in the main list. The basic idea is simple:

  • make in-memory + on-disk data structure to keep unique keys (possibly hashed to 128 bit; possibly with retention options; possibly with approximate variants), rocksdb may suffice;
  • put the updates to this data structure under RAFT (it is individual per replicated table and has a custom data model, in contrast to clickhouse-keeper; but now that clickhouse-keeper is working, this task is easy).

alexey-milovidov avatar Jan 01 '22 12:01 alexey-milovidov

@kiwimg

Materialized MySQL supports table filtering instead of all tables in the database

Improvements for MaterializedMySQL are not on the roadmap and this is something of a "side-feature", so most likely it's not for my team. This task could easily be implemented by the community. For example, @stigsb may be interested.

alexey-milovidov avatar Jan 01 '22 12:01 alexey-milovidov