
Can we get a larger dataset?

Open alex-thc opened this issue 2 years ago • 1 comments

Is it possible to get a larger dataset, say 2 TB or 5 TB? Testing on a 200 GB dataset that is easily compressible down to 50 GB with modern compression algorithms might exclude disk IO from the equation on systems with large caches (even if those are simple disk caches).

alex-thc avatar Jul 17 '23 23:07 alex-thc

There is a large catalog of prepared datasets: https://clickhouse.com/docs/en/getting-started/example-datasets

For example, these datasets are over 1 TB uncompressed:

  • Reddit comments;
  • YouTube likes;
  • GitHub events;
  • Wikipedia page views;
  • Environmental Sensors Data;

They can be loaded into ClickHouse in a few hours. There is also a list of queries: https://github.com/ClickHouse/github-explorer/blob/main/queries.sql
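For example, a query in the spirit of that list might look like the following (illustrative only; the `github_events` table with `repo_name` and `event_type` columns is assumed from the schema described in the example-datasets documentation, not taken from queries.sql itself):

```sql
-- Illustrative sketch: top repositories by number of stars,
-- assuming the github_events schema from the example-datasets docs.
SELECT repo_name, count() AS stars
FROM github_events
WHERE event_type = 'WatchEvent'
GROUP BY repo_name
ORDER BY stars DESC
LIMIT 10
```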

But these datasets are not used in ClickBench, because testing all ~30 database management systems would be too slow.

For example, if you try to load Wikipedia page views (a typical time-series dataset) into TimescaleDB (a typical time-series DBMS), it will take months, making the benchmark impractical. If you try to load it into DuckDB, it will not load, because DuckDB is not a production-quality database. If you try to use Druid or Pinot, you will need a long time to recover from the PTSD.

> Testing on a 200GB data set that is easily compressible down to 50GB with modern compression algorithms might exclude disk IO from the equation on systems with large caches (even if those are simple disk caches)

In fact, ClickHouse compresses it to only 9.28 GB. But the benchmark methodology requires one cold run with flushed caches, so that it also tests the IO subsystem. Also keep in mind that it requires the use of a 500 GB gp2 EBS volume, which has a well-known IO profile (tl;dr: it is slow).
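The compressed and uncompressed sizes can be inspected directly in ClickHouse via the `system.parts` table (a sketch; the table name `hits` is assumed to match the ClickBench dataset):

```sql
-- Compare on-disk (compressed) vs logical (uncompressed) size of a table.
SELECT
    table,
    formatReadableSize(sum(data_compressed_bytes)) AS compressed,
    formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed,
    round(sum(data_uncompressed_bytes) / sum(data_compressed_bytes), 2) AS ratio
FROM system.parts
WHERE active AND table = 'hits'
GROUP BY table
```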

alexey-milovidov avatar Jul 24 '23 02:07 alexey-milovidov