
test: introduce ch-benchmark

nanderstabel opened this pull request 2 years ago • 6 comments

I hereby agree to the terms of the Singularity Data, Inc. Contributor License Agreement.

What's changed and what's your intention?

To introduce a CH-benCHmark pipeline from Postgres to RisingWave.

The CH-benCHmark relies on both TPC-H and TPC-C tables.

In this PR:

  • postgres->debezium->kafka->RisingWave
  • Not everything is automated yet in CI --> help needed
  • In ci/ch-benchmark/transactions/example_update.sql two very simple update transactions are created
  • In e2e_test/streaming/ch-benchmark/create_sources.slt.part materialized sources are created for all TPC-C tables
  • e2e_test/streaming/ch-benchmark/ also provides two runnable CH-benCHmark tests (q13 and q16)
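The pipeline ships row changes from Postgres to RisingWave as Debezium-formatted JSON over Kafka. As a rough sketch of what flows through, here is a simplified change event (field names follow Debezium's envelope convention; a real record also carries "source", "ts_ms", and schema metadata, and the column names here are illustrative):

```shell
# Illustrative Debezium change event for a tpcc_ table row (simplified).
record='{"before":null,"after":{"c_id":1,"c_name":"Alice"},"op":"c"}'

# "op" encodes the change type: c = create, u = update, d = delete, r = snapshot read.
op=$(echo "$record" | sed -n 's/.*"op":"\([a-z]\)".*/\1/p')
echo "operation: $op"   # → operation: c
```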

Demonstration

# Start docker containers and initialize debezium and postgres
docker compose -f ci/docker-compose.yml up -d && sleep 5
docker exec -it ci-debezium-1 sh /connectors/create-postgres-debezium-connector.sh >/dev/null 2>&1
docker cp ci/ch-benchmark/transactions/ ci-postgres-1:/transactions/
docker exec -u postgres ci-postgres-1 psql ch_benchmark_db postgres -f /transactions/example_update.sql >/dev/null 2>&1

# Start RW cluster
./risedev d

# Create and insert TPC-H tables
./risedev slt -p 4566 -d dev ./e2e_test/tpch/create_tables.slt.part
./risedev slt -p 4566 -d dev ./e2e_test/tpch/**/insert_*.slt.part

# Create materialized sources
./risedev slt -p 4566 -d dev ./e2e_test/streaming/ch-benchmark/create_sources.slt.part

# Run two CH-benCHmark queries
./risedev slt -p 4566 -d dev ./e2e_test/streaming/ch-benchmark/q13.slt.part
./risedev slt -p 4566 -d dev ./e2e_test/streaming/ch-benchmark/q16.slt.part
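The demonstration runs the two query files one by one; if more CH-benCHmark queries are translated later, the same invocation generalizes to a loop. A minimal sketch (the file layout is assumed from the paths above; `echo` is used so the sketch runs without a live cluster):

```shell
# Hypothetical helper: run every translated CH-benCHmark query file in sequence.
# Replace 'echo' with 'eval' to actually execute against a running RW cluster.
for q in q13 q16; do
  cmd="./risedev slt -p 4566 -d dev ./e2e_test/streaming/ch-benchmark/${q}.slt.part"
  echo "$cmd"
done
```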

Notes

  • TPC-C tables are explicitly prefixed with tpcc_, to avoid confusion since both TPC-H and TPC-C define customer and order(s) tables.
  • The original CH-benCHmark queries use the mathematical function MOD(); these calls are currently replaced with the modulo operator %.
  • The column name prefix of the tpcc_stock table is changed from s_ to st_ to distinguish it from the supplier column prefix (in TPC-H).
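The MOD() → % substitution is a direct swap for the non-negative integer operands these queries use. A quick sanity check with shell arithmetic (the operands are made-up values, not taken from the actual queries):

```shell
# MOD(a, b) in the original queries becomes a % b.
# For non-negative integers the two are equivalent:
echo $(( 12345 % 10000 ))   # → 2345
echo $(( 7 % 3 ))           # → 1
```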

Checklist

  • [x] I have written necessary rustdoc comments
  • [x] I have added necessary unit tests and integration tests
  • [x] All checks passed in ./risedev check (or alias, ./risedev c)

nanderstabel avatar Sep 06 '22 23:09 nanderstabel

Codecov Report

Merging #5155 (017c054) into main (5587547) will decrease coverage by 0.01%. The diff coverage is 70.19%.

@@            Coverage Diff             @@
##             main    #5155      +/-   ##
==========================================
- Coverage   74.26%   74.25%   -0.02%     
==========================================
  Files         906      906              
  Lines      142634   142970     +336     
==========================================
+ Hits       105925   106159     +234     
- Misses      36709    36811     +102     
Flag Coverage Δ
rust 74.25% <70.19%> (-0.02%) :arrow_down:

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
src/batch/src/executor/expand.rs 53.33% <ø> (ø)
src/batch/src/executor/filter.rs 86.63% <ø> (ø)
src/batch/src/executor/join/nested_loop_join.rs 91.90% <ø> (ø)
src/batch/src/executor/join/sort_merge_join.rs 79.12% <ø> (ø)
src/batch/src/executor/order_by.rs 95.21% <ø> (ø)
src/batch/src/executor/row_seq_scan.rs 17.59% <0.00%> (+0.08%) :arrow_up:
src/batch/src/executor/sys_row_seq_scan.rs 0.00% <ø> (ø)
src/batch/src/executor/top_n.rs 79.72% <ø> (ø)
src/batch/src/executor/values.rs 97.68% <ø> (-0.02%) :arrow_down:
src/common/src/array/stream_chunk.rs 85.00% <0.00%> (ø)
... and 121 more


codecov[bot] avatar Sep 06 '22 23:09 codecov[bot]

Seems like there're some issues to be resolved. Would you please open a tracking issue for this? Like #1380.

BugenZhao avatar Sep 07 '22 02:09 BugenZhao

Looks like only q13 and q16 passed?

Most queries do pass (run without panic) but produce no output, due to the limited amount of data in the TPC-C tables in the Postgres database container. q13 and q16 run successfully as 'statements', but the change data in this PR consists only of READs, not yet a whole bunch of UPDATEs/DELETEs/INSERTs. The updates in the simple transaction query I added in ci/ch-benchmark/transactions/example_update.sql are actually not even parsed (and thus skipped) by RisingWave. I added these two queries mainly to demonstrate that the whole pipeline is functional; the actual content of the queries is not that important right now, I guess.

In my opinion, the logical steps for this benchmark are:

  1. Build the postgres->debezium->kafka->RisingWave pipeline (this PR)
  2. Add this pipeline to the CI benchmark pipeline (Sumit will help me with this)
  3. Add a whole bunch of transactions that include UPDATEs/DELETEs/INSERTs. We/I need to decide how complex to make these transactions. Probably the best way to go is to start off with some very simple transactions to catch the first bugs. At a later stage we can decide to add the actual 'official' TPC-C transactions, which consist of multiple inserts/updates and randomly generated data (more details: https://www.tpc.org/tpc_documents_current_versions/pdf/tpc-c_v5.11.0.pdf#page=28).
  4. After these steps are completed, it makes sense to add more of the CH-benCHmark queries. I have already 'translated' these queries locally. The main goal of this step would be to curate specific input data for each of those queries so that they produce actual output. However, like @fuyufjh mentioned: the priority is to catch any bugs that arise just by ingesting update/delete input from Kafka, which is unrelated to the actual CH-benCHmark queries. Only once most of those bugs are found and fixed does it make sense to add more queries and look at their behavior.
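For step 3, a "very simple transaction" could start as small as a single-row update wrapped in BEGIN/COMMIT. A hypothetical sketch that generates one such file (table and column names are illustrative TPC-C-style names, not the actual contents of example_update.sql):

```shell
# Hypothetical generator for a minimal step-3 transaction file.
# tpcc_customer/tpcc_history and their columns are illustrative, not from this PR.
cat > /tmp/simple_txn.sql <<'SQL'
BEGIN;
UPDATE tpcc_customer SET c_balance = c_balance - 10 WHERE c_id = 1;
INSERT INTO tpcc_history (h_c_id, h_amount) VALUES (1, 10);
COMMIT;
SQL
grep -c ';' /tmp/simple_txn.sql   # → 4 (one per statement)
```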

A bit of a lengthy response 😂 but it also helped me structure my thoughts. I will create a tracking issue for this 👍

nanderstabel avatar Sep 13 '22 08:09 nanderstabel

I see, but you don't need to actually start Postgres and a Debezium server in CI, if the data has been pre-generated.

You can first run Postgres+Debezium+Kafka (Redpanda is recommended) locally, and ingest the prepared dataset into Pg. After a while, you will get a bunch of records flushed to Kafka (encoded in Debezium JSON). Then, you can output the records from Kafka to a single JSON file.

rpk topic consume -f '%v\n' --brokers 127.0.0.1:55971

Finally, the CI workflow can use the JSON file rather than the real Postgres as the test input. I think it would require no change to the current workflow.
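Once the records are flushed to a file (one Debezium JSON value per line, as the rpk command above produces), CI can inspect the dataset without any running Postgres. A sketch, with sample data standing in for the real dump:

```shell
# Sketch: given a dump of Debezium JSON values (one per line), tally change types.
# The three records below are made-up stand-ins for real rpk output.
cat > /tmp/dump.json <<'EOF'
{"before":null,"after":{"c_id":1},"op":"c"}
{"before":{"c_id":1},"after":{"c_id":1},"op":"u"}
{"before":{"c_id":1},"after":null,"op":"d"}
EOF

for op in c u d; do
  n=$(grep -c "\"op\":\"$op\"" /tmp/dump.json)
  echo "$op: $n"
done
```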

neverchanje avatar Sep 13 '22 08:09 neverchanje

Yes, I agree that would be a better way to go about this (and it would not change the current workflow) 🤔 Perhaps the updates I made to docker-compose.yml can still act as a useful tool for generating records, though. But I indeed think that step 2 can be skipped, at least for now. So I will continue with step 3 (utilizing Redpanda to generate the JSON input directly), since that has the highest priority at the moment 👍

nanderstabel avatar Sep 13 '22 08:09 nanderstabel

Yeah, just remove Debezium and Postgres from the docker-compose file for now, and add them back if really needed. (I don't know whether they would slow down the CI.)

neverchanje avatar Sep 13 '22 09:09 neverchanje

I will create a tracking issue for this 👍

Have you created one? If so, please link it here. BTW, how about closing (or updating) this issue: https://github.com/risingwavelabs/risingwave/issues/5190

fuyufjh avatar Sep 26 '22 02:09 fuyufjh

I will create a tracking issue for this 👍

Have you created one? If so, please link it here. BTW, how about closing (or updating) this issue: #5190

Yes, will update 👍

nanderstabel avatar Sep 26 '22 07:09 nanderstabel