test: introduce ch-benchmark
I hereby agree to the terms of the Singularity Data, Inc. Contributor License Agreement.
What's changed and what's your intention?
To introduce a CH-benCHmark pipeline from Postgres to RisingWave.
The CH-benCHmark relies on both TPC-H and TPC-C tables.
In this PR:
- postgres->debezium->kafka->RisingWave
- Not everything is automated yet in CI --> help needed
- In `ci/ch-benchmark/transactions/example_update.sql`, two very simple update transactions are created.
- In `e2e_test/streaming/ch-benchmark/create_sources.slt.part`, materialized sources are created for all TPC-C tables.
- Also, in `e2e_test/streaming/ch-benchmark/`, two runnable CH-benCHmark tests are provided.
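For context on what flows through the pipeline above: Debezium publishes each row change to Kafka as a JSON envelope with `before`/`after`/`op`/`source` fields (the standard Debezium change-event shape). The sketch below constructs and round-trips one such event; the table and column values are made up for illustration and are not taken from this PR.

```python
import json

# Hypothetical Debezium-style change event for an update on a tpcc_ table.
# The envelope fields (before/after/op/source) follow the standard Debezium
# format; the concrete columns and values are invented for illustration.
event = {
    "before": {"i_id": 1, "i_price": 10.00},
    "after": {"i_id": 1, "i_price": 12.50},
    "source": {"connector": "postgresql", "table": "tpcc_item"},
    "op": "u",  # "c" = create, "u" = update, "d" = delete
}

record = json.dumps(event)          # what lands on the Kafka topic
decoded = json.loads(record)        # what RisingWave's source would ingest
assert decoded["op"] == "u"
print(decoded["after"]["i_price"])  # the post-update value
```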
Demonstration
```shell
# Start Docker containers and initialize Debezium and Postgres
docker compose -f ci/docker-compose.yml up -d && sleep 5
docker exec -it ci-debezium-1 sh /connectors/create-postgres-debezium-connector.sh >/dev/null 2>&1
docker cp ci/ch-benchmark/transactions/ ci-postgres-1:/transactions/
docker exec -u postgres ci-postgres-1 psql ch_benchmark_db postgres -f /transactions/example_update.sql >/dev/null 2>&1

# Start the RisingWave cluster
./risedev d

# Create and populate the TPC-H tables
./risedev slt -p 4566 -d dev ./e2e_test/tpch/create_tables.slt.part
./risedev slt -p 4566 -d dev ./e2e_test/tpch/**/insert_*.slt.part

# Create materialized sources
./risedev slt -p 4566 -d dev ./e2e_test/streaming/ch-benchmark/create_sources.slt.part

# Run two CH-benCHmark queries
./risedev slt -p 4566 -d dev ./e2e_test/streaming/ch-benchmark/q13.slt.part
./risedev slt -p 4566 -d dev ./e2e_test/streaming/ch-benchmark/q16.slt.part
```
Notes
- TPC-C tables are explicitly prefixed with `tpcc_`. This is to avoid confusion, since both TPC-H and TPC-C have a customer and order(s) table.
- The original CH-benCHmark queries use the mathematical function `MOD()`. It is currently replaced by the modulo operator `%`.
- The column prefix of the `tpcc_stock` table is changed from `s_` to `st_`, in order to distinguish it from the `supplier` column prefix (in TPC-H).
Checklist
- [x] I have written necessary rustdoc comments
- [x] I have added necessary unit tests and integration tests
- [x] All checks passed in `./risedev check` (or its alias, `./risedev c`)
Codecov Report
Merging #5155 (017c054) into main (5587547) will decrease coverage by `0.01%`. The diff coverage is `70.19%`.
```diff
@@            Coverage Diff             @@
##             main    #5155      +/-   ##
==========================================
- Coverage   74.26%   74.25%    -0.02%
==========================================
  Files         906      906
  Lines      142634   142970      +336
==========================================
+ Hits       105925   106159      +234
- Misses      36709    36811      +102
==========================================
```
| Flag | Coverage Δ | |
|---|---|---|
| rust | `74.25% <70.19%> (-0.02%)` | :arrow_down: |
Flags with carried forward coverage won't be shown.
| Impacted Files | Coverage Δ | |
|---|---|---|
| src/batch/src/executor/expand.rs | `53.33% <ø> (ø)` | |
| src/batch/src/executor/filter.rs | `86.63% <ø> (ø)` | |
| src/batch/src/executor/join/nested_loop_join.rs | `91.90% <ø> (ø)` | |
| src/batch/src/executor/join/sort_merge_join.rs | `79.12% <ø> (ø)` | |
| src/batch/src/executor/order_by.rs | `95.21% <ø> (ø)` | |
| src/batch/src/executor/row_seq_scan.rs | `17.59% <0.00%> (+0.08%)` | :arrow_up: |
| src/batch/src/executor/sys_row_seq_scan.rs | `0.00% <ø> (ø)` | |
| src/batch/src/executor/top_n.rs | `79.72% <ø> (ø)` | |
| src/batch/src/executor/values.rs | `97.68% <ø> (-0.02%)` | :arrow_down: |
| src/common/src/array/stream_chunk.rs | `85.00% <0.00%> (ø)` | |
| ... and 121 more | | |
Seems like there are some issues to be resolved. Could you please open a tracking issue for this, like #1380?
Looks like only q13 and q16 passed?
Most queries do pass (run without panic) but do not produce output, due to the limited amount of data in the TPC-C tables in the Postgres database container. q13 and q16 run successfully as 'statements', but the queries in this PR are only READs; there is not a whole bunch of UPDATEs/DELETEs/INSERTs yet. The updates in the simple transaction I added in `ci/ch-benchmark/transactions/example_update.sql` are actually not even parsed (and thus skipped) by RisingWave. I have added these two queries mainly to demonstrate that the whole pipeline is functional; the actual content of the queries is not that important right now, I guess. The logical steps regarding this benchmark, imo, are these:
- Build the postgres->debezium->kafka->RisingWave pipeline (this PR)
- Add this pipeline to the CI benchmark pipeline (Sumit will help me with this)
- Add a whole bunch of transactions that include UPDATEs/DELETEs/INSERTs. We/I need to decide how complex to make these transactions. Probably the best way is to start off with some very simple transactions to catch the first bugs. At a later stage we can decide to add the actual 'official' TPC-C transactions, which consist of multiple inserts/updates and randomly generated data (more details: https://www.tpc.org/tpc_documents_current_versions/pdf/tpc-c_v5.11.0.pdf#page=28).
- After these steps are completed, it makes sense to add more of the CH-benCHmark queries. I have already 'translated' these queries locally. The main goal of this step would be to curate specific input data for each of those queries so that they produce actual output. However, as @fuyufjh mentioned, the priority is to catch any bugs that arise just from ingesting update/delete input from Kafka; this is unrelated to the actual CH-benCHmark queries. Only once most of those bugs are found and fixed does it make sense to add more queries and look at their behavior.
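The "very simple transactions" mentioned in the third step above could start as small as a single-statement generator. A hypothetical sketch (the `tpcc_warehouse` table with `w_id`/`w_ytd` columns follows standard TPC-C naming plus this PR's `tpcc_` prefix; the Payment-style update is an assumption for illustration, not code from this PR):

```python
import random

def simple_payment_txn(rng: random.Random) -> str:
    # Payment-style transaction: add a random amount to a warehouse's
    # year-to-date balance (w_ytd), keyed by warehouse id (w_id).
    w_id = rng.randint(1, 10)
    amount = round(rng.uniform(1.0, 5000.0), 2)
    return (
        "BEGIN;\n"
        f"UPDATE tpcc_warehouse SET w_ytd = w_ytd + {amount} "
        f"WHERE w_id = {w_id};\n"
        "COMMIT;\n"
    )

# Emit a few reproducible transactions to feed into psql.
rng = random.Random(42)
for _ in range(3):
    print(simple_payment_txn(rng))
```

Each emitted UPDATE would then flow through Debezium as an update event, which is exactly the input shape we want to stress-test.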
A bit lengthy response 😂 but it also helped to structure my thoughts. I will create a tracking issue for this 👍
I see, but you don't need to actually start Postgres and a Debezium server in CI, if the data has been pre-generated.
You can first run Postgres+Debezium+Kafka (Redpanda is recommended) locally, and ingest the prepared dataset into Pg. After a while, you will get a bunch of records flushed to Kafka (encoded in Debezium JSON). Then, you can output the records from Kafka to a single JSON file.
```shell
rpk topic consume -f '%v\n' --brokers 127.0.0.1:55971
```
Finally, the CI workflow can use the JSON file rather than the real Postgres as the test input. I think it would require no change to the current workflow.
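To make the suggestion concrete: the dump produced by the `rpk` command above is one Debezium JSON document per line, so a test harness can consume it without any running Postgres. A rough sketch (the in-memory dump contents and field accesses are assumptions for illustration):

```python
import io
import json

# Stand-in for the file dumped by `rpk topic consume -f '%v\n' ...`:
# one Debezium JSON record per line. Real runs would use open(path).
dump = io.StringIO(
    '{"op": "c", "after": {"c_id": 1}}\n'
    '{"op": "u", "before": {"c_id": 1}, "after": {"c_id": 1}}\n'
    '{"op": "d", "before": {"c_id": 1}}\n'
)

# Tally creates/updates/deletes as a basic sanity check on the dataset.
ops = {"c": 0, "u": 0, "d": 0}
for line in dump:
    line = line.strip()
    if not line:
        continue
    record = json.loads(line)
    ops[record["op"]] += 1

print(ops)  # counts of creates/updates/deletes in the dump
```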
Yes, I agree that would be a better way to go about this (and would not change the current workflow) 🤔 Perhaps the updates I made to docker-compose.yml can still act as a useful tool to generate records, though. But I indeed think that step 2 can be skipped, at least for now. So, I will continue with step 3 (utilizing Redpanda to generate the JSON input directly), since that has the highest priority atm 👍
Yeah, just remove Debezium and Postgres from the docker-compose file for now. Add them back if really needed. (IDK if they will slow down the CI.)
I will create a tracking issue for this 👍
Have you created one? If so, please link it here. BTW, how about closing (or updating) this issue: https://github.com/risingwavelabs/risingwave/issues/5190
Yes, will update 👍