ClickHouse
Insertion deduplication on retries for materialised views
Implements ideas from https://github.com/ClickHouse/ClickHouse/issues/60008
Docs in progress: https://github.com/ClickHouse/clickhouse-docs/pull/2394
I improved deduplication by enhancing the annotation of chunks at the pipeline level.
Now each chunk can carry several attached structures with the base class ChunkInfo, which differ by their derived type. This annotation is passed along with the chunks through the Processors. See Chunk::ChunkInfoCollection, CollectionOfDerivedItems<ChunkInfo>.
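To illustrate the idea of a per-chunk collection keyed by derived type, here is a minimal Python model. It is not the ClickHouse C++ implementation: the class RowCountInfo, the function annotate_rows, and the rule that a second item of the same type is rejected are assumptions made for this sketch.

```python
from typing import Dict, Optional, Type, TypeVar


class ChunkInfo:
    """Base class for annotations attached to a chunk."""


class TokenInfo(ChunkInfo):
    def __init__(self, token: str) -> None:
        self.token = token


class RowCountInfo(ChunkInfo):
    """Hypothetical second annotation type, used only in this sketch."""

    def __init__(self, rows: int) -> None:
        self.rows = rows


T = TypeVar("T", bound=ChunkInfo)


class ChunkInfoCollection:
    """Holds at most one annotation per derived ChunkInfo type (assumption)."""

    def __init__(self) -> None:
        self._items: Dict[type, ChunkInfo] = {}

    def add(self, info: ChunkInfo) -> None:
        key = type(info)
        if key in self._items:
            raise ValueError(f"{key.__name__} is already attached")
        self._items[key] = info

    def get(self, kind: Type[T]) -> Optional[T]:
        # Look the annotation up by its derived type.
        return self._items.get(kind)  # type: ignore[return-value]


class Chunk:
    def __init__(self, rows: list) -> None:
        self.rows = rows
        self.infos = ChunkInfoCollection()


def annotate_rows(chunk: Chunk) -> Chunk:
    """Stand-in for a processor: adds its own annotation, keeps the others."""
    chunk.infos.add(RowCountInfo(len(chunk.rows)))
    return chunk


chunk = Chunk([1, 2, 3])
chunk.infos.add(TokenInfo("initial"))
chunk = annotate_rows(chunk)
```

After the chunk has passed through annotate_rows, both annotations can be retrieved independently by type, which is the property the pipeline relies on when different processors attach different kinds of information.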
The deduplication token for each chunk is written as TokenInfo (a class derived from ChunkInfo) by SetInitialTokenTransform. After that, the token can be updated. See DeduplicationToken::TokenInfo::BuildingStage.
The initial value for TokenInfo is taken either from the insert_deduplication_token setting or calculated as a hash of the inserted data.
To distinguish equal blocks that should not be deduplicated, TokenInfo is updated with more detailed information about the source of the data, such as the names of the MVs on the way to the table.
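The token construction described above can be sketched roughly as follows. This is a simplified Python model, not the actual code: the token format, the part prefixes, the use of SHA-256, and the helper names initial_token and through_view are all assumptions for illustration.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TokenInfo:
    # Ordered parts of the token; the real class tracks building stages instead.
    parts: List[str] = field(default_factory=list)

    def add_part(self, part: str) -> None:
        self.parts.append(part)

    def value(self) -> str:
        return ":".join(self.parts)


def initial_token(rows: List[str], user_token: Optional[str]) -> TokenInfo:
    """Start from the user-provided token if set, else from a hash of the data."""
    if user_token:
        return TokenInfo([f"user-token-{user_token}"])
    digest = hashlib.sha256("\n".join(rows).encode()).hexdigest()[:16]
    return TokenInfo([f"data-hash-{digest}"])


def through_view(token: TokenInfo, view_name: str) -> TokenInfo:
    """Extend a copy of the token with the name of the MV the data passed through."""
    extended = TokenInfo(list(token.parts))
    extended.add_part(f"mv-{view_name}")
    return extended


rows = ["a,1", "b,2"]
src = initial_token(rows, user_token=None)
via_mv1 = through_view(src, "mv1")
via_mv2 = through_view(src, "mv2")
```

In this model the same source block yields different tokens when it reaches a table through mv1 and through mv2, while a retry of the same insert reproduces the same token for each path.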
Changelog category (leave one):
- Improvement
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
This PR changes how deduplication for MVs works. It fixes a lot of cases, such as:
- on the destination table: data is split into 2 or more blocks, and those blocks are considered duplicates when they are inserted in parallel.
- on the MV destination table: equal blocks are deduplicated, which happens when the MV often produces equal data as a result for different input data due to performing aggregation.
- on the MV destination table: equal blocks which come from different MVs are deduplicated.
The setting update_insert_deduplication_token_in_dependent_materialized_views is deprecated. The deduplication token for blocks inserted in an MV is always calculated based on the source data.
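The effect of deriving the token from the source data can be shown with a toy simulation. Everything here is hypothetical: the Table class, the max() aggregation, and the token format are stand-ins, and the real deduplication log in ClickHouse works differently in detail.

```python
import hashlib
from typing import List, Set


def source_token(rows: List[int]) -> str:
    """Token derived from the source data, not from the MV output."""
    return hashlib.sha256(repr(rows).encode()).hexdigest()[:16]


class Table:
    """Toy destination table with a set of already seen tokens."""

    def __init__(self) -> None:
        self.blocks: List[List[int]] = []
        self.seen_tokens: Set[str] = set()

    def insert(self, block: List[int], token: str) -> bool:
        if token in self.seen_tokens:
            return False  # deduplicated
        self.seen_tokens.add(token)
        self.blocks.append(block)
        return True


def insert_through_view(dst: Table, view_name: str, rows: List[int]) -> bool:
    """Toy MV that aggregates the source rows into a single max() value."""
    block = [max(rows)]
    token = f"{source_token(rows)}:mv-{view_name}"
    return dst.insert(block, token)


dst = Table()
first = insert_through_view(dst, "mv1", [1, 5])             # inserted
different_source = insert_through_view(dst, "mv1", [2, 5])  # same output, new source
other_view = insert_through_view(dst, "mv2", [1, 5])        # same output, other MV
retry = insert_through_view(dst, "mv1", [1, 5])             # retry of the first insert
```

Here the three equal output blocks are all kept because their tokens differ, and only the retry of the first insert is dropped.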
This is an automated comment for commit 438fd899236b15468828c3dec751081fd07325d6 with a description of existing statuses. It is updated for the latest CI run.
❌ Click here to open a full report in a separate page
| Check name | Description | Status | 
|---|---|---|
| Flaky tests | Checks if new added or modified tests are flaky by running them repeatedly, in parallel, with more randomization. Functional tests are run 100 times with address sanitizer, and additional randomization of thread scheduling. Integration tests are run up to 10 times. If at least once a new test has failed, or was too long, this check will be red. We don't allow flaky tests, read the doc | ❌ failure | 
| Integration tests | The integration tests report. In parenthesis the package type is given, and in square brackets are the optional part/total tests | ❌ failure | 
Successful checks
| Check name | Description | Status | 
|---|---|---|
| Builds | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success | 
| Docs check | Builds and tests the documentation | ✅ success | 
| Fast test | Normally this is the first check that is ran for a PR. It builds ClickHouse and runs most of stateless functional tests, omitting some. If it fails, further checks are not started until it is fixed. Look at the report to see which tests fail, then reproduce the failure locally as described here | ✅ success | 
| Stateful tests | Runs stateful functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc | ✅ success | 
| Stateless tests | Runs stateless functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc | ✅ success | 
| Style check | Runs a set of checks to keep the code style clean. If some of tests failed, see the related log from the report | ✅ success | 
| Unit tests | Runs the unit tests for different release types | ✅ success | 
Great. Fast tests have passed!
What is left:
- CREATE WINDOW VIEW wv TO dst POPULATE AS SELECT - done
- rework how Sinks consume chunks: they have to either split a chunk into a series, as it is split by partitioning, or mark the chunk with hashes from all partitions -- done
- docs
At commit 3db3b36 CI has passed, or almost passed. The exceptions are the Tidy build (fixed) and the Stateless tests flaky check (asan).
I'm not going to fight for the Stateless tests flaky check (asan).
Stateless tests (asan) have passed, and that is enough.
I tried adding the no-parallel tag to the new tests.
In particular, the test 03008_deduplication_mv_generates_several_blocks_nonreplicated does 64 probes with different settings. In each probe it can create up to 100 new partitions. I see that it sometimes times out; there are no problems in the logs, it is just slow.
The only concern here is test_storage_s3_queue/test.py::test_multiple_tables_streaming_sync_distributed[ordered]
It failed in this PR for the first time on an insignificant change.
That test flakes in other PRs with no clear relation to the changes: https://play.clickhouse.com/play?user=play#c2VsZWN0IAp0b1N0YXJ0T2ZIb3VyKGNoZWNrX3N0YXJ0X3RpbWUpIGFzIGQsCmNvdW50KCksICBncm91cFVuaXFBcnJheShwdWxsX3JlcXVlc3RfbnVtYmVyKSwgIGFueShyZXBvcnRfdXJsKQpmcm9tIGNoZWNrcyB3aGVyZSAnMjAyNC0wMS0wMScgPD0gY2hlY2tfc3RhcnRfdGltZSBhbmQgdGVzdF9uYW1lIGxpa2UgJyV0ZXN0X3N0b3JhZ2VfczNfcXVldWUvdGVzdC5weTo6dGVzdF9tdWx0aXBsZV90YWJsZXNfc3RyZWFtaW5nX3N5bmNfZGlzdHJpYnV0ZWRbb3JkZXJlZF0lJyBhbmQgdGVzdF9zdGF0dXMgaW4gKCdGQUlMJywgJ0ZMQUtZJykgZ3JvdXAgYnkgZCBvcmRlciBieSBkIGRlc2M=
The test logic requires that both S3Queue tables on different nodes read some files. From my point of view it is a race condition: it is legitimate that sometimes only one of the two nodes processes all the files.
03172_error_log_table_not_empty is flaky; I am fixing it in https://github.com/ClickHouse/ClickHouse/pull/66093
It is interesting
chassert(isUniqTypes()); --
https://s3.amazonaws.com/clickhouse-test-reports/66093/78a2139f2a43752196a029995b6965ada359c954/stress_test__tsan_.html
https://github.com/ClickHouse/ClickHouse/issues/66122
Upgrade Check -- Changed settings are not reflected in settings changes history (see changed_settings.txt): update_insert_deduplication_token_in_dependent_materialized_views
01275_parallel_mv flakes: https://s3.amazonaws.com/clickhouse-test-reports/66093/78a2139f2a43752196a029995b6965ada359c954/stateless_tests__aarch64_.html
I did not see it in CI here!
It is interesting
You only ran a partial CI, not the full one: 73 successful, 4 skipped, and 8 failing checks. Test_3 was never run, because Test_2 was never fully successful.
00002_log_and_exception_messages_formatting was also affected by the introduction of a new noisy log message: debug: {}, token: {}
That is sad. I'm reverting this change.