
insertion deduplication on retries for materialised views

Open CheSema opened this issue 1 year ago • 5 comments

Implements ideas from https://github.com/ClickHouse/ClickHouse/issues/60008

Docs in progress: https://github.com/ClickHouse/clickhouse-docs/pull/2394

I improved deduplication by enhancing the annotation of chunks at the pipeline level. Each chunk can now carry several attached structures that share the base class ChunkInfo and differ by their derived type. These annotations travel with the chunks through the Processors. See Chunk::ChunkInfoCollection, CollectionOfDerivedItems<ChunkInfo>.
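To make the idea of per-type chunk annotations concrete, here is a minimal Python sketch. It is a toy model, not the ClickHouse C++ code: the class names mirror the ones mentioned above, RowCountInfo and passthrough_processor are hypothetical, and the rule "at most one annotation per derived type" is an assumption made for the sketch.

```python
from dataclasses import dataclass


class ChunkInfo:
    """Base class for annotations attached to a chunk (toy stand-in)."""


@dataclass
class TokenInfo(ChunkInfo):
    token: str


@dataclass
class RowCountInfo(ChunkInfo):  # hypothetical second annotation type
    rows: int


class ChunkInfoCollection:
    """Stores annotations keyed by their derived type (assumed: one per type)."""

    def __init__(self):
        self._items = {}

    def add(self, info):
        kind = type(info)
        if kind in self._items:
            raise ValueError(f"{kind.__name__} is already attached")
        self._items[kind] = info

    def get(self, kind):
        return self._items.get(kind)


@dataclass
class Chunk:
    rows: list
    infos: ChunkInfoCollection


def passthrough_processor(chunk):
    """Changes the rows but forwards the annotations untouched."""
    return Chunk(rows=[r * 2 for r in chunk.rows], infos=chunk.infos)


chunk = Chunk(rows=[1, 2, 3], infos=ChunkInfoCollection())
chunk.infos.add(TokenInfo(token="abc"))
chunk.infos.add(RowCountInfo(rows=3))
out = passthrough_processor(chunk)
```

After the processor runs, `out.infos.get(TokenInfo)` still returns the token that was attached before the transformation, which is the property the deduplication logic relies on.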

The deduplication token for each chunk is written as TokenInfo (a class derived from ChunkInfo) by SetInitialTokenTransform. After that, the token can be updated. See DeduplicationToken::TokenInfo::BuildingStage.

The initial value for TokenInfo is taken either from the insert_deduplication_token setting or calculated as a hash of the inserted data. To distinguish equal blocks that should not be deduplicated, TokenInfo is updated with more detailed information about the source of the data, such as the names of the MVs on the way to the table.
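The token derivation described above can be sketched roughly as follows. This is an illustration only: the function names, the use of SHA-256, and the "mv-" separator format are assumptions for the sketch and do not reflect the actual token layout in ClickHouse.

```python
import hashlib


def initial_token(rows, insert_deduplication_token=""):
    """A user-supplied token wins; otherwise hash the inserted data."""
    if insert_deduplication_token:
        return insert_deduplication_token
    return hashlib.sha256(repr(rows).encode()).hexdigest()[:16]


def extend_token(token, part):
    """Append a description of where the data travelled, e.g. a view name."""
    return f"{token}:{part}"


rows = [("a", 1), ("b", 2)]
base = initial_token(rows)

# The same block reaching a table through two different views
# ends up with two different tokens (the "mv-" prefix is made up).
via_mv1 = extend_token(base, "mv-mv1")
via_mv2 = extend_token(base, "mv-mv2")

# An explicit token from the setting replaces the data hash.
user = initial_token(rows, insert_deduplication_token="batch-42")
```

Because the view name becomes part of the token, equal blocks that arrive through different MVs are no longer mistaken for retries of one another, while a real retry of the same insert reproduces the same token.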

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

This PR changes how deduplication works for MVs. It fixes many cases, such as:

  • on the destination table: data is split into 2 or more blocks, and those blocks are considered duplicates when they are inserted in parallel.
  • on the MV destination table: equal blocks are deduplicated; this happens when an MV often produces equal results for different input data because it performs aggregation.
  • on the MV destination table: equal blocks that come from different MVs are deduplicated.

The setting update_insert_deduplication_token_in_dependent_materialized_views is deprecated. The deduplication token for blocks inserted into an MV is now always calculated from the source data.
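The aggregation case from the list can be reproduced with a small toy model. Everything here is hypothetical (the Destination class, the summing view, the token format); it only contrasts a token computed from the MV output with a token computed from the source data.

```python
import hashlib


def data_hash(rows):
    return hashlib.sha256(repr(rows).encode()).hexdigest()[:16]


def mv_total(rows):
    """Toy materialized view: a single row holding the sum of the values."""
    return [sum(rows)]


class Destination:
    """Toy MV destination table that remembers the tokens it has seen."""

    def __init__(self):
        self.seen = set()
        self.blocks = []

    def insert(self, block, token):
        if token in self.seen:
            return False  # treated as a duplicate and dropped
        self.seen.add(token)
        self.blocks.append(block)
        return True


def insert_via_mv(dst, source_rows, token_from_source):
    block = mv_total(source_rows)
    if token_from_source:
        token = data_hash(source_rows) + ":mv"  # based on the source data
    else:
        token = data_hash(block)  # based on the MV output (problematic case)
    return dst.insert(block, token)


# Two different inserts that aggregate to the same value (6).
old = Destination()
results_old = [insert_via_mv(old, [1, 5], False),
               insert_via_mv(old, [2, 4], False)]

new = Destination()
results_new = [insert_via_mv(new, [1, 5], True),
               insert_via_mv(new, [2, 4], True),
               insert_via_mv(new, [1, 5], True)]  # a genuine retry
```

With output-based tokens the second, distinct insert is wrongly dropped. With source-based tokens both inserts are kept and only the genuine retry of `[1, 5]` is deduplicated.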

CheSema, Mar 19 '24 15:03

This is an automated comment for commit 438fd899236b15468828c3dec751081fd07325d6 with description of existing statuses. It's updated for the latest CI running

❌ Click here to open a full report in a separate page

Check name | Description | Status
Flaky tests | Checks if new added or modified tests are flaky by running them repeatedly, in parallel, with more randomization. Functional tests are run 100 times with address sanitizer, and additional randomization of thread scheduling. Integration tests are run up to 10 times. If at least once a new test has failed, or was too long, this check will be red. We don't allow flaky tests, read the doc | ❌ failure
Integration tests | The integration tests report. In parenthesis the package type is given, and in square brackets are the optional part/total tests | ❌ failure

Successful checks

Check name | Description | Status
Builds | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success
Docs check | Builds and tests the documentation | ✅ success
Fast test | Normally this is the first check that is ran for a PR. It builds ClickHouse and runs most of stateless functional tests, omitting some. If it fails, further checks are not started until it is fixed. Look at the report to see which tests fail, then reproduce the failure locally as described here | ✅ success
Stateful tests | Runs stateful functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc | ✅ success
Stateless tests | Runs stateless functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc | ✅ success
Style check | Runs a set of checks to keep the code style clean. If some of tests failed, see the related log from the report | ✅ success
Unit tests | Runs the unit tests for different release types | ✅ success

robot-ch-test-poll, Mar 19 '24 15:03

Great. Fast tests have passed!

CheSema, May 22 '24 13:05

What is left:

  • CREATE WINDOW VIEW wv TO dst POPULATE AS SELECT -- done
  • rework how sinks consume chunks: they have to either split a chunk into a series of chunks, the same way it is split by partitioning, or mark the chunk with hashes from all partitions -- done
  • docs
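The two options for sinks mentioned in the list can be pictured with a short sketch. This is only a guess at the shape of the idea, not the actual sink code: the helper names, the tuple rows, and the token format are all hypothetical.

```python
import hashlib
from collections import defaultdict


def split_by_partition(rows, partition_key):
    """Group rows the way a partitioned table would split one chunk."""
    parts = defaultdict(list)
    for row in rows:
        parts[partition_key(row)].append(row)
    return dict(parts)


def part_hash(rows):
    return hashlib.sha256(repr(rows).encode()).hexdigest()[:8]


def tokens_per_part(chunk_token, parts):
    """Option 1: split the chunk and give each piece its own token."""
    return {pid: f"{chunk_token}:part-{pid}" for pid in parts}


def token_with_all_partition_hashes(chunk_token, parts):
    """Option 2: keep one chunk, marked with hashes from all partitions."""
    hashes = [part_hash(parts[pid]) for pid in sorted(parts)]
    return chunk_token + ":" + "-".join(hashes)


# Rows are (month, value); the month acts as the partition key.
rows = [(202401, "a"), (202402, "b"), (202401, "c")]
parts = split_by_partition(rows, lambda row: row[0])
per_part = tokens_per_part("tok", parts)
combined = token_with_all_partition_hashes("tok", parts)
```

Either way, every piece written to a partition ends up with a token that is stable across retries of the same chunk and distinct from the tokens of the other pieces.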

CheSema, May 31 '24 16:05

At commit 3db3b36 CI has passed, well, almost passed. The exceptions are Tidy build (fixed) and Stateless tests flaky check (asan).

I'm not going to fight for Stateless tests flaky check (asan). Stateless tests (asan) have passed, and that is enough.

CheSema, Jun 10 '24 20:06

I tried adding the no-parallel tag to the new tests. In particular, the test 03008_deduplication_mv_generates_several_blocks_nonreplicated does 64 probes with different settings, and each probe can create up to 100 new partitions. I see that it sometimes times out: there are no problems in the logs, it is just slow.

CheSema, Jun 17 '24 14:06

The only concern here is test_storage_s3_queue/test.py::test_multiple_tables_streaming_sync_distributed[ordered]. It failed in this PR for the first time after an insignificant change.

That test also flakes in other PRs with no clear relation to their changes: https://play.clickhouse.com/play?user=play#c2VsZWN0IAp0b1N0YXJ0T2ZIb3VyKGNoZWNrX3N0YXJ0X3RpbWUpIGFzIGQsCmNvdW50KCksICBncm91cFVuaXFBcnJheShwdWxsX3JlcXVlc3RfbnVtYmVyKSwgIGFueShyZXBvcnRfdXJsKQpmcm9tIGNoZWNrcyB3aGVyZSAnMjAyNC0wMS0wMScgPD0gY2hlY2tfc3RhcnRfdGltZSBhbmQgdGVzdF9uYW1lIGxpa2UgJyV0ZXN0X3N0b3JhZ2VfczNfcXVldWUvdGVzdC5weTo6dGVzdF9tdWx0aXBsZV90YWJsZXNfc3RyZWFtaW5nX3N5bmNfZGlzdHJpYnV0ZWRbb3JkZXJlZF0lJyBhbmQgdGVzdF9zdGF0dXMgaW4gKCdGQUlMJywgJ0ZMQUtZJykgZ3JvdXAgYnkgZCBvcmRlciBieSBkIGRlc2M=

The test logic requires that both S3Queue instances, on different nodes, read some of the files. From my point of view this is a race condition: it is legitimate that sometimes only one of the two nodes processes all the files.

CheSema, Jul 04 '24 12:07

03172_error_log_table_not_empty is flaky; I am fixing it in https://github.com/ClickHouse/ClickHouse/pull/66093

CheSema, Jul 04 '24 13:07

It is interesting

chassert(isUniqTypes()); -- https://s3.amazonaws.com/clickhouse-test-reports/66093/78a2139f2a43752196a029995b6965ada359c954/stress_test__tsan_.html https://github.com/ClickHouse/ClickHouse/issues/66122

Upgrade Check -- Changed settings are not reflected in settings changes history (see changed_settings.txt): update_insert_deduplication_token_in_dependent_materialized_views

01275_parallel_mv -- is flaky: https://s3.amazonaws.com/clickhouse-test-reports/66093/78a2139f2a43752196a029995b6965ada359c954/stateless_tests__aarch64_.html

I did not see it in CI here!

CheSema, Jul 05 '24 10:07

It is interesting

You only ran a partial CI, not the full one (73 successful, 4 skipped, and 8 failing checks). Test_3 was never run because Test_2 was never fully successful.

Algunenano, Jul 05 '24 12:07

00002_log_and_exception_messages_formatting was also affected by the introduction of a new noisy log debug: {}, token: {}

Algunenano, Jul 05 '24 12:07

It is interesting

You only ran a partial CI, not the full one (73 successful, 4 skipped, and 8 failing checks). Test_3 was never run because Test_2 was never fully successful.

That is sad. I'm reverting this change.

CheSema, Jul 05 '24 12:07