ClickHouse
ClickHouse copied to clipboard
SortingStep: deduce way to sort based on input stream sort description
Changelog category (leave one):
- Performance Improvement
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Deduce way to sort based on input stream sort description. Skip sorting if input stream is already sorted
@Mergifyio update
update
✅ Branch has been successfully updated
@nickitat @KochetovNicolai would you mind to review?
some of the possible bad cases from the chat with Vladimir https://pastila.nl/?009cde16/669061285f5478a836a4837d2b41ca00 added just to not be lost from the context
distinct_in_order
perf test seems to be unstable
I also would like to mention in the comments that we need to handle discussed case with aliases, just to not forgot about it.
02377_optimize_sorting_for_input_stream_explain - need to update reference
Found.
create table tab (x UInt64) engine = MergeTree order by x + 1;
insert into tab select number from numbers(4);
select x, y from (select x + 1 as y, sipHash64(x) as x from (select x from tab order by x + 1)) order by x + 1;
SELECT
x,
y
FROM
(
SELECT
x + 1 AS y,
sipHash64(x) AS x
FROM
(
SELECT x
FROM tab
ORDER BY x + 1 ASC
)
)
ORDER BY x + 1 ASC
Query id: a032735e-1364-4193-8156-6511b478dd67
┌────────────────────x─┬────────────────────y─┐
│ 16738165381834614119 │ 16738165381834614120 │
│ 9224715256000962398 │ 9224715256000962399 │
│ 13686418376000424449 │ 13686418376000424450 │
│ 14247234200463187066 │ 14247234200463187067 │
└──────────────────────┴──────────────────────┘
set optimize_sorting_for_input_stream=0;
select x, y from (select x + 1 as y, sipHash64(x) as x from (select x from tab order by x + 1)) order by x + 1;
SELECT
x,
y
FROM
(
SELECT
x + 1 AS y,
sipHash64(x) AS x
FROM
(
SELECT x
FROM tab
ORDER BY x + 1 ASC
)
)
ORDER BY x + 1 ASC
Query id: d4611d8c-890c-4ce0-82d5-396ed61b5592
┌────────────────────x─┬────────────────────y─┐
│ 9224715256000962398 │ 9224715256000962399 │
│ 13686418376000424449 │ 13686418376000424450 │
│ 14247234200463187066 │ 14247234200463187067 │
│ 16738165381834614119 │ 16738165381834614120 │
└──────────────────────┴──────────────────────┘
@KochetovNicolai Please check the fix. It addresses the case you provided
@KochetovNicolai @nickitat Please could you review it once more? It should be ready. I've added comments regarding possible follow-ups
"global" is a good name. "stream", I am afraid, will lead to confusion between DataStream and streams on pipeline level. imo "port" is ok. and "stream" also ok for me.
It looks to me as opposite. Regarding parallel processing, we're saying that we have several data streams processed in parallel - if each stream is sorted, then sorting scope (we call it sort mode) is stream. Port is pipeline level implementation detail.
Stress test failure is known issue #39857, see https://s3.amazonaws.com/clickhouse-test-reports/38719/fdd699e8a377c25bf8ff9e1d2c14f760cc79a4ee/stress_test__undefined_/fatal_messages.txt
"global" is a good name. "stream", I am afraid, will lead to confusion between DataStream and streams on pipeline level. imo "port" is ok. and "stream" also ok for me.
It looks to me as opposite. Regarding parallel processing, we're saying that we have several data streams processed in parallel - if each stream is sorted, then sorting scope (we call it sort mode) is stream. Port is pipeline level implementation detail.
See proposal for renaming #40305