ClickHouse icon indicating copy to clipboard operation
ClickHouse copied to clipboard

SortingStep: deduce way to sort based on input stream sort description

Open devcrafter opened this issue 2 years ago • 8 comments

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Deduce way to sort based on input stream sort description. Skip sorting if input stream is already sorted

devcrafter avatar Jul 01 '22 19:07 devcrafter

@Mergifyio update

devcrafter avatar Jul 08 '22 15:07 devcrafter

update

✅ Branch has been successfully updated

mergify[bot] avatar Jul 08 '22 15:07 mergify[bot]

@nickitat @KochetovNicolai would you mind to review?

devcrafter avatar Jul 19 '22 19:07 devcrafter

some of the possible bad cases from the chat with Vladimir https://pastila.nl/?009cde16/669061285f5478a836a4837d2b41ca00 added just to not be lost from the context

nickitat avatar Jul 21 '22 11:07 nickitat

distinct_in_order perf test seems to be unstable

nickitat avatar Jul 21 '22 13:07 nickitat

I also would like to mention in the comments that we need to handle discussed case with aliases, just to not forgot about it.

nickitat avatar Jul 21 '22 15:07 nickitat

02377_optimize_sorting_for_input_stream_explain - need to update reference

qoega avatar Aug 09 '22 14:08 qoega

Found.

create table tab (x UInt64) engine = MergeTree order by x + 1;
insert into tab select number from numbers(4);
select x, y from (select x + 1 as y, sipHash64(x) as x from (select x from tab order by x + 1)) order by x + 1;

SELECT
    x,
    y
FROM
(
    SELECT
        x + 1 AS y,
        sipHash64(x) AS x
    FROM
    (
        SELECT x
        FROM tab
        ORDER BY x + 1 ASC
    )
)
ORDER BY x + 1 ASC

Query id: a032735e-1364-4193-8156-6511b478dd67

┌────────────────────x─┬────────────────────y─┐
│ 16738165381834614119 │ 16738165381834614120 │
│  9224715256000962398 │  9224715256000962399 │
│ 13686418376000424449 │ 13686418376000424450 │
│ 14247234200463187066 │ 14247234200463187067 │
└──────────────────────┴──────────────────────┘
set optimize_sorting_for_input_stream=0;
select x, y from (select x + 1 as y, sipHash64(x) as x from (select x from tab order by x + 1)) order by x + 1;

SELECT
    x,
    y
FROM
(
    SELECT
        x + 1 AS y,
        sipHash64(x) AS x
    FROM
    (
        SELECT x
        FROM tab
        ORDER BY x + 1 ASC
    )
)
ORDER BY x + 1 ASC

Query id: d4611d8c-890c-4ce0-82d5-396ed61b5592

┌────────────────────x─┬────────────────────y─┐
│  9224715256000962398 │  9224715256000962399 │
│ 13686418376000424449 │ 13686418376000424450 │
│ 14247234200463187066 │ 14247234200463187067 │
│ 16738165381834614119 │ 16738165381834614120 │
└──────────────────────┴──────────────────────┘

KochetovNicolai avatar Aug 09 '22 19:08 KochetovNicolai

@KochetovNicolai Please check the fix. It addresses the case you provided

devcrafter avatar Aug 12 '22 15:08 devcrafter

@KochetovNicolai @nickitat Please could you review it once more? It should be ready. I've added comments regarding possible follow-ups

devcrafter avatar Aug 14 '22 19:08 devcrafter

"global" is a good name. "stream", I am afraid, will lead to confusion between DataStream and streams on pipeline level. imo "port" is ok. and "stream" also ok for me.

It looks to me as opposite. Regarding parallel processing, we're saying that we have several data streams processed in parallel - if each stream is sorted, then sorting scope (we call it sort mode) is stream. Port is pipeline level implementation detail.

devcrafter avatar Aug 16 '22 13:08 devcrafter

Stress test failure is known issue #39857, see https://s3.amazonaws.com/clickhouse-test-reports/38719/fdd699e8a377c25bf8ff9e1d2c14f760cc79a4ee/stress_test__undefined_/fatal_messages.txt

devcrafter avatar Aug 17 '22 10:08 devcrafter

"global" is a good name. "stream", I am afraid, will lead to confusion between DataStream and streams on pipeline level. imo "port" is ok. and "stream" also ok for me.

It looks to me as opposite. Regarding parallel processing, we're saying that we have several data streams processed in parallel - if each stream is sorted, then sorting scope (we call it sort mode) is stream. Port is pipeline level implementation detail.

See proposal for renaming #40305

devcrafter avatar Aug 17 '22 12:08 devcrafter