nightly-20240130 nexmark-q5-many-windows perf degradation

Open cyliu0 opened this issue 1 year ago • 5 comments
Describe the bug

image

https://buildkite.com/risingwave-test/nexmark-benchmark/builds/2944#018d5b1a-81cd-4004-b5c1-21cf9024a263

https://grafana.test.risingwave-cloud.xyz/d/liz0yRCZz1/log-search-dashboard?orgId=1&var-data_source=Logging:%20test-useast1-eks-a&from=1706652288000&to=1706654091000&var-namespace=nexmark-bs-0-14-daily-20240130

Error message/log

No response

To Reproduce

No response

Expected behavior

No response

How did you deploy RisingWave?

No response

The version of RisingWave

nightly-20240130

Additional context

nightly-20240130

cyliu0 avatar Feb 05 '24 02:02 cyliu0

The throughput likely increased because of 0cd9ff11f0ac5eb10f06a1c2dde2b806334c261c https://github.com/risingwavelabs/risingwave/pull/14558

SCR-20240205-f40 https://github.com/risingwavelabs/rw-commits-history?tab=readme-ov-file#nightly-20240115

It now drops because of 9417409957dbdc047cc758b6652ae5134a68a9b4 https://github.com/risingwavelabs/risingwave/pull/14855

14558 vs 14855, what a coincidence

lmatz avatar Feb 05 '24 02:02 lmatz

Interesting, but I can't explain this phenomenon. IMO this PR should only have an effect on UDF :)

TennyZhuang avatar Feb 05 '24 08:02 TennyZhuang

Interesting, +1. Worth investigating the cause, I think.

fuyufjh avatar Feb 19 '24 05:02 fuyufjh

Any update? The perf never came back to the original level: http://metabase.risingwave-cloud.xyz/question/1304-nexmark-q5-many-windows-blackhole-medium-1cn-avg-source-output-rows-per-second-rows-s-history-thtb-266?start_date=2024-01-22

cyliu0 avatar Apr 30 '24 07:04 cyliu0

@TennyZhuang Any updates?

fuyufjh avatar May 08 '24 06:05 fuyufjh

CREATE SINK nexmark_q5_many_windows
AS
SELECT
    AuctionBids.auction, AuctionBids.num
FROM (
    SELECT
        bid.auction,
        count(*) AS num,
        window_start AS starttime
    FROM
        HOP(bid, date_time, INTERVAL '5' SECOND, INTERVAL '5' MINUTE)
    GROUP BY
        bid.auction,
        window_start
) AS AuctionBids
JOIN (
    SELECT
        max(CountBids.num) AS maxn,
        CountBids.starttime_c
    FROM (
        SELECT
            count(*) AS num,
            window_start AS starttime_c
        FROM
            HOP(bid, date_time, INTERVAL '5' SECOND, INTERVAL '5' MINUTE)
        GROUP BY
            bid.auction,
            window_start
    ) AS CountBids
    GROUP BY
        CountBids.starttime_c
) AS MaxBids
ON
    AuctionBids.starttime = MaxBids.starttime_c AND
    AuctionBids.num >= MaxBids.maxn
WITH ( connector = 'blackhole', type = 'append-only', force_append_only = 'true');

Plan:

 StreamSink { type: append-only, columns: [auction, num, window_start(hidden), window_start#1(hidden)] }
 └─StreamProject { exprs: [$expr1, count, window_start, window_start] }
   └─StreamFilter { predicate: (count >= max(count)) }
     └─StreamHashJoin { type: Inner, predicate: window_start = window_start }
       ├─StreamExchange { dist: HashShard(window_start) }
       │ └─StreamShare { id: 7 }
       │   └─StreamHashAgg [append_only] { group_key: [$expr1, window_start], aggs: [count] }
       │     └─StreamExchange { dist: HashShard($expr1, window_start) }
       │       └─StreamHopWindow { time_col: $expr2, slide: 00:00:05, size: 00:05:00, output: [$expr1, window_start, _row_id] }
       │         └─StreamProject { exprs: [Field(bid, 0:Int32) as $expr1, Field(bid, 5:Int32) as $expr2, _row_id] }
       │           └─StreamFilter { predicate: IsNotNull(Field(bid, 5:Int32)) AND (event_type = 2:Int32) }
       │             └─StreamRowIdGen { row_id_index: 4 }
       │               └─StreamSource { source: nexmark, columns: [event_type, person, auction, bid, _row_id] }
       └─StreamProject { exprs: [window_start, max(count)] }
         └─StreamHashAgg { group_key: [window_start], aggs: [max(count), count] }
           └─StreamExchange { dist: HashShard(window_start) }
             └─StreamShare { id: 7 }
               └─StreamHashAgg [append_only] { group_key: [$expr1, window_start], aggs: [count] }
                 └─StreamExchange { dist: HashShard($expr1, window_start) }
                   └─StreamHopWindow { time_col: $expr2, slide: 00:00:05, size: 00:05:00, output: [$expr1, window_start, _row_id] }
                     └─StreamProject { exprs: [Field(bid, 0:Int32) as $expr1, Field(bid, 5:Int32) as $expr2, _row_id] }
                       └─StreamFilter { predicate: IsNotNull(Field(bid, 5:Int32)) AND (event_type = 2:Int32) }
                         └─StreamRowIdGen { row_id_index: 4 }
                           └─StreamSource { source: nexmark, columns: [event_type, person, auction, bid, _row_id] }
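
For context on the "many windows" part: with a 5-second slide and a 5-minute size, each bid falls into 300 / 5 = 60 hopping windows, so StreamHopWindow emits roughly 60 rows per input row. The Python sketch below only illustrates that arithmetic. The helper `hop_window_starts` is hypothetical, and it assumes window starts are aligned to multiples of the slide since the Unix epoch and that the size is a multiple of the slide; it is not taken from the RisingWave source.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

def hop_window_starts(ts, slide, size):
    """Return the start of every window [start, start + size) that contains ts.

    Assumption (for illustration only): window starts are aligned to
    multiples of `slide` since the Unix epoch, and `size` is a multiple
    of `slide`.
    """
    # Latest aligned window start that is <= ts.
    last_start = ts - (ts - EPOCH) % slide
    # Number of overlapping windows that cover any single timestamp.
    windows_per_row = size // slide
    return [last_start - i * slide for i in range(windows_per_row)]

slide = timedelta(seconds=5)
size = timedelta(minutes=5)
ts = datetime(2024, 1, 30, 12, 0, 7)
starts = hop_window_starts(ts, slide, size)
print(len(starts))  # number of windows this single bid belongs to
```

In the plan above, both join inputs read from the same StreamShare { id: 7 }, so this fan-out appears to happen once, before the first StreamHashAgg, rather than once per branch.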

fuyufjh avatar May 16 '24 10:05 fuyufjh

It's indeed caused by https://github.com/risingwavelabs/risingwave/pull/14558, but the reason is unknown. Will continue to investigate.

CPU flamegraph: profile results.zip

fuyufjh avatar May 17 '24 10:05 fuyufjh

> It's indeed caused by #14558, but the reason is unknown. Will continue to investigate.

No results.

Let's close the issue as the problem was already solved.

fuyufjh avatar Jul 10 '24 08:07 fuyufjh