
metrics(compaction): Huge cancelled compaction count shown in Grafana


Describe the bug

I saw a huge number of cancelled compaction tasks when running nexmark_q7, which looks abnormal. However, source throughput is not affected. If this is expected, I think we should have a separate panel for cancelled compaction tasks to avoid mixing them up with successful/failed compaction tasks.

[screenshot: Grafana panel showing the cancelled compaction task count]

To Reproduce

  1. Use risedev-bench to start a 3-cn cluster.
  2. Create sources:
CREATE SOURCE person (
    "id" BIGINT,
    "name" VARCHAR,
    "email_address" VARCHAR,
    "credit_card" VARCHAR,
    "city" VARCHAR,
    "state" VARCHAR,
    "date_time" TIMESTAMP,
    "extra" VARCHAR
) with (
    connector = 'nexmark',
    nexmark.table.type = 'Person',
    nexmark.split.num = '8',
    nexmark.min.event.gap.in.ns = '0'
) ROW FORMAT JSON;

CREATE SOURCE auction (
    "id" BIGINT,
    "item_name" VARCHAR,
    "description" VARCHAR,
    "initial_bid" BIGINT,
    "reserve" BIGINT,
    "date_time" TIMESTAMP,
    "expires" TIMESTAMP,
    "seller" BIGINT,
    "category" BIGINT,
    "extra" VARCHAR
) with (
    connector = 'nexmark',
    nexmark.table.type = 'Auction',
    nexmark.split.num = '8',
    nexmark.min.event.gap.in.ns = '0'
) ROW FORMAT JSON;

CREATE SOURCE bid (
    "auction" BIGINT,
    "bidder" BIGINT,
    "price" BIGINT,
    "channel" VARCHAR,
    "url" VARCHAR,
    "date_time" TIMESTAMP,
    "extra" VARCHAR
) with (
    connector = 'nexmark',
    nexmark.table.type = 'Bid',
    nexmark.split.num = '8',
    nexmark.min.event.gap.in.ns = '0'
) ROW FORMAT JSON;
  3. Create MView:
CREATE MATERIALIZED VIEW nexmark_q7
AS
SELECT
    B.auction,
    B.price,
    B.bidder,
    B.date_time
FROM
    bid B
    JOIN (
        SELECT
            MAX(price) AS maxprice,
            window_end AS date_time
        FROM
            TUMBLE(bid, date_time, INTERVAL '10' SECOND)
        GROUP BY
            window_end
    ) B1 ON B.price = B1.maxprice
WHERE
    B.date_time BETWEEN B1.date_time - INTERVAL '10' SECOND
    AND B1.date_time;
  4. Wait for 20 mins (a quick sanity-check query is sketched after these steps).
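
Before waiting out the full 20 minutes, a quick sanity check can confirm that ingestion is actually happening. This is a minimal sketch that assumes only the source and MView names created in the steps above:

-- Confirm that the three Nexmark sources are registered.
SHOW SOURCES;

-- A growing row count indicates that bids are flowing through the MView.
SELECT COUNT(*) FROM nexmark_q7;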

Expected behavior

No response

Additional context

No response

hzxa21 avatar Sep 13 '22 09:09 hzxa21

I think the cancellation is caused by no idle compactor, which is expected.

zwang28 avatar Sep 14 '22 06:09 zwang28

I think the cancellation is caused by no idle compactor, which is expected.

I see. However, repeatedly triggering compaction and then cancelling it is kind of a waste of work if we don't have an idle compactor. I think #5321 can help reduce the cancellation count.

hzxa21 avatar Sep 14 '22 07:09 hzxa21

I feel like we need to separate the report label (TaskStatus) to distinguish failure and cancellation in Grafana. This helps us find problems more directly. (#5430)

Li0k avatar Sep 14 '22 07:09 Li0k

After #5405:

[screenshot: Grafana dashboard after #5405]

Indeed, the reported issue has been fixed. But we can observe that the number of L0 SSTs is large, and the duration of each compaction is about 10s. IMO, the default value of no_available_compactor_stall_sec is a little too long.

Reverting the change and benching again:

[screenshot: Grafana dashboard with the change reverted]

Unfortunately, we found no significant change in the L0 SST count.

Li0k avatar Sep 16 '22 13:09 Li0k

Unfortunately, we found no significant change in the L0 SST count.

Some other observations:

  • Compaction Throughput has increased after the revert. It's strange that the throughput is higher but the SST count is similar.
  • Compactor Running Task Count was much more stable at the maximum limit before this PR. So yes, 10s might be a tad too long.

Though I haven't paid special attention to the dashboard, I suspect that after #5405 the CPU usage of the meta node should be lower when there's no idle compactor, and the log has certainly become much tidier 🤣 .

After resolving #5321, I would expect zero or few tasks canceled because of no idle compactor (depending on the implementation), and a stable running task count on the compactor, constantly hitting the maximum limit.

Gun9niR avatar Sep 16 '22 15:09 Gun9niR

I ran the same bench as Li0k, with both f903d4a6 and an earlier commit f1a84e5d that excludes the recent refactor of the compaction scheduler (no overwhelming cancellation problem). The L0 file number piles up in both cases, though f903d4a6 is worse, with almost double the L0 file number. Note that there are several other commits between f903d4a6 and f1a84e5d that can affect compaction efficiency.

./bin/bench --type=tpch-std --qps=800000 --scale 80.0 --query 20 --partition 12 \
    $(cat ~/risingwave-deploy/tpch-bench-args-kafka) \
    $(cat ~/risingwave-deploy/tpch-bench-args-frontend) \
    --prepare-sql ./assets/data/create_materialized_source.sql

Result of f1a84e5d (earlier commit): [dashboard screenshot]

Result of f903d4a6 (no_available_compactor_stall_sec=5): [dashboard screenshot]

zwang28 avatar Sep 17 '22 02:09 zwang28

~~I wonder if the L0 files piling up is caused by CPU contention, as we deploy the compactor and source in the same EC2 instance, and its CPU usage is close to 100%.~~ ~~Will confirm later.~~

With f903d4a6 (no_available_compactor_stall_sec=5) and each compactor deployed in a dedicated 16vCPU instance: with 1 compactor, the L0 file number still piles up, but the max number of L0 files is smaller. With 2 compactors, the L0 file number looks good now. But source throughput is lower.

[dashboard screenshot]

Full metrics

[full Grafana dashboard screenshot]

zwang28 avatar Sep 17 '22 03:09 zwang28

Note that there are several other commits between the base version f1a84e5d and the current version f903d4a6 that can affect compaction behavior. So we compare 1) f1a84e5d with 2) f1a84e5d + cherry-picked 3afa7cfb and f903d4a6 (no_available_compactor_stall_sec=5). The results show that the max L0 file number from both tests is 600~800. So I think 3afa7cfb + f903d4a6 itself doesn't introduce a regression.

zwang28 avatar Sep 19 '22 11:09 zwang28

But source throughput is lower.

What could be the cause of this? Cache invalidation?

Seems that after a while, the throughput goes back up?

jon-chuang avatar Sep 23 '22 03:09 jon-chuang