
v1.7.0-rc/nightly-20240201 source throughput down to 0 with non-shared PG CDC sources

Open cyliu0 opened this issue 1 year ago • 10 comments

Describe the bug

Run ch-benchmark with non-shared PG CDC sources with v1.7.0-rc/nightly-20240201

v1.7.0-rc: https://buildkite.com/risingwave-test/ch-benchmark-pg-cdc/builds/187
nightly-20240201: https://buildkite.com/risingwave-test/ch-benchmark-pg-cdc/builds/188

The Buildkite pipeline jobs failed the data consistency check before they completed the data sync, because the consistency check starts only after the source throughput has been at 0 for 60 seconds.
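For context, the gate described above ("start the consistency check only after the source throughput has been 0 for 60 seconds") can be sketched roughly as follows. The function name, poll interval, and metric callable are illustrative assumptions, not the actual Buildkite pipeline code:

```python
def wait_for_sync(sample_throughput, poll_secs=5, quiet_secs=60, max_polls=100_000):
    """Return True once the source throughput has stayed at 0 for quiet_secs.

    sample_throughput: callable returning the current source rows/s; in the
    real pipeline this would come from the Grafana/Prometheus metric.
    """
    quiet = 0
    for _ in range(max_polls):
        if sample_throughput() == 0:
            quiet += poll_secs
            if quiet >= quiet_secs:
                return True   # quiet long enough: safe to run the consistency check
        else:
            quiet = 0         # any traffic resets the quiet window
    return False              # gave up: the sync never went quiet

# Simulated metric: traffic, a short lull, traffic again, then fully quiet.
readings = iter([120, 80, 0, 0, 7] + [0] * 20)
assert wait_for_sync(lambda: next(readings), poll_secs=10, quiet_secs=60)
```

The failure mode in this issue is the inverse case: the throughput drops to 0 while the sync is still incomplete, so the gate opens too early and the check fails.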

Grafana

[Grafana dashboard screenshots]

Error message/log

No response

To Reproduce

No response

Expected behavior

No response

How did you deploy RisingWave?

No response

The version of RisingWave

v1.7.0-rc
nightly-20240201

Additional context

nightly-20240201

cyliu0 avatar Feb 02 '24 02:02 cyliu0

Update: reverting #14899 also reproduces the problem; investigating other PRs in the list.

StrikeW avatar Feb 04 '24 06:02 StrikeW

I found that the streaming query in the passed job generated much less data compared with the failed jobs. [screenshot]

And there is join amplification in the failed jobs: [screenshot]

I suspected the workload had changed, so I reran the pipeline with nightly-20240131; it also experienced barrier pile-up like the failed jobs.

So I think the pipeline failure is not caused by the code change. cc @lmatz if you have other information.

StrikeW avatar Feb 04 '24 10:02 StrikeW

Thanks for the findings. Let us check if there are any changes on the pipeline side.

lmatz avatar Feb 04 '24 11:02 lmatz

Recently, I added ch-benchmark q3 back to the pipeline. q3 had been removed since https://github.com/risingwavelabs/risingwave/issues/12777

So I think it's still a problem?

cyliu0 avatar Feb 04 '24 14:02 cyliu0

Reran the queries except q3 with v1.7.0-rc-1, and it passed: https://buildkite.com/risingwave-test/ch-benchmark-pg-cdc/builds/199

cyliu0 avatar Feb 05 '24 10:02 cyliu0

Hit again for v1.7.0-rc-1 with the following queries:

CH_BENCHMARK_QUERY="q1,q2,q4,q5,q6,q9,q10,q11,q12,q13,q14,q15,q16,q17,q18,q19,q20,q22"

https://buildkite.com/risingwave-test/ch-benchmark-pg-cdc/builds/201 https://grafana.test.risingwave-cloud.xyz/d/EpkBw5W4k/risingwave-dev-dashboard?orgId=1&from=1707121771852&to=1707137671126&var-datasource=ebec273b-0774-4ccd-90a9-c2a22144d623&var-namespace=ch-benchmark-pg-cdc-20240205-085511&var-instance=benchmark-risingwave&var-pod=All&var-component=All&var-table=All

cyliu0 avatar Feb 06 '24 03:02 cyliu0

Hit again with nightly-20240207

https://buildkite.com/risingwave-test/ch-benchmark-pg-cdc/builds/205#018d8597-b39f-48ed-bbea-9f1b27b911c8

https://grafana.test.risingwave-cloud.xyz/d/EpkBw5W4k/risingwave-dev-dashboard?orgId=1&var-datasource=ebec273b-0774-4ccd-90a9-c2a22144d623&var-namespace=ch-benchmark-pg-cdc-daily-20240207&var-instance=benchmark-risingwave&var-pod=All&var-component=All&var-table=All&from=1707342785684&to=1707348581400

cyliu0 avatar Feb 08 '24 02:02 cyliu0

https://grafana.test.risingwave-cloud.xyz/d/EpkBw5W4k/risingwave-dev-dashboard?orgId=1&var-datasource=ebec273b-0774-4ccd-90a9-c2a22144d623&var-namespace=ch-benchmark-pg-cdc-daily-20240207&var-instance=benchmark-risingwave&var-pod=All&var-component=All&var-table=All&from=1707342785684&to=1707348581400

[screenshot: compaction duration panel]

Why is the compaction duration so high?

lmatz avatar Feb 19 '24 02:02 lmatz

Hit again with nightly-20240207

https://buildkite.com/risingwave-test/ch-benchmark-pg-cdc/builds/205#018d8597-b39f-48ed-bbea-9f1b27b911c8

https://grafana.test.risingwave-cloud.xyz/d/EpkBw5W4k/risingwave-dev-dashboard?orgId=1&var-datasource=ebec273b-0774-4ccd-90a9-c2a22144d623&var-namespace=ch-benchmark-pg-cdc-daily-20240207&var-instance=benchmark-risingwave&var-pod=All&var-component=All&var-table=All&from=1707342785684&to=1707348581400

The symptom is the same as the conclusion in https://github.com/risingwavelabs/risingwave/issues/14943#issuecomment-1925692637:

  • Join amplification
  • Many L0 files

which cause barriers to pile up and backpressure the source.
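The backpressure chain can be illustrated with a toy bounded-buffer model (an illustration only, not RisingWave's executor code): once the channel between the source and a slow downstream operator fills, the source's accepted rate collapses to whatever the downstream can drain.

```python
from collections import deque

class BoundedChannel:
    """Toy fixed-capacity channel between two streaming operators."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.buf = deque()

    def try_send(self, item):
        if len(self.buf) >= self.capacity:
            return False          # buffer full: the producer is backpressured
        self.buf.append(item)
        return True

# The source offers 5 rows per tick, but the slow downstream operator
# (an amplified join, or a state store busy with L0 files) drains only 1.
chan = BoundedChannel(capacity=10)
accepted_per_tick = []
for tick in range(10):
    accepted = sum(chan.try_send(("row", tick, i)) for i in range(5))
    if chan.buf:
        chan.buf.popleft()        # downstream consumes one item per tick
    accepted_per_tick.append(accepted)

# Once the buffer fills, the source's accepted rate collapses from 5/tick
# to the consumer's rate of 1/tick — throughput appears to drop toward 0.
assert accepted_per_tick[0] == 5 and accepted_per_tick[-1] == 1
```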

StrikeW avatar Feb 19 '24 03:02 StrikeW

which cause barriers to pile up and backpressure the source.

Remove it from blockers for now.

  • Join amplification
  • Many L0 files

Join amplification is expected, as it is determined by the nature of the query and the data, but I wonder why there are so many L0 files.
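Why join amplification follows from the query and the data can be seen with a small model of an equi-join's output size (a hypothetical sketch, not any ch-benchmark query): one skewed key with many matches on both sides multiplies output rows far beyond the input size.

```python
def join_output_rows(left_key_counts, right_key_counts):
    """Output size of an equi-join: for each key, |left rows| * |right rows|."""
    return sum(n * right_key_counts.get(key, 0)
               for key, n in left_key_counts.items())

# A single hot key with 1,000 matching rows on each side already turns
# roughly 2,000 input rows into over a million output rows.
left = {"hot": 1_000, "a": 1, "b": 1}
right = {"hot": 1_000, "a": 1}
assert join_output_rows(left, right) == 1_000_001
```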

lmatz avatar Feb 19 '24 03:02 lmatz

Ping, any updates?

fuyufjh avatar Apr 08 '24 09:04 fuyufjh

@cyliu0 could you run one more time but with more resources?

I think the point of this test is just to make sure that non-shared PG CDC sources don't block themselves somehow; if the slowness/freeze is caused by the query, then it does not matter.

lmatz avatar Apr 08 '24 09:04 lmatz

@cyliu0 could you run one more time but with more resources?

Hit this again while running with bigger memory on nightly-20240408:

compactor = { limit = "12Gi", request = "12Gi" }
compute = { limit = "24Gi", request = "24Gi" }

https://grafana.test.risingwave-cloud.xyz/d/EpkBw5W4k/risingwave-dev-dashboard?orgId=1&var-datasource=P2453400D1763B4D9&var-namespace=ch-benchmark-pg-cdc-pipeline&var-instance=benchmark-risingwave&var-pod=All&var-component=All&var-table=All&from=1712630508369&to=1712632212485 [screenshot]

but if the slowness/freeze is caused by the query, then it does not matter.

It's caused by ch-benchmark q3 in this case.

@StrikeW Shall we keep this issue for future enhancements? Or close it now?

cyliu0 avatar Apr 09 '24 03:04 cyliu0

but if the slowness/freeze is caused by the query, then it does not matter.

It's caused by ch-benchmark q3 in this case.

@StrikeW Shall we keep this issue for future enhancements? Or close it now?

Optimizing the query performance should be tracked in another issue. Let's close this one.

StrikeW avatar Apr 26 '24 06:04 StrikeW

The issue still exists with nightly-20240507. Which issue covers this right now? @StrikeW https://buildkite.com/risingwave-test/ch-benchmark-pg-cdc-shared-source/builds/41#018f5655-3749-4029-a4b8-cc0c321eb18a [screenshot]

cyliu0 avatar May 08 '24 06:05 cyliu0

The issue still exists with nightly-20240507. Which issue covers this right now? @StrikeW https://buildkite.com/risingwave-test/ch-benchmark-pg-cdc-shared-source/builds/41#018f5655-3749-4029-a4b8-cc0c321eb18a [screenshot]

There is no new issue for the performance problem.

The source is backpressured. Could you confirm that the CN is configured with 16GB memory? [screenshot]

It seems the bottleneck is in the state store: the large number of L0 files leads to higher sync duration. https://grafana.test.risingwave-cloud.xyz/d/EpkBw5W4k/risingwave-dev-dashboard?from=1715140652391&orgId=1&to=1715143112711&var-component=All&var-datasource=P2453400D1763B4D9&var-instance=benchmark-risingwave&var-namespace=tpc-20240508-035346&var-pod=All&var-table=All
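The L0 backlog dynamic can be sketched with a toy model (an illustration of LSM flush-vs-compaction rates in general, not Hummock's actual compaction scheduler): when barrier syncs flush SSTs into L0 faster than the compactor merges them away, the file count grows without bound, and every read and sync has to touch more files.

```python
def l0_backlog(ticks, flushes_per_tick, compactions_per_tick):
    """L0 file count over time when barrier flushes outpace compaction."""
    files, history = 0, []
    for _ in range(ticks):
        files += flushes_per_tick                     # each sync adds SSTs to L0
        files = max(0, files - compactions_per_tick)  # compactor merges some away
        history.append(files)
    return history

# Flushing 3 SSTs per tick while compaction clears only 1: the backlog
# grows linearly, so reads and syncs probe ever more overlapping L0 files.
assert l0_backlog(5, 3, 1) == [2, 4, 6, 8, 10]
# When compaction keeps up, the backlog stays flat.
assert l0_backlog(5, 2, 2) == [0, 0, 0, 0, 0]
```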

StrikeW avatar May 08 '24 06:05 StrikeW

It's 13GB for compute node memory, but that seems like enough, because the peak memory usage is around 9GB here. [screenshots]

cyliu0 avatar May 08 '24 06:05 cyliu0

I think this is a performance issue rather than a functionality bug, so I suggest creating a new issue for it and posting it to the perf working group.

StrikeW avatar May 08 '24 06:05 StrikeW