changefeedccl: parallelio metrics improvements

Open rharding6373 opened this issue 6 months ago • 2 comments

The ParallelIO metric changefeed.parallel_io_pending_rows seems inaccurate. For example, in a recent escalation we saw >3M pending rows for a running changefeed whose watched table received few updates.

Additionally, let's add a new gauge metric that tracks parallelio parallelism.

Jira issue: CRDB-51171

rharding6373 avatar Jun 02 '25 16:06 rharding6373

Hi @rharding6373, please add branch-* labels to identify which branch(es) this C-bug affects.

:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

blathers-crl[bot] avatar Jun 02 '25 16:06 blathers-crl[bot]

cc @cockroachdb/cdc

blathers-crl[bot] avatar Jun 02 '25 16:06 blathers-crl[bot]

Re: changefeed.parallel_io_pending_rows, I've looked through the code and it looks sound. ParallelIO is single-threaded, and the increments and decrements to the metric are done atomically, so I don't think there are any race conditions there. I've also tried another approach where we count all of the keys in the pending slice, but the value is exactly the same as what the current approach reports.

I don't see any information about this metric in the escalation ticket nor in the RCA ticket.

@asg0451 Do you have any context on this?

KeithCh avatar Sep 29 '25 21:09 KeithCh

From the tsdump from the escalation, I do see that parallel_io_pending_rows is significantly higher than sink_io_inflight. Here sink_io_inflight is scaled up 10x:

[Image: tsdump graph comparing parallel_io_pending_rows with sink_io_inflight scaled up 10x]

KeithCh avatar Sep 29 '25 21:09 KeithCh

I can see this happening if a very small set of keys is receiving a large amount of updates.

KeithCh avatar Sep 29 '25 21:09 KeithCh

I don't recall the exact context, sorry. From Rachael's comment it seems like millions of pending rows was not expected based on the workload.

asg0451 avatar Sep 29 '25 22:09 asg0451

if you've reviewed the tsdump and think everything looks good, that's an okay outcome. can you also add the metric mentioned in the body of this issue for tracking the parallelism? (a log might also be acceptable)

asg0451 avatar Sep 29 '25 22:09 asg0451

Actually I think the issue is that we're keeping track of the number of messages instead of the number of keys, which would explain the overcounting because number of keys <= number of messages.
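To make the messages-versus-keys distinction concrete, here is a minimal, self-contained sketch. It is not the ParallelIO code: the `message` type and `countPending` helper are hypothetical names invented for illustration. It only shows how a workload that hammers a few keys produces a pending-row count far above the pending-key count.

```go
package main

import "fmt"

// message is a hypothetical stand-in for one pending row update.
// It is not a type from changefeedccl.
type message struct {
	key   string
	value string
}

// countPending returns the number of pending messages (rows) and the
// number of distinct keys among them. keys <= rows always holds.
func countPending(pending []message) (rows, keys int) {
	seen := make(map[string]struct{}, len(pending))
	for _, m := range pending {
		seen[m.key] = struct{}{}
	}
	return len(pending), len(seen)
}

func main() {
	// Assumed hot-key workload: 1000 updates spread over only 2 keys.
	var pending []message
	for i := 0; i < 1000; i++ {
		pending = append(pending, message{
			key:   fmt.Sprintf("k%d", i%2),
			value: fmt.Sprint(i),
		})
	}
	rows, keys := countPending(pending)
	fmt.Printf("pending rows=%d, pending keys=%d\n", rows, keys)
	// prints "pending rows=1000, pending keys=2"
}
```

Under this assumption, a metric that counts messages would report 1000 while one that counts keys would report 2, which matches the direction of the gap seen in the tsdump.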

KeithCh avatar Sep 30 '25 17:09 KeithCh

is that an issue? i don't think so. knowing how many messages are pending is useful.

asg0451 avatar Sep 30 '25 18:09 asg0451

I have a couple of reasons to believe that the number of keys is what the metric is intended for. See the PR description for #154458.

KeithCh avatar Sep 30 '25 18:09 KeithCh

yeah, but the name of the metric is pending_rows, not pending_keys. that's strong enough evidence by itself, no? maybe it's the other stuff that needs to be adjusted to reduce confusion.

which version do you think provides the most value?

asg0451 avatar Sep 30 '25 20:09 asg0451

1 minute ago via email

TIL

Hmm I guess you can infer the number of pending keys from inflight keys and pending rows. I'll put up a PR to fix the naming then.

KeithCh avatar Sep 30 '25 20:09 KeithCh

What does parallelism mean in this context? How does it differ from sink_io_inflight?

KeithCh avatar Sep 30 '25 20:09 KeithCh

i believe it's referring to the num_workers setting that the feed is using

asg0451 avatar Oct 01 '25 14:10 asg0451

Hi @KeithCh, please add a branch-* label to identify the earliest affected branch for this C-bug

:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

blathers-crl[bot] avatar Oct 03 '25 19:10 blathers-crl[bot]