changefeedccl: parallelio metrics improvements
The ParallelIO metric changefeed.parallel_io_pending_rows seems inaccurate. For example, in a recent escalation we saw >3M pending rows for a running changefeed whose watched table received few updates.
Additionally, let's add a new gauge metric that tracks parallelio parallelism.
Jira issue: CRDB-51171
Hi @rharding6373, please add branch-* labels to identify which branch(es) this C-bug affects.
:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
cc @cockroachdb/cdc
Re: changefeed.parallel_io_pending_rows, I've looked through the code and it looks sound. ParallelIO is single-threaded, and the increments and decrements to the metric are done atomically, so I don't think there are any race conditions there. I've also tried another approach where we count all of the keys in the pending slice, but the value is exactly the same as with the current approach.
I don't see any information about this metric in the escalation ticket or in the RCA ticket.
@asg0451 Do you have any context on this?
From the tsdump from the escalation, I do see that parallel_io_pending_rows is significantly higher than sink_io_inflight. Here sink_io_inflight is scaled up 10x:
I can see this happening if a very small set of keys is receiving a large amount of updates.
I don't recall the exact context, sorry. from rachael's comment it seems like millions of pending rows was not expected based on the workload.
if you've reviewed the tsdump and think everything looks good, that's an okay outcome. can you also add the metric mentioned in the body of this issue, tracking the parallelism? (a log might also be acceptable)
Actually I think the issue is that we're keeping track of the number of messages instead of the number of keys, which would explain the overcounting because number of keys <= number of messages.
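To make the distinction concrete, here is a minimal sketch of how a message count and a distinct-key count diverge when a few hot keys receive many updates. The `request` type and `pendingCounts` function are hypothetical stand-ins for illustration, not the actual ParallelIO types:

```go
package main

import "fmt"

// request is a hypothetical stand-in for a batch queued behind in-flight work.
// It holds one key per message, so a key may appear more than once.
type request struct {
	keys []string
}

// pendingCounts returns the total number of pending messages and the number
// of distinct keys across all pending requests.
func pendingCounts(pending []request) (messages, distinctKeys int) {
	seen := make(map[string]struct{})
	for _, r := range pending {
		messages += len(r.keys)
		for _, k := range r.keys {
			seen[k] = struct{}{}
		}
	}
	return messages, len(seen)
}

func main() {
	// Two hot keys receiving repeated updates.
	pending := []request{
		{keys: []string{"a", "a", "b"}},
		{keys: []string{"a", "a"}},
		{keys: []string{"b"}},
	}
	m, k := pendingCounts(pending)
	fmt.Printf("pending messages=%d distinct keys=%d\n", m, k) // messages=6, keys=2
}
```

With a small set of hot keys, the message count can grow without bound while the distinct-key count stays flat.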
is that an issue? I don't think so. knowing how many messages is useful.
I have a couple reasons to believe that the number of keys is what the metric is intended for. See PR description for #154458
yeah but the name of the metric is pending_rows, not pending_keys. that's strong enough evidence by itself, no? maybe it's the other stuff that needs to be adjusted to reduce confusion.
which version do you think provides the most value?
TIL
Hmm I guess you can infer the number of pending keys from inflight keys and pending rows. I'll put up a PR to fix the naming then.
What does parallelism mean in this context? How does it differ from sink_io_inflight?
I believe it's referring to the num_workers setting that the feed is using
Hi @KeithCh, please add a branch-* label to identify the earliest affected branch for this C-bug