sql: SQL Stats cleanup job too slow
We have observed that the SQL Stats cleanup job sometimes runs far too slowly (12+ hours). This can be problematic when the SQL stats flush rate is high. We need to find a way to speed up the SQL Stats cleanup job.
cc: @ajwerner
Update: action items (cc @mgartner ):
- [x] Knob to tune the 128 row limit.
- #80191
- [x] Update the query to skip over garbage.
- #80341
- [x] Add timestamp filter to query to prevent deleting too many rows.
- #80515
- [ ] Remove the initial count query and instead have the loop keep issuing delete statements until fewer than 128 rows (or the setting value) are deleted; see the sketch after this list.
- Partially addressed by #80515
- [ ] Consider increasing the number of shards in the two tables in future versions.
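A rough sketch of what one tuned cleanup pass might look like once the knob and the timestamp filter are in place. The setting name (assumed to be `sql.stats.cleanup.rows_to_delete_per_txn`, per #80191) and the statement shape are illustrative paraphrases, not the exact code from the PRs above:

```sql
-- Knob from #80191 (name assumed): raise the per-transaction delete limit
-- from the old hard-coded 128.
SET CLUSTER SETTING sql.stats.cleanup.rows_to_delete_per_txn = 1024;

-- Rough shape of one cleanup pass with the timestamp filter from #80515;
-- the loop re-issues this until fewer rows than the limit are deleted.
DELETE FROM system.statement_statistics
WHERE aggregated_ts < $1   -- retention cutoff; bounds how much old data one pass touches
ORDER BY aggregated_ts
LIMIT 1024;                -- the setting value above
```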
Jira issue: CRDB-14902
My theory on why the stats job gets slower over time is that for each iteration of the loop, it scans over all not-yet-GC'd data from all previous iterations of the loop (including from this job). See https://github.com/cockroachdb/cockroach/blob/3fe4e08d7419a9071af7853a57e14527fcde3ecb/pkg/sql/sqlstats/persistedsqlstats/compaction_exec.go#L216-L223
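For illustration, the per-iteration statement has roughly this shape (a paraphrase of the linked code, with `shard` standing in for the much longer hash-shard column name):

```sql
-- Each iteration starts its scan at the beginning of the shard, so it
-- reads over the tombstones left by every previous iteration before it
-- reaches the next batch of live rows to delete.
DELETE FROM system.statement_statistics
WHERE shard = $1           -- 'shard' is a stand-in for the hash-shard column
ORDER BY aggregated_ts ASC -- oldest first
LIMIT 128;
```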
The problem with this statement is that each time it runs, it scans the shard from oldest to newest. It could, however, track the primary key of the oldest row it deleted in the previous iteration and use that as the starting point for the next one, which would cut the amount of data read dramatically. A further optimization would be to track the last (newest) row deleted in the previous iteration, but that only saves a small additional amount, one iteration's worth of scanning: if the loop ran 1000 times, it would save roughly 1/1000 of the work.
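A minimal sketch of that keyset-style variant, assuming a primary-key prefix of `(shard, aggregated_ts, fingerprint_id)`; the real key has more columns, but the resume-from-last-key idea is the same:

```sql
-- $2/$3 carry the key of the oldest row deleted by the previous iteration,
-- so the scan starts there instead of at the beginning of the shard and
-- skips the tombstones from all earlier iterations. (The further
-- optimization would pass the newest deleted key and use > instead.)
DELETE FROM system.statement_statistics
WHERE shard = $1                                    -- stand-in shard column
  AND (aggregated_ts, fingerprint_id) >= ($2, $3)   -- resume point
ORDER BY aggregated_ts, fingerprint_id
LIMIT 128;
```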
If we were still concerned about the throughput of this job after the optimization I suggest above, the next thing I'd consider is parallelizing the deletions across the shards. In the limit that could bring a substantial speedup, but it would need to be done carefully to avoid overloading the cluster. All of this reminds me: we should probably be setting admission control headers in this job so it is deprioritized relative to foreground traffic.
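A sketch of the per-shard parallelization: because the shard predicate makes each statement's key range disjoint, the job could run one delete worker per shard value concurrently (orchestration omitted; only the statement shape is shown, again with `shard` as a stand-in column name):

```sql
-- Workers 0..7 each own one hash shard; their deletes never touch the
-- same rows, so they can run in parallel transactions.
DELETE FROM system.statement_statistics
WHERE shard = 0 AND aggregated_ts < $1   -- worker 0
ORDER BY aggregated_ts LIMIT 128;

DELETE FROM system.statement_statistics
WHERE shard = 7 AND aggregated_ts < $1   -- worker 7; likewise for shards 1..6
ORDER BY aggregated_ts LIMIT 128;
```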
Manually synced with Jira