vitess
vitess copied to clipboard
Add sql text counts stats to `vtcombo`,`vtgate`+`vttablet`
Description
This PR adds stats for the size of query text processed by vtgate and vttablet
This is useful to understand large queries that potentially cause performance, grpc message size, bandwidth problems
Related Issue(s)
Resolves https://github.com/vitessio/vitess/issues/15929
Checklist
- [x] "Backport to:" labels have been added if this change should be back-ported to release branches
- [x] If this change is to be back-ported to previous releases, a justification is included in the PR description
- [x] Tests were added or are not required
- [ ] Did the new or modified tests pass consistently locally and on CI?
- [ ] Documentation was added or is not required
Deployment Notes
Review Checklist
Hello reviewers! :wave: Please follow this checklist when reviewing this Pull Request.
General
- [ ] Ensure that the Pull Request has a descriptive title.
- [ ] Ensure there is a link to an issue (except for internal cleanup and flaky test fixes), new features should have an RFC that documents use cases and test cases.
Tests
- [ ] Bug fixes should have at least one unit or end-to-end test, enhancement and new features should have a sufficient number of tests.
Documentation
- [ ] Apply the
release notes (needs details)label if users need to know about this change. - [ ] New features should be documented.
- [ ] There should be some code comments as to why things are implemented the way they are.
- [ ] There should be a comment at the top of each new or modified test to explain what the test does.
New flags
- [ ] Is this flag really necessary?
- [ ] Flag names must be clear and intuitive, use dashes (
-), and have a clear help text.
If a workflow is added or modified:
- [ ] Each item in
Jobsshould be named in order to mark it asrequired. - [ ] If the workflow needs to be marked as
required, the maintainer team must be notified.
Backward compatibility
- [ ] Protobuf changes should be wire-compatible.
- [ ] Changes to
_vttables and RPCs need to be backward compatible. - [ ] RPC changes should be compatible with vitess-operator
- [ ] If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
- [ ]
vtctlcommand output order should be stable andawk-able.
Opened this one with a lot of failing CI to see if others have ideas how to debug
The failures are confusing and feel unrelated to this change 🤔
Opened this one with a lot of failing CI to see if others have ideas how to debug
The failures are confusing and feel unrelated to this change 🤔
There failures are related to the changes made. I have pushed a commit that fixes some of those. You should run the test locally and fix the remaining if any.
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 68.22%. Comparing base (
0353ad4) to head (0531ce1). Report is 59 commits behind head on main.
Additional details and impacted files
@@ Coverage Diff @@
## main #15897 +/- ##
==========================================
- Coverage 68.45% 68.22% -0.24%
==========================================
Files 1559 1541 -18
Lines 196825 197330 +505
==========================================
- Hits 134736 134627 -109
- Misses 62089 62703 +614
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
Opened this one with a lot of failing CI to see if others have ideas how to debug The failures are confusing and feel unrelated to this change 🤔
There failures are related to the changes made. I have pushed a commit that fixes some of those.
@harshit-gangal indeed it is, thanks! 🤦 🙇
You should run the test locally and fix the remaining if any.
Good point, I've made a bad habit of relying on GitHub Actions CI because many go test CIs fail on my MacBook when they rely on external things (perhaps not this test). I'll give running CI locally another shot 👀
Somehow this was not obvious from the CI outputs I looked at, including this Unit test run that ends in just:
+ go-junit-report -set-exit-code
make: *** [Makefile:209: unit_test] Error 1
Error: Process completed with exit code 1.
Perhaps it's just the v15 fork/release I'm working on, but I'm used to the Unit test CI including the output of go test -v to help explain the failure. I'll see if something about that changed and if there is an alternative
Perhaps it's just the v15 fork/release I'm working on, but I'm used to the Unit test CI including the output of go test -v to help explain the failure. I'll see if something about that changed and if there is an alternative
The first step records the output, if you click on the step below in the CI you will see the output.
Perhaps it's just the v15 fork/release I'm working on, but I'm used to the Unit test CI including the output of go test -v to help explain the failure. I'll see if something about that changed and if there is an alternative
The first step records the output, if you click on the step below in the CI you will see the output.
@harshit-gangal ahh, that's where it goes! Thanks a lot ~~more on this one soon~~
I removed the website docs label. We don't actually have a page that lists all available metrics, so there's no good place to add docs for new metrics.
Since the PR has two approvals I am adding the Do Not Merge label so we don't merge this by accident before @deepthi's concern is answered.
@deepthi / @frouioui, here are the metric outputs:
tvaillancourt@REDACTED:~$ curl -s localhost:15000/debug/vars | jq .QuerySQLTextCounts
{
"Join.Select.": 3234,
"dual.Select.": 3615
}
tvaillancourt@REDACTED:~$ curl -s localhost:15000/metrics 2>&1 | grep vttablet_query_sql_text_counts
# HELP vttablet_query_sql_text_counts query sql text counts
# TYPE vttablet_query_sql_text_counts counter
vttablet_query_sql_text_counts{plan="Select",table="Join",workload=""} 3234
vttablet_query_sql_text_counts{plan="Select",table="dual",workload=""} 3650
So looks good 👍. While doing that I thought of two potential alternatives in case this didn't work: QueryTextProcessed or QueryTextRead, any thoughts?
Somewhat repeating @frouioui , will merge once @deepthi 's concern is addressed (likely you need o change
...SQL...to...Sql...and will then merge.
@shlomi-noach the potential issue of s_q_l that Deepthi pointed out didn't seem to be a problem in practice, here is the metric output:
tvaillancourt@REDACTED:~$ curl -s localhost:15000/metrics 2>&1 | grep vttablet_query_sql_text_counts
# HELP vttablet_query_sql_text_counts query sql text counts
# TYPE vttablet_query_sql_text_counts counter
vttablet_query_sql_text_counts{plan="Select",table="Join",workload=""} 3234
vttablet_query_sql_text_counts{plan="Select",table="dual",workload=""} 3650
Let me know if any change is needed. I'll add a changelog/ note once we agree on the metric name and I think we're done! 🙇
@shlomi-noach please feel free to dismiss my review if @timvaillancourt is able complete the requested changes before code freeze.
OK, now I understand the intent of this metric, it won't work as implemented. Counters should only be used for monotonically increasing numbers, prometheus will not let you reduce the value of a "counter".
@deepthi the intent of this metric is a counter that increments forever so we can infer the rate of query text processed at any given time. I don't believe there is a need to ever reduce this metric, it only increases from the time the daemon starts like other counters
EDIT: here we can see the counter is only incremented by 0 or more: vtg.sqlTextCounts.Add(statsKey, int64(len(sql)))
OK, now I understand the intent of this metric, it won't work as implemented. Counters should only be used for monotonically increasing numbers, prometheus will not let you reduce the value of a "counter".
@deepthi the intent of this metric is a counter that increments forever so we can infer the rate of query text processed at any given time. I don't believe there is a need to ever reduce this metric, it only increases from the time the daemon starts like other counters
EDIT: here we can see the counter is only incremented by 0 or more:
vtg.sqlTextCounts.Add(statsKey, int64(len(sql)))
I think the term count is misleading in this context. Count is typically how many of an entity we encountered, so for example total number of queries. Here we are adding the length of the query each time. A counter will work for this, but I do request that the name be changed to avoid this kind of confusion.
@timvaillancourt to summarize the current state: the changes make sense, and the metric looks correct. But if you could please rename it to something such as QueryProcessedTextLengths? And with that we can merge the PR.
But if you could please rename it to something such as QueryProcessedTextLengths
@shlomi-noach / @deepthi I felt Lengths is a bit awkward so I've gone with QueryTextCharactersProcessed which I think is long but very clear. I've also added this to the 20.0.0 changelog
Rerequested your review 🙇
Some minor wording changes to release notes are required to avoid user confusion. Rest LGTM. @shlomi-noach you may dismiss my review once these changes are done.